[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940974#comment-16940974 ]

Thomas Graves commented on SPARK-27396:
---------------------------------------

The main objectives have already been implemented; see the linked JIRAs. I will close this.

> SPIP: Public APIs for extended Columnar Processing Support
> -----------------------------------------------------------
>
>                 Key: SPARK-27396
>                 URL: https://issues.apache.org/jira/browse/SPARK-27396
>             Project: Spark
>          Issue Type: Epic
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> *SPIP: Columnar Processing Without Arrow Formatting Guarantees.*
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
>
> The Dataset/DataFrame API in Spark currently exposes data to users only one row at a time. The goals of this proposal are to:
> # Extend the current SQL extensions mechanism so advanced users can get access to the physical SparkPlan and manipulate it to provide columnar processing for existing operators, including shuffle. This will allow them to implement their own cost-based optimizers to decide when processing should be columnar and when it should not.
> # Make any transitions between the columnar memory layout and a row-based layout transparent to users, so operations that are not columnar see the data as rows, and operations that are columnar see the data as columns.
>
> Not requirements, but things that would be nice to have:
> # Transition the existing in-memory columnar layouts to be compatible with Apache Arrow. This would make the transformation to the Apache Arrow format a no-op; the existing formats are already very close to those layouts in many cases. This would not use the Apache Arrow Java library, but instead be compatible with the memory [layout|https://arrow.apache.org/docs/format/Layout.html], and possibly only a subset of that layout.
>
> *Q2.* What problem is this proposal NOT designed to solve?
>
> The goal of this is not ML/AI, but to provide APIs for accelerated computing in Spark, primarily targeting SQL/ETL-like workloads. ML/AI already have several mechanisms to get data in and out; those can be improved, but that will be covered in a separate SPIP.
> This is not trying to implement any of the processing itself in a columnar way, with the exception of examples for documentation.
> This does not cover exposing the underlying format of the data; the only way to get at the data in a ColumnVector is through the public APIs. Exposing the underlying format to improve efficiency will be covered in a separate SPIP.
> This is not trying to implement new ways of transferring data to external ML/AI applications; that is covered by separate SPIPs already.
> This is not trying to add generic code generation for columnar processing. Currently, code generation for columnar processing is only supported when translating columns to rows. We will continue to support this, but will not extend it as a general solution; that will be covered in a separate SPIP if we find it is helpful. For now, columnar processing will be interpreted.
> This is not trying to expose a way to get columnar data into Spark through DataSource V2 or any other similar API. That would be covered by a separate SPIP if we find it is needed.
>
> *Q3.* How is it done today, and what are the limits of current practice?
>
> The current columnar support is limited to 3 areas.
> # Internal implementations of FileFormats can optionally return a ColumnarBatch instead of rows. The code generation phase knows how to take that columnar data and iterate through it as rows for stages that want rows, which currently is almost everything. The limitations here are mostly implementation specific. The current standard is to abuse Scala's type erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The code generation can handle this because it is generating Java code, so it bypasses Scala's type checking and just casts the InternalRow to the desired ColumnarBatch. This makes it difficult for others to implement the same functionality for different processing, because they can only do it through code generation; there really is no clean separate path in the code generation for columnar vs. row-based processing. Additionally, because it is only supported through code generation, if code generation fails for any reason there is no backup. This is typically fine for input formats, but can be problematic when we get into more extensive processing.
> # When caching data, it can optionally be cached in a columnar format if the input is also columnar. This is similar to the first area and has the same limitations, because the cache acts as an
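To make the type-erasure trick described in Q3 concrete, here is a minimal sketch (simplified; the real code lives in the file-format scan and code-generation paths, and the helper names here are illustrative only):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// The scan's declared element type is InternalRow, but the partitions
// actually hold ColumnarBatch objects smuggled past the type checker.
def scanAsRows(batches: RDD[ColumnarBatch]): RDD[InternalRow] =
  batches.asInstanceOf[RDD[InternalRow]]

// Generated Java code knows the truth and casts each element back,
// then iterates the batch as rows for row-based stages.
def consumeAsRows(rows: RDD[InternalRow]): Unit =
  rows.foreachPartition { iter =>
    iter.foreach { elem =>
      val batch = elem.asInstanceOf[ColumnarBatch]
      val rowIter = batch.rowIterator() // java.util.Iterator[InternalRow]
      while (rowIter.hasNext) {
        val row = rowIter.next()
        // ... process one row at a time ...
      }
    }
  }
{code}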
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940647#comment-16940647 ]

Renjie Liu commented on SPARK-27396:
------------------------------------

[~revans2] Any update on this epic?
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851445#comment-16851445 ]

Thomas Graves commented on SPARK-27396:
---------------------------------------

The vote passed.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835014#comment-16835014 ]

Thomas Graves commented on SPARK-27396:
---------------------------------------

I just started a vote thread on this SPIP; please take a look and vote.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832776#comment-16832776 ]

Robert Joseph Evans commented on SPARK-27396:
---------------------------------------------

[~bryanc] The nice-to-have Arrow formatting is for internal use only. Any user that wants to use a generic ColumnVector will have to use the Java APIs to pull out the data and convert it to whatever format they expect, like what is already done today to go to Pandas UDFs. Eventually I would like to expose that, but we cannot really do so until we get some kind of guarantee on the stability of the Arrow memory layout. This would just be a performance boost for Pandas or anything else that uses Arrow-formatted data internally, which is why it is just a nice to have.

There are already multiple different batch-size configs in play here. When the data is first read in, the batch size is set by a config for Parquet or ORC based on the number of rows. When the data is cached, there is a separate batch-size config for that. When converting from rows to columns for Pandas, there is spark.sql.execution.arrow.maxRecordsPerBatch. So initially I was going to try to honor these as much as possible for backwards compatibility. For the translation from rows to columns, out of the box this would just be for Pandas, so I would use that config. I will likely create a new config that is more generic for all of these use cases and have the code fall back to the old configs if it is not set.

One of the hard parts is sizing the batches properly. If they are too big you run out of memory and cannot run at all, which is why I don't see them being one batch per partition. If they are too small you could lose the advantages of columnar processing. Also, the ideal size will depend on the accelerator you are using: CPU-based SIMD will want everything to fit in cache, while anything doing a DMA transfer will typically want as large a size as possible without running out of memory, and that is likely to be device memory, not host memory. So a single generic config feels like the preferable option for now, until we get more experience and see what else can be done.

Also be aware that each time we do an exchange, the data is partitioned and many very small batches may come back out. In the Facebook talk on their exchange work at Spark+AI Summit 2019, they said that the average compressed exchange size was a few hundred KB, and that would be for the entire partition. As such, anyone implementing a partitioning for an exchange probably also wants, on the receiving end, to concatenate the small batches into something that fits the size they want. But that is something for the extensions to handle, at least initially.
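For reference, the existing batch-size knobs mentioned above can be set like any other Spark SQL conf; a minimal sketch (the values shown are the defaults, and the more generic per-plugin config mentioned above did not exist yet):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("batch-size-configs")
  .master("local[*]")
  // Rows per ColumnarBatch produced by the vectorized Parquet/ORC readers.
  .config("spark.sql.parquet.columnarReaderBatchSize", "4096")
  .config("spark.sql.orc.columnarReaderBatchSize", "4096")
  // Rows per batch when caching data in the in-memory columnar store.
  .config("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
  // Rows per Arrow batch when converting rows to columns for Pandas UDFs.
  .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
  .getOrCreate()
{code}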
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832136#comment-16832136 ]

Bryan Cutler commented on SPARK-27396:
--------------------------------------

The revisions sound good to me; the proposal has a little more focus now, and I agree it's better to handle exporting columnar data in the other SPIP. I have a question on the nice-to-have point of exposing data in Arrow format. Do you mean that the Arrow Java APIs will not be publicly exposed, only access to the raw data, so the user can process the data with Arrow without further conversion? And if this is not done, the user would have to do some minor conversions from the Spark internal format to Arrow? Also, how are the batch sizes set? Is it basically one batch per partition, or can it be configured to break the data up into smaller batches?
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829631#comment-16829631 ]

Robert Joseph Evans commented on SPARK-27396:
---------------------------------------------

I have updated this SPIP to clarify some things and to reduce its scope. We are no longer trying to tackle anything around data movement to/from AI/ML systems. We are no longer trying to expose any Arrow APIs or formatting to end users. To avoid any stability issues with either the Arrow API or the Arrow memory layout specification, we are going to split that part off as a separate SPIP that can be added later. We will work with the Arrow community to see what, if any, stability guarantees we can get before putting that SPIP forward.

The only APIs we are going to expose in this first SPIP are through the Spark SQL extensions config/API, so that groups that want to do accelerated columnar ETL have an API to do it, even if it is an unstable API.

Please take another look, and if there is not much in the way of comments we will put it up for another vote.
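As a sketch of what this extensions hook looks like, here is the shape the API ultimately took in Spark 3.0 (a do-nothing rule; a real plugin would replace supported operators with columnar implementations in preColumnarTransitions):

{code:scala}
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

// A no-op columnar rule; preColumnarTransitions is where a plugin
// swaps row-based operators for its own columnar versions.
class MyColumnarRule extends ColumnarRule {
  override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
    override def apply(plan: SparkPlan): SparkPlan = plan // rewrite here
  }
}

// Registered through the SQL extensions mechanism, e.g.
//   --conf spark.sql.extensions=com.example.MyExtensions
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectColumnar(session => new MyColumnarRule)
}
{code}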
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823097#comment-16823097 ]

Thomas Graves commented on SPARK-27396:
---------------------------------------

Thanks for the questions and comments. Please also vote on the dev list email chain, subject: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support. I'm going to extend that vote by a few days to give more people time to comment, as I know it's a busy time of year.

> SPIP: Public APIs for extended Columnar Processing Support
> -----------------------------------------------------------
>
>                 Key: SPARK-27396
>                 URL: https://issues.apache.org/jira/browse/SPARK-27396
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
>
> The Dataset/DataFrame API in Spark currently exposes data to users only one row at a time. The goals of this proposal are to:
> # Expose to end users a new option of processing the data in a columnar format, multiple rows at a time, with the data organized into contiguous arrays in memory.
> # Make any transitions between the columnar memory layout and a row-based layout transparent to the end user.
> # Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.
> # Provide a plugin mechanism for columnar processing support so an advanced user could avoid data transitions between columnar and row-based processing, even through shuffles. This means we should at least support pluggable APIs so an advanced end user can implement the columnar partitioning themselves, and provide the glue necessary to shuffle data that is still in a columnar format.
> # Expose new APIs that allow advanced users or frameworks to implement columnar processing either as UDFs, or by adjusting the physical plan to do columnar processing. If the latter is too controversial we can move it to another SPIP, but we plan to implement some accelerated computing in parallel with this feature to be sure the APIs work, and without this feature that is impossible.
>
> Not requirements, but things that would be nice to have:
> # Provide default implementations for partitioning columnar data, so users don't have to.
> # Transition the existing in-memory columnar layouts to be compatible with Apache Arrow. This would make the transformation to the Apache Arrow format a no-op; the existing formats are already very close to those layouts in many cases. This would not be using the Apache Arrow Java library, but instead be compatible with the memory [layout|https://arrow.apache.org/docs/format/Layout.html], and possibly only a subset of that layout.
> # Provide a clean transition from the existing code to the new one. The existing APIs, which are public but evolving, are not that far off from what is being proposed. We should be able to create a new parallel API that can wrap the existing one. This means any file format that is trying to support columnar can still do so until we make a conscious decision to deprecate and then turn off the old APIs.
>
> *Q2.* What problem is this proposal NOT designed to solve?
>
> This is not trying to implement any of the processing itself in a columnar way, with the exception of examples for documentation, and possibly default implementations for partitioning of columnar shuffle.
>
> *Q3.* How is it done today, and what are the limits of current practice?
>
> The current columnar support is limited to 3 areas.
> # Input formats can optionally return a ColumnarBatch instead of rows. The code generation phase knows how to take that columnar data and iterate through it as rows for stages that want rows, which currently is almost everything. The limitations here are mostly implementation specific. The current standard is to abuse Scala's type erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The code generation can handle this because it is generating Java code, so it bypasses Scala's type checking and just casts the InternalRow to the desired ColumnarBatch. This makes it difficult for others to implement the same functionality for different processing, because they can only do it through code generation; there really is no clean separate path in the code generation for columnar vs. row-based processing. Additionally, because it is only supported through code generation, if code generation fails for any reason there is no backup. This is typically fine for input formats, but can be problematic when we get into more extensive processing.
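On the Arrow-compatible exchange point in Q1 above, Spark already ships an ArrowColumnVector that exposes Arrow-backed memory through the public ColumnVector API; a minimal sketch using the Arrow Java library directly (illustrative only, not part of this proposal):

{code:scala}
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.ArrowColumnVector

val allocator = new RootAllocator(Long.MaxValue)
val arrowVec = new IntVector("x", allocator)
arrowVec.allocateNew(3)
(0 until 3).foreach(i => arrowVec.setSafe(i, i * 10))
arrowVec.setValueCount(3)

// Wrapping is zero-copy: the Arrow buffers are read in place through
// the public ColumnVector API.
val sparkVec = new ArrowColumnVector(arrowVec)
assert(sparkVec.getInt(2) == 20)
sparkVec.close() // also releases the underlying Arrow vector
{code}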
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822616#comment-16822616 ]

binwei yang commented on SPARK-27396:
-------------------------------------

This is the same proposal we are working on. I have a session at Spark Summit; we could talk after it.
h4. [Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators|https://databricks.com/sparkaisummit/sessions-single-2019/?id=42]
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822517#comment-16822517 ]

Xiangrui Meng commented on SPARK-27396:
---------------------------------------

[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it would be nice to state that clearly in the SPIP. Here are the parts that confused me:
{quote}
Q1. Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.
Q5. Anyone who wants to experiment with accelerated computing, either on a CPU or GPU, will benefit as it provides a supported way to make this work instead of trying to hack something in that was not available before.
Q8. The first check for success would be to successfully transition the existing columnar code over to using this. That would be transforming and sending the data to Python for processing by Pandas UDFs.
{quote}
And if you want to propose an interface that is Arrow compatible, I don't think "end users" like data engineers/scientists will want to use it to write UDFs. Correct me if I'm wrong: your target personas are developers who understand the Arrow format and are able to plug in vectorized code. If so, it would be nice to make that clear in the SPIP too.

I'm +1 on columnar format. My doubt is how much benefit we get for ETL use cases from making it public. Your mid-term goal (Pandas UDFs) can be achieved without exposing any public API, and I mentioned that the current issue with Pandas UDFs is not the data conversion but data pipelining and per-batch overhead. You didn't give a concrete example of final success. In Q4 you mentioned "we have already completed a proof of concept that shows columnar processing can be efficiently done in Spark". Could you say more about the POC, and what do you mean by "in Spark"? Could you provide a concrete ETL use case that benefits from this public API, does vectorization, and significantly boosts performance?
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822411#comment-16822411 ]

Robert Joseph Evans commented on SPARK-27396:
---------------------------------------------

[~mengxr], my goal is to provide a framework for columnar processing, just like the SPIP says. SPARK-24579 is for data exchange; if you have issues with exposing the Arrow Java APIs, please mention it on that JIRA. My primary goal is actually not DL/ML but ETL. Columnar happens to be a common thread between them, because it has traditionally been the preferred memory layout for accelerated numerical computation.

Python already has a public columnar API: Pandas UDFs. The issue there is the overhead of doing a data exchange. And you are right about the data pipelining in the GPU model, which is why ultimately you will need some form of cost-based optimizer to determine whether the operations within a stage are enough to justify the data movement and/or data translations involved. But there are more choices for accelerated computing besides just GPUs, which is why, besides adding the framework, we are also allowing extensions to override the computation model so anyone can provide a new backend to Spark.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822367#comment-16822367 ]

Xiangrui Meng commented on SPARK-27396:
---------------------------------------

[~revans2] What would end users do with public APIs for columnar data? I want to understand the main use cases you are considering here. Are they mainly ML/DL use cases, or do you also consider general data processing? It seems to me you are considering data exchange between Spark and other systems that can utilize vectorization, more than end users writing vectorized operations directly.

My main concern is which language the columnar API is exposed in. It was a major concern in SPARK-24579. Your proposal wants to expose ArrowRecordBatch in Scala/Java, but Spark rarely exposes third-party APIs, because they break while Spark needs to maintain binary compatibility; e.g., see SPARK-4819 (Guava's Optional). Arrow is still in development. Is it possible that it might change APIs in its 1.0.0 release?

The next question is which other systems can fully utilize vectorization. I'm mainly thinking of DL/GPU use cases, where Python is the dominant language. Most Scala/Java libraries are bad at numerical computation; some call into native libraries to improve performance, but most of them already offer Python APIs (like the TensorFlow Java API vs. the TensorFlow Python API). So are there any important use cases that can benefit from a public Scala/Java columnar API but not a Python one?

PySpark does add an extra layer of serialization. We did some investigation and found that the main issue is data pipelining in GPU model inference, not the raw speed of processing data. For simple compute operations it might not be worth exchanging the data. So what if we only expose the Arrow format in PySpark? Would that suffice for the main use cases you are considering? SPARK-26412 is a proposal along that direction, which fixes some known issues in the current Pandas UDFs (per-batch initialization / data pipelining).
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819592#comment-16819592 ] Robert Joseph Evans commented on SPARK-27396: - This SPIP is to put a framework in place to be able to support columnar processing, but not to actually implement that processing. Implementations would be provided by extensions. Those extensions would have a few options for how to allocate memory for results and/or intermediate results. The contract really is just around the ColumnarBatch and ColumnVector classes, so an extension could use built-in implementations, similar to the on-heap, off-heap, and Arrow column vector implementations in the current version of Spark. We would provide a config for what the default should be, and an API to allocate one of these vectors based off of that config, but in some cases the extension may want to supply its own implementation, similar to how the ORC FileFormat currently does.
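To make the allocation contract concrete, the sketch below shows what a config-driven allocation helper could look like. This is a hedged sketch, not a shipped API: the {{allocateBatch}} helper, its signature, and passing the choice in as a boolean are all assumptions; {{OnHeapColumnVector}} and {{OffHeapColumnVector}} are the existing (internal) implementations mentioned above.

{code:scala}
// Hedged sketch: allocate a ColumnarBatch whose backing vectors are chosen
// by configuration, so extensions that do not care about the memory layout
// can stay implementation-agnostic.
import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

def allocateBatch(schema: StructType, capacity: Int, offHeap: Boolean): ColumnarBatch = {
  // allocateColumns creates one writable vector per field of the schema.
  val cols: Array[ColumnVector] =
    if (offHeap) OffHeapColumnVector.allocateColumns(capacity, schema).toArray[ColumnVector]
    else OnHeapColumnVector.allocateColumns(capacity, schema).toArray[ColumnVector]
  new ColumnarBatch(cols) // the producer fills the vectors and calls setNumRows
}
{code}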
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819181#comment-16819181 ] Kazuaki Ishizaki commented on SPARK-27396: -- I have one question regarding the low-level API. In my understanding, this SPIP proposes a low-level code generation API for each operation to exploit columnar storage. How does this SPIP support storing the result of the generated code in columnar storage? In particular, for {{genColumnarCode()}}.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819153#comment-16819153 ] Thomas Graves commented on SPARK-27396: --- Since I don't hear any strong objections to the idea, I'm going to put the SPIP up for a vote on the mailing list. We can continue discussions here or on the list.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819125#comment-16819125 ] Robert Joseph Evans commented on SPARK-27396: - [~bryanc], I see your point that if this is for data exchange the API should be {{RDD[ArrowRecordBatch]}} and {{arrow_udf}}. {{ArrowRecordBatch}} is an IPC class and is made for exchanging data, so it should work out of the box and be simpler for end users to deal with. If it is for doing in-place data processing, not sending it to another system, then I think we want something based off of {{ColumnarBatch}}. Since in-place columnar data processing is hard, initially limiting it to just the extensions API feels preferable. If others are okay with that, I will drop {{columnar_udf}} and {{RDD[ColumnarBatch]}} from this proposal, and just make sure that we have a good way of translating between {{ColumnarBatch}} and {{ArrowRecordBatch}} so we can play nicely with SPARK-24579. In the future, if we find that advanced users do want columnar processing UDFs, we can discuss ways to properly expose them at that point.
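For reference, Spark's existing {{ArrowColumnVector}} already lets Arrow memory be read through the {{ColumnVector}} API without copying, which is one half of the translation discussed above. Below is a minimal sketch; the column name "a" and the toy values are illustration only, and the reverse direction ({{ColumnarBatch}} to {{ArrowRecordBatch}}) is what this proposal would need to formalize.

{code:scala}
// Minimal sketch: wrap an Arrow vector so Spark-side code can read it as a
// ColumnVector without copying the data.
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

val allocator = new RootAllocator(Long.MaxValue)
val vec = new IntVector("a", allocator) // e.g. data received via Arrow IPC
vec.allocateNew(3)
(0 until 3).foreach(i => vec.setSafe(i, i * 10))
vec.setValueCount(3)

val batch = new ColumnarBatch(Array[ColumnVector](new ArrowColumnVector(vec)))
batch.setNumRows(3)
assert(batch.column(0).getInt(1) == 10) // read through the Spark API
{code}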
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818649#comment-16818649 ] Bryan Cutler commented on SPARK-27396: -- Thanks for this [~revans2]; overall I think the proposal sounds good, and there will be some nice benefits from increasing support for columnar processing in Spark. I saw you mentioned Pandas UDFs a couple of times, but does this SPIP also cover the related changes for the PySpark API? Some things might not translate exactly from the Scala APIs; for example, what does an {{RDD[ColumnarBatch]}} mean in Python? I also have some concerns about the APIs for goal #3, data transfers to DL/ML frameworks. This is essentially SPARK-24579, IIUC. My concern is that creating a {{columnar_udf}} might not be the best interface for this. First off, if done in PySpark the user would provide a Python function, which means all the data must be sent to the Python worker process before it is sent elsewhere. This prevents the user from doing a more optimal data exchange that might go directly from the Spark JVM to another framework, like a TensorFlow C++ kernel, skipping Python entirely. Secondly, I'm not sure that the average user will take to working with low-level {{ColumnarBatches}} in a UDF. Even if the data is in Arrow form, there are a lot of differences between the Arrow Java and Python implementations which could be confusing for the end user. I think something like a plugin interface, where specialized connectors would handle the low-level transfer and could be invoked the same way in Python or Java/Scala, might be better in the long run. Having a connector which executes a UDF would still be useful for advanced users. I don't know if this is out of scope for the SPIP, but I wouldn't want us to get stuck with a {{columnar_udf}} API that is limited and not user friendly.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817990#comment-16817990 ] Robert Joseph Evans commented on SPARK-27396: - There are actually a few public-facing APIs I think we would need to expose, corresponding to different use cases. The ML use case needs access to the data for easy transfer, but also needs to know when it has all of the data it is expecting. Typically this is done by converting the dataset into an {{RDD}}, which can expose an iterator so you know you have all of the data. The simplest, and probably the most memory-efficient, solution is to expose the underlying {{RDD[ColumnarBatch]}}. We could do this with a new API on Dataset similar to {{rdd()}} or {{javaRdd()}}, perhaps something called {{columnarRdd()}} or {{columnarJavaRdd()}}, but I would need to think about the exact name a bit. This is the use case that really drives my wanting a columnar shuffle implementation on the CPU, because the typical pattern is to load the data from a file and then repartition it for training before doing the training. Many of the input formats (Parquet and ORC) are already columnar, so we could support that entire use case without transforming the data from columns to rows. An RDD-based API, however, is far from ideal for data processing, as we would lose all of the advantages of the Dataset for query optimization and rewriting. For that use case I think we should provide a columnar UDF API, based off of the Pandas UDF APIs. At a minimum I would say we start off with a scalar columnar UDF that would look like a regular UDF, but the input, instead of being simple types, would be {{ColumnVector}}, and so would the output. But because a ColumnVector is not typed itself, we could not infer the output format from the API, so we would need the user to declare it for us, something like:
{code:scala}
// Rough sketch: the result type is declared explicitly because
// ColumnVector carries no compile-time element type.
val myUdf = columnar_udf((a: ColumnVector) => /* do something with a */, BooleanType)
{code}
In the future we could add support for the more complex forms, like grouped map and grouped aggregate UDFs, if they prove popular. These should be fairly simple to do; I don't really want to commit to them in the first version of the code, but we would want the API written with that in mind. The final API is to allow advanced framework users to add or replace the columnar processing of a physical plan. This would allow someone to add in, say, GPU support as the backend for data processing, or SIMD-optimized CPU code, or any other type of accelerated columnar processing. This would be done through the experimental and unstable `spark.sql.extensions` config and the SparkSessionExtensions API. It would primarily provide some places to insert a `Rule[SparkPlan]` into the execution phase of Catalyst so framework writers could use the APIs I described previously to implement columnar versions of the various operators. This would also allow the operators to implement columnar-specific logical plan optimizations using the same API.
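To illustrate the extensions path, here is a hedged sketch of how a framework might hook a {{Rule[SparkPlan]}} in through {{SparkSessionExtensions}}. The {{injectColumnar}}-style hook is an assumption about what the proposed injection point could look like; only the {{spark.sql.extensions}} config and {{SparkSessionExtensions}} itself exist today.

{code:scala}
// Hedged sketch of the proposed injection point; injectColumnar is a
// hypothetical name for the execution-phase hook described above.
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan

// A framework-provided rule that would replace row-based operators with
// columnar (e.g. GPU-backed) equivalents. Shown as a no-op placeholder.
case class ReplaceWithColumnarOps(session: SparkSession) extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan
}

// Registered via spark.sql.extensions=com.example.ColumnarExtensions
class ColumnarExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectColumnar(session => ReplaceWithColumnarOps(session)) // hypothetical hook
  }
}
{code}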
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817152#comment-16817152 ] Kazuaki Ishizaki commented on SPARK-27396: -- Thank you for sharing the low-level APIs for Spark implementors. Could you please share your current thoughts on the high-level APIs that Spark application developers will use from their Spark applications?
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815846#comment-16815846 ] Robert Joseph Evans commented on SPARK-27396: - [~kiszk], The exact details of some of the APIs are going to be worked out once I better understand some of the internal details of how code generation works, and how we are going to be able to support frameworks inserting columnar execution into the physical plan. I am looking at these right now, but the high-level APIs should remain more or less the same. In general the APIs should look a lot like the existing columnar APIs. There will be a ColumnarBatch and ColumnVectors, but new versions in a new package, both to allow for backwards compatibility with a clean transition and because Expression will need access to these classes and the current classes are in the wrong jar for that to work. We will likely clean up the APIs some to make them cleaner as a more formal public API. We would formalize an Arrow-formatted ColumnVector implementation for more general use and provide APIs to translate the existing columnar data into Arrow-formatted data. If we do our job right, those translations will be no-ops in many cases because the data will be Arrow formatted already. Catalyst Expression and SparkPlan will have a few new methods added to them that should follow the existing form of evaluation, etc.
{code}
def supportsColumnar: Boolean = false

// For Expression
def evalColumnar(batch: ColumnarBatch): Any = { /* throw exception: not supported */ }
def genColumnarCode(ctx: CodegenContext): ExprCode = { ... }

// For SparkPlan
def executeColumnar(): RDD[ColumnarBatch] = { ... }

// For CodegenSupport
doColumnarProduce...
doColumnarConsume...
{code}
We would then be able to look at the SparkPlan and Expressions, similar to what is done for WholeStageCodeGen, and insert transitions between columnar and row-based data formats and execution paths as needed. I hope that gives you enough detail.
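As a usage illustration, the transition insertion described in the last paragraph could be a plan rewrite along the lines sketched below. Everything proposal-specific here is an assumption: {{supportsColumnar}} and {{executeColumnar()}} are the methods proposed above, and {{ColumnarToRowExec}} is a hypothetical adapter operator that would call {{executeColumnar()}} on its child and emit rows; only {{transformUp}} is an existing TreeNode API.

{code:scala}
// Hedged sketch: walk the physical plan bottom-up and wrap columnar-capable
// operators so that row-based parents keep seeing InternalRows. A real pass
// would only insert adapters at actual row/column boundaries instead of
// wrapping every columnar operator.
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan

object InsertColumnarTransitions extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    // supportsColumnar is the proposed flag; ColumnarToRowExec is hypothetical.
    case p if p.supportsColumnar => ColumnarToRowExec(p)
  }
}
{code}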
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815812#comment-16815812 ] Kazuaki Ishizaki commented on SPARK-27396: -- Sorry for being late to comment since I am traveling. Could you please let us know the proposed API changes regarding Q1-4 and Q1-5?
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815426#comment-16815426 ] Robert Joseph Evans commented on SPARK-27396: - Thanks [~tgraves], I updated the JIRA with you as the shepherd. Does anyone have anything more to discuss, or can we call a vote on it?
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815403#comment-16815403 ] Thomas Graves commented on SPARK-27396: --- I can shepherd it.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814883#comment-16814883 ] Robert Joseph Evans commented on SPARK-27396: - This SPIP has been up for 5 days and I see 10 people following it, but there has been no discussion since. Are you still digesting the proposal? Are you just busy with an upcoming conference and haven't had time to look at it? From the previous discussion it sounded like the core of what I am proposing is not that controversial, so I would like to move forward with it sooner rather than later, but I also want to give everyone time to understand it and ask questions. Also, I am looking for someone to be the shepherd. I can technically do it, being a PMC member, but I have not been that active until recently, so to avoid any concerns I would prefer to find someone else to be the shepherd.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811183#comment-16811183 ] Robert Joseph Evans commented on SPARK-27396: - I have kept this at a high level initially, just answering the questions in the SPIP questionnaire. I am not totally sure where a design doc would fit into all of this, or an example of some of the things we want to do with the APIs. I am happy to work on design docs, or to share more details about how I see the APIs working, as needed.
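As a concrete starting point for that discussion, here is one possible shape for such an API, sketched in Scala. This is a hypothetical illustration, not a committed design: the trait and method names (ColumnarExecSketch, supportsColumnar, doExecuteColumnar) are invented for this example, while RDD, InternalRow, and ColumnarBatch are existing Spark types.

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical sketch: a physical operator advertises whether it can
// produce columnar output, and exposes an honestly typed columnar
// execution path alongside the row-based one, instead of smuggling
// batches through RDD[InternalRow] via type erasure.
trait ColumnarExecSketch {
  // Row-based path, always available as the fallback.
  def doExecute(): RDD[InternalRow]

  // Whether this operator implements the columnar path.
  def supportsColumnar: Boolean = false

  // Columnar path; only valid to call when supportsColumnar is true.
  def doExecuteColumnar(): RDD[ColumnarBatch] =
    throw new UnsupportedOperationException(
      s"${getClass.getName} does not implement columnar execution")
}
{code}

With a shape like this, the planner could insert explicit row-to-column and column-to-row transitions wherever adjacent operators disagree, which would keep the transitions transparent to end users (goal 2 above) and give non-codegen implementations a clean path around the Q3 limitation.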