Re: [I] [substrait] improve handling for user-defined functionality [datafusion]

2024-11-20 Thread via GitHub


Blizzara commented on issue #13318:
URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2487850444

   I read through the proposal again now and I'm fully in favor of it. 
Providing the ability to handle UDTs, UDFs, and the extension types would be 
useful, but also having the ability to override specific parts of the consumer 
is something I'd have liked to have multiple times, to e.g. test out things or 
to hack something together quickly before contributing upstream.
   
   I think the idea of a trait with default impls makes sense. There'll 
probably also be helper functions which don't need to be part of the trait, but 
we could make those public as well so that people can re-use as much code as 
possible while overriding just the behavior they need.
   
   Re API breaks, what you suggest makes a lot of sense to me - we can keep the 
"current API" working and quite stable while the new one is developing. I think 
that should be fine; whenever I upgrade DF there's anyways some compile-time 
breaks to take care of, and I think anyone using customized Substrait handling 
can be expected to deal with those as well 😄 Obviously will be nice to one day 
have something stable, but it doesn't need to be the first goal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [substrait] improve handling for user-defined functionality [datafusion]

2024-11-09 Thread via GitHub


Blizzara commented on issue #13318:
URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2466277492

   I'm broadly very much in agreement and excited about this! It's been on the 
back of my mind that we should provide more extendability, but haven't made any 
concrete progress towards it myself. I'll review the idea in detail next week!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [substrait] improve handling for user-defined functionality [datafusion]

2024-11-08 Thread via GitHub


vbarua commented on issue #13318:
URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465908769

cc: @westonpace @tokoko @Blizzara 
   
   As users of the Substrait consumer I'm interested to hear if any of you have 
thoughts, concerns or opinions about this before I get too far along with 
anything.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [substrait] improve handling for user-defined functionality [datafusion]

2024-11-08 Thread via GitHub


vbarua commented on issue #13318:
URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465908028

   
   What I'm hoping to drive with the following wall of text is:
   1. Agreement that this is worth improving.
   2. Feedback and suggestions for possible approaches.
   
   # Current Handling State
   The following documents the current state of handling for the various 
extension points and discusses some of the issues with them.
   
   ## Extension Relations
   Substrait has 3 relation types:
   * 
[ExtensionLeafRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L403-L407)
   * 
[ExtensionSingleRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L396-L401)
   * 
[ExtensionMultiRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L409C1-L414C2)
   
   which consume 0, 1 or more than 1 inputs respectively alongside a Protobuf 
Any detail message.
   
   The current API to decode these extensions re-uses the following method from 
the 
[SerializerRegistry](https://github.com/apache/datafusion/blob/main/datafusion/expr/src/registry.rs#L126-L142)
 trait:
   
   ```rust
   fn deserialize_logical_plan(
   &self,
   name: &str,
   bytes: &[u8],
   ) -> Result>;
   ```
   The use of this method assumes that the `detail` message maps directly to a 
`UserDefineLogicalNode`, which isn't always the case. Consider the following 
message intended for use with `ExtensionSingleRel`
   
   ```proto
   message UnnestDetail {
 repeated uint32 col_refs = 1;
   }
   ```
   there isn't enough information in the message alone to produce a valid 
`UserDefinedLogicalNode`, however if we had access to the input node it becomes 
relatively straightforward.
   
   A simpler API for this would be a hook like:
   ```rust
   fn handle_extension_single_rel(e: &ExtensionSingleRel) -> Result
   ```
   to allow users to fully own the deserialization logic.
   
   ## Extension Table
   The Substrait consumer does not provide any hooks to handle extension 
tables. An API for this could be something like:
   ```rust
   fn handle_extension_table(e: &ExtensionTable) -> Result
   ```
   
   This may be more flexible that is actually needed.
   
   ## User-Defined Types
   The Substrait consumer does not provide any hooks to handle user-defined 
types. An API for this could be something like.
   ```rust
   fn handle_user_defined_type(ud: &UserDefined) -> Result
   ```
   
   ## User-Defined Functions
   The Substrait consumer decodes Substrait functions based on name of the 
function in Substrait. The following is part of the handling code:
   ```rust
   if let Some(func) = ctx.state().scalar_functions().get(fn_name) {
   Ok(Expr::ScalarFunction(expr::ScalarFunction::new_udf(
   func.to_owned(),
   args,
   )))
   ```
   
   To map a Substrait function to a DataFusion function, users can bind the 
DataFusion implementation to the same name in the session context. This works 
reasonable well, but the fact the function resolution ignores the extension 
file can cause issues.
   
   For example, it's perfectly valid to have 2 function with the same name 
exist in different extensions. The spec defines an integer division function:
   ```yaml
   # functions_arithmetic.yaml
   %YAML 1.2
   ---
   scalar_functions:
 -
   name: "divide"
   impls:
 - args:
 - name: x
   value: i64
 - name: y
   value: i64
   return: i64
   ```
   but we may wish to provide a variant that returns a float as in MySQL:
   ```yaml
   # functions_mysql.yaml
   %YAML 1.2
   ---
   scalar_functions:
 -
   name: "divide"
 - args:
 - name: x
   value: i64
 - name: y
   value: i64
   return: fp64
   ```
   However, these cannot be distinguished by just the name.
   
   Improving the API for function resolution should probably be its own issue 
as it entails a fair bit of work.
   
   ## Advanced Extensions
   DataFusion currently ignores Advanced Extensions entirely. Advanced 
Extensions come in two forms:
   * optimizations: additional metadata which *does not* influence semantics 
and are ignorable.
   * enhancements: additional metadata which *does* influence semantics and 
cannot be ignored.
   
   In my opinion, these are the hardest types of extensions to build around 
because they contain arbitrary metadata associated with existing relations. 
This may be applied/needed before the conversion, after the conversion or 
*during* the conversion. Trying to account for all possible uses of this 
metadata is quite tricky.
   
   In practice, what has worked well for the Java library is to make the code 
for converting the various components of a rela

Re: [I] [substrait] improve handling for user-defined functionality [datafusion]

2024-11-08 Thread via GitHub


vbarua commented on issue #13318:
URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465644281

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] [substrait] improve handling for user-defined functionality [datafusion]

2024-11-08 Thread via GitHub


vbarua opened a new issue, #13318:
URL: https://github.com/apache/datafusion/issues/13318

   ### Is your feature request related to a problem or challenge?
   
   Substrait has a number of mechanisms that allow users to define custom 
functionality. These include:
   * [Extension 
Tables](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L132-L136)
 for defining new table types.
   * [Extension 
Relations](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L396-L414)
 for adding entirely new relation types.
   * [User-Defined 
Types](https://substrait.io/types/type_classes/#user-defined-types) for adding 
custom types.
   * [User-Define 
Functions](https://substrait.io/types/type_classes/#user-defined-types) for 
adding custom functions.
   * [Advanced 
Extensions](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/extensions/extensions.proto#L70-L79)
 for modifying existing relations.
   
   The current Substrait consumer has limited support for handling this custom 
functionality.
   
   ### Describe the solution you'd like
   
   As a user I would like to be able to re-use most of the DataFusion Substrait 
consumer code, but also be able to configure and/or provide handlers for our 
custom functionality.
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org