Re: [I] [substrait] improve handling for user-defined functionality [datafusion]
Blizzara commented on issue #13318: URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2487850444 I read through the proposal again now and I'm fully in favor of it. Providing the ability to handle UDTs, UDFs, and the extension types would be useful, but also having the ability to override specific parts of the consumer is something I'd have liked to have multiple times, to e.g. test out things or to hack something together quickly before contributing upstream. I think the idea of a trait with default impls makes sense. There'll probably also be helper functions which don't need to be part of the trait, but we could make those public as well so that people can re-use as much code as possible while overriding just the behavior they need. Re API breaks, what you suggest makes a lot of sense to me - we can keep the "current API" working and quite stable while the new one is developing. I think that should be fine; whenever I upgrade DF there's anyways some compile-time breaks to take care of, and I think anyone using customized Substrait handling can be expected to deal with those as well 😄 Obviously will be nice to one day have something stable, but it doesn't need to be the first goal. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [substrait] improve handling for user-defined functionality [datafusion]
Blizzara commented on issue #13318: URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2466277492 I'm broadly very much in agreement and excited about this! It's been on the back of my mind that we should provide more extendability, but haven't made any concrete progress towards it myself. I'll review the idea in detail next week! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [substrait] improve handling for user-defined functionality [datafusion]
vbarua commented on issue #13318: URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465908769 cc: @westonpace @tokoko @Blizzara As users of the Substrait consumer I'm interested to hear if any of you have thoughts, concerns or opinions about this before I get too far along with anything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [substrait] improve handling for user-defined functionality [datafusion]
vbarua commented on issue #13318: URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465908028 What I'm hoping to drive with the following wall of text is: 1. Agreement that this is worth improving. 2. Feedback and suggestions for possible approaches. # Current Handling State The following documents the current state of handling for the various extension points and discusses some of the issues with them. ## Extension Relations Substrait has 3 relation types: * [ExtensionLeafRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L403-L407) * [ExtensionSingleRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L396-L401) * [ExtensionMultiRel](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L409C1-L414C2) which consume 0, 1 or more than 1 inputs respectively alongside a Protobuf Any detail message. The current API to decode these extensions re-uses the following method from the [SerializerRegistry](https://github.com/apache/datafusion/blob/main/datafusion/expr/src/registry.rs#L126-L142) trait: ```rust fn deserialize_logical_plan( &self, name: &str, bytes: &[u8], ) -> Result>; ``` The use of this method assumes that the `detail` message maps directly to a `UserDefineLogicalNode`, which isn't always the case. Consider the following message intended for use with `ExtensionSingleRel` ```proto message UnnestDetail { repeated uint32 col_refs = 1; } ``` there isn't enough information in the message alone to produce a valid `UserDefinedLogicalNode`, however if we had access to the input node it becomes relatively straightforward. A simpler API for this would be a hook like: ```rust fn handle_extension_single_rel(e: &ExtensionSingleRel) -> Result ``` to allow users to fully own the deserialization logic. ## Extension Table The Substrait consumer does not provide any hooks to handle extension tables. An API for this could be something like: ```rust fn handle_extension_table(e: &ExtensionTable) -> Result ``` This may be more flexible that is actually needed. ## User-Defined Types The Substrait consumer does not provide any hooks to handle user-defined types. An API for this could be something like. ```rust fn handle_user_defined_type(ud: &UserDefined) -> Result ``` ## User-Defined Functions The Substrait consumer decodes Substrait functions based on name of the function in Substrait. The following is part of the handling code: ```rust if let Some(func) = ctx.state().scalar_functions().get(fn_name) { Ok(Expr::ScalarFunction(expr::ScalarFunction::new_udf( func.to_owned(), args, ))) ``` To map a Substrait function to a DataFusion function, users can bind the DataFusion implementation to the same name in the session context. This works reasonable well, but the fact the function resolution ignores the extension file can cause issues. For example, it's perfectly valid to have 2 function with the same name exist in different extensions. The spec defines an integer division function: ```yaml # functions_arithmetic.yaml %YAML 1.2 --- scalar_functions: - name: "divide" impls: - args: - name: x value: i64 - name: y value: i64 return: i64 ``` but we may wish to provide a variant that returns a float as in MySQL: ```yaml # functions_mysql.yaml %YAML 1.2 --- scalar_functions: - name: "divide" - args: - name: x value: i64 - name: y value: i64 return: fp64 ``` However, these cannot be distinguished by just the name. Improving the API for function resolution should probably be its own issue as it entails a fair bit of work. ## Advanced Extensions DataFusion currently ignores Advanced Extensions entirely. Advanced Extensions come in two forms: * optimizations: additional metadata which *does not* influence semantics and are ignorable. * enhancements: additional metadata which *does* influence semantics and cannot be ignored. In my opinion, these are the hardest types of extensions to build around because they contain arbitrary metadata associated with existing relations. This may be applied/needed before the conversion, after the conversion or *during* the conversion. Trying to account for all possible uses of this metadata is quite tricky. In practice, what has worked well for the Java library is to make the code for converting the various components of a rela
Re: [I] [substrait] improve handling for user-defined functionality [datafusion]
vbarua commented on issue #13318: URL: https://github.com/apache/datafusion/issues/13318#issuecomment-2465644281 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] [substrait] improve handling for user-defined functionality [datafusion]
vbarua opened a new issue, #13318: URL: https://github.com/apache/datafusion/issues/13318 ### Is your feature request related to a problem or challenge? Substrait has a number of mechanisms that allow users to define custom functionality. These include: * [Extension Tables](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L132-L136) for defining new table types. * [Extension Relations](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/algebra.proto#L396-L414) for adding entirely new relation types. * [User-Defined Types](https://substrait.io/types/type_classes/#user-defined-types) for adding custom types. * [User-Define Functions](https://substrait.io/types/type_classes/#user-defined-types) for adding custom functions. * [Advanced Extensions](https://github.com/substrait-io/substrait/blob/9cccb04fba336489b70ed42b71f73a0a1e34f9f5/proto/substrait/extensions/extensions.proto#L70-L79) for modifying existing relations. The current Substrait consumer has limited support for handling this custom functionality. ### Describe the solution you'd like As a user I would like to be able to re-use most of the DataFusion Substrait consumer code, but also be able to configure and/or provide handlers for our custom functionality. ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org