RE: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-10-17 Thread Yang, Binwei
it. The inputs to Python are 1) the data source or shuffled data, and 2) the query plan. Thanks, Binwei -Original Message- From: Jacques Nadeau Sent: Wednesday, September 8, 2021 07:06 To: dev Subject: Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decoupl

Substrait compute IR initiative [was Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines]

2021-09-14 Thread Wes McKinney
Renaming the subject to increase visibility. As we've dug deeper into this topic over the last 5-6 weeks, there have been several learnings/observations: * There are projects beyond Arrow, and which do not use Arrow at all, which could make use of portable "compute IR". This speaks to a need to

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-09-07 Thread Jacques Nadeau
As Phillip mentioned, I think there is something powerful in producing a standard serialized representation of compute operations beyond just Arrow and I'd really like to create a broader community around it. This has been something I had been independently thinking about for the last several

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-09-01 Thread Phillip Cloud
Hey everyone, As many of you know, the compute IR project has a lot of interested parties and has generated a lot of feedback. In light of some of the feedback we’ve received, we want to stress that the specification is intended to have input from many diverse points of view and that we welcome

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-30 Thread Weston Pace
My (incredibly naive) interpretation is that there are three problems to tackle. 1) How do you represent a graph and relational operators (join, union, groupby, etc.) - The PR appears to be addressing this question fairly well 2) How does a frontend query a backend to know what UDFs are

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-30 Thread Phillip Cloud
Hey everyone, There's some interesting discussion around types and where their location is in the current PR [1] (and in fact whether to store them at all). It would be great to get some community feedback on this [2] part of the PR in particular, because the choice of whether to store types at

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-27 Thread Micah Kornfield
As an FYI, Iceberg is also considering an IR in relation to view support [1]. I chimed in and pointed them to this thread and Wes's doc. Phillip and Jacques chimed in there as well. [1]

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-26 Thread Phillip Cloud
Thanks for the feedback Jacques, very helpful. In the latest version of the PR, I've tried to incorporate nearly all of these points. - I've incorporated most of what you had for dereferencing operations into the PR, and gotten rid of schemas except on Read/Write relations. - With respect to

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-23 Thread Jacques Nadeau
In a lucky turn of events, Phillip actually turned out to be in my neck of the woods on Friday so we had a chance to sit down and discuss this. To help, I actually shared something I had been working on a few months ago independently (before this discussion started). For reference: Wes PR:

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-17 Thread Phillip Cloud
On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney wrote: > Looking at Ben's alternate PR [1], having an IR that leans heavily on > memory references to an out-of-band data sidecar seems like an > approach that would substantially ratchet up the implementation > complexity as producing the IR would

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-17 Thread Benjamin Kietzman
WRT out-of-band data: if encapsulation is the priority over reuse of Buffer etc that's straightforward to accommodate by replacement with an alternative to Buffer. I have made that change to my PR in https://github.com/apache/arrow/pull/10934/commits/ebd4fc665579dd6bba29c5c4731c2350ea0fa70a > as

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-17 Thread Wes McKinney
Looking at Ben's alternate PR [1], having an IR that leans heavily on memory references to an out-of-band data sidecar seems like an approach that would substantially ratchet up the implementation complexity as producing the IR would then have the level of complexity of producing the Arrow IPC

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-16 Thread Arun Sharma
Thank you for putting together this proposal. Very exciting development. I left some comments in the RFC doc, summarized here as: * Flatbuffer is usable as a serialization-agnostic IDL (https://adsharma.github.io/flattools/) * serde library + msgpack is a worthy candidate to consider for

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-13 Thread Phillip Cloud
Hey all, Just wanted to give an update on the effort here. Ben Kietzman has created an alternative proposal to the initial design [1]. It largely overlaps with the original, but differs in a few important ways: * A big focus of the design is on flexibility, allowing producers, consumers and

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-12 Thread Julian Hyde
> Wes wrote: > > Supporting this kind of intra-application engine > heterogeneity is one of the motivations for the project. +1 The data format is the natural interface between tasks. (Defining “task” here as “something that is programmed using the IR”.) That is Arrow’s strength. So I think

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-12 Thread Wes McKinney
On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud wrote: > > On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > Couple of questions > > > > 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the > > operation "(IR,data) - engine ->

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Phillip Cloud
On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Couple of questions > > 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the > operation "(IR,data) - engine -> result" MUST be the same for all "engine"? > I think that might be a

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Jorge Cardoso Leitão
Couple of questions 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the operation "(IR,data) - engine -> result" MUST be the same for all "engine"? 2. if yes, imo we may need to worry about: * a definition of equality that implementations agree on. * agreement over what the

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-11 Thread Phillip Cloud
Thanks Wes, Great to be back working on Arrow again and engaging with the community. I am really excited about this effort. I see a number of concerns as important to address in the compute IR proposal: 1. Requirement for output types. I think that so far there have been many

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-10 Thread Wes McKinney
Thank you for all the feedback and comments on the document. I'm on vacation this week, so I'm delayed responding to everything, but I will get to it as quickly as I can. I will be at VLDB in Copenhagen next week if anyone would like to chat in person about it, and we can relay the content of any

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-10 Thread Dimitri Vorona
Hi Wes, cool initiative! Reminded me of "Building Advanced SQL Analytics From Low-Level Plan Operators" from SIGMOD 2021 (http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf), which proposes a set of building blocks for advanced aggregation. Cheers, Dimitri. On Thu, Aug 5, 2021 at 7:59 PM

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-05 Thread Julian Hyde
Wes, Thanks for this. I’ve added comments to the doc and to the PR. The biggest surprise is that this language does full relational operations. I was expecting that it would do fragments of the operations. Consider join. A distributed hybrid hash join needs to partition rows into output

[DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-02 Thread Wes McKinney
hi folks, This idea came up in passing in the past -- given that there are multiple independent efforts to develop Arrow-native query engines (and surely many more to come), it seems like it would be valuable to have a way to enable user languages (like Java, Python, R, or Rust, for example) to