ptimize it. The inputs to Python are 1) data source
or shuffled data, 2) the query plan.
Thanks
Binwei
-Original Message-
From: Jacques Nadeau
Sent: Wednesday, September 8, 2021 07:06
To: dev
Subject: Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate
Representation]"
Renaming the subject to increase visibility.
As we've dug deeper into this topic over the last 5-6 weeks, there
have been several learnings/observations:
* There are projects beyond Arrow, and which do not use Arrow at all,
which could make use of portable "compute IR". This speaks to a need
to p
As Phillip mentioned, I think there is something powerful in producing a
standard serialized representation of compute operations beyond just Arrow
and I'd really like to create a broader community around it. This has been
something I had been independently thinking about for the last several
month
Hey everyone,
As many of you know, the compute IR project has a lot of interested parties
and has generated a lot of feedback. In light of some of the feedback we’ve
received, we want to stress that the specification is intended to have
input from many diverse points of view and that we welcome fo
My (incredibly naive) interpretation is that there are three problems to tackle.
1) How do you represent a graph and relational operators (join, union,
groupby, etc.)
- The PR appears to be addressing this question fairly well
2) How does a frontend query a backend to know what UDFs are supported
Hey everyone,
There's some interesting discussion around types and where their location
is in the current PR [1] (and in fact whether to store them at all).
It would be great to get some community feedback on this [2] part of the PR
in particular, because the choice of whether to store types at a
As an FYI, Iceberg is also considering an IR in relation to view support
[1]. I chimed in and pointed them to this thread and Wes's doc. Phillip
and Jacques chimed in there as well.
[1]
https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZG
Thanks for the feedback Jacques, very helpful. In the latest version of the
PR, I've tried to incorporate nearly all of these points.
- I've incorporated most of what you had for dereferencing operations into
the PR, and gotten rid of schemas except on Read/Write relations.
- With respect to prope
In a lucky turn of events, Phillip actually turned out to be in my neck of
the woods on Friday so we had a chance to sit down and discuss this. To
help, I actually shared something I had been working on a few months ago
independently (before this discussion started).
For reference:
Wes PR: https:/
On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney wrote:
> Looking at Ben's alternate PR [1], having an IR that leans heavily on
> memory references to an out-of-band data sidecar seems like an
> approach that would substantially ratchet up the implementation
> complexity as producing the IR would th
WRT out-of-band data: if encapsulation is the priority over reuse of
Buffer etc that's straightforward to accommodate by replacement
with an alternative to Buffer. I have made that change to my PR in
https://github.com/apache/arrow/pull/10934/commits/ebd4fc665579dd6bba29c5c4731c2350ea0fa70a
> as m
Looking at Ben's alternate PR [1], having an IR that leans heavily on
memory references to an out-of-band data sidecar seems like an
approach that would substantially ratchet up the implementation
complexity as producing the IR would then have the level of complexity
of producing the Arrow IPC form
Thank you for putting together this proposal. Very exciting development. I
left some comments in the RFC doc, summarized here as:
* Flatbuffer is usable as a serialization agnostic IDL (
https://adsharma.github.io/flattools/)
* serde library + msgpack is a worthy candidate to consider for
serializ
Hey all,
Just wanted to give an update on the effort here.
Ben Kietzman has created an alternative proposal to the initial design [1].
It largely overlaps with the original, but differs in a few important ways:
* A big focus of the design is on flexibility, allowing producers,
consumers and ulti
> Wes wrote:
>
> Supporting this kind of intra-application engine
> heterogeneity is one of the motivations for the project.
+1
The data format is the natural interface between tasks. (Defining “task” here
as “something that is programmed using the IR”.) That is Arrow’s strength.
So I think the
On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud wrote:
>
> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Couple of questions
> >
> > 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
> > operation "(IR,data) - engine -> result"
On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:
> Couple of questions
>
> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
> operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
>
I think that might be a non-sta
Couple of questions
1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
2. if yes, imo we may need to worry about:
* a definition of equality that implementations agree on.
* agreement over what the sem
Thanks Wes,
Great to be back working on Arrow again and engaging with the community. I
am really excited about this effort.
I think there are a number of concerns I see as important to address in the
compute IR proposal:
1. Requirement for output types.
I think that so far there's been many rea
Thank you for all the feedback and comments on the document. I'm on
vacation this week, so I'm delayed responding to everything, but I
will get to it as quickly as I can. I will be at VLDB in Copenhagen
next week if anyone would like to chat in person about it, and we can
relay the content of any d
Hi Wes,
cool initiative! Reminded me of "Building Advanced SQL Analytics From
Low-Level Plan Operators" from SIGMOD 2021 (
http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which proposes a
set of building blocks for advanced aggregation.
Cheers,
Dimitri.
On Thu, Aug 5, 2021 at 7:59 PM Juli
Wes,
Thanks for this. I’ve added comments to the doc and to the PR.
The biggest surprise is that this language does full relational operations. I
was expecting that it would do fragments of the operations. Consider join. A
distributed hybrid hash join needs to partition rows into output buffers
hi folks,
This idea came up in passing in the past -- given that there are
multiple independent efforts to develop Arrow-native query engines
(and surely many more to come), it seems like it would be valuable to
have a way to enable user languages (like Java, Python, R, or Rust,
for example) to co