There is work ongoing to implement a pandas-like interface to the Arrow C++ compute engine using Ibis [1] and Substrait [2]. The gist of this is: Python users write Ibis code; the ibis-substrait submodule of Ibis [3] compiles this into a Substrait plan; and the Arrow C++ compute engine parses the Substrait plan and executes it, returning an Arrow table.
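
To give a rough idea of the producer side of that pipeline, here is a sketch of compiling an Ibis expression (the sum(a * b) / sum(a) example from earlier in this thread) down to a Substrait plan. Treat it as a sketch only: the unbound table definition is made up for illustration, and since ibis-substrait is young, its import paths and API details may change.

```
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# Define an unbound table and an aggregation over it
t = ibis.table([("a", "int64"), ("b", "int64")], name="t")
expr = t.aggregate(result=(t.a * t.b).sum() / t.a.sum())

# Compile the Ibis expression into a Substrait plan (a protobuf message)
compiler = SubstraitCompiler()
plan = compiler.compile(expr)
```

Since the resulting plan is just a serialized protobuf, any Substrait-consuming engine, not only Arrow's, should in principle be able to execute it.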
There is a recent Voltron Data blog post co-authored by Phillip Cloud (the lead developer of Ibis) and Jacques Nadeau (the originator of Substrait) with some additional details [4].

Since Substrait is an open standard, this can all be implemented without introducing tight coupling between the interface and the engine, at least in theory: other Substrait-compatible engines will be able to consume and execute Substrait plans, and other Substrait-compatible interfaces will be able to produce Substrait plans. There is already some work ongoing to support Substrait in other interfaces and engines, including the Arrow R interface [5], the DataFusion engine [6], and the DuckDB engine [7].

I am not sure when the Ibis-Substrait-Arrow stack is expected to reach milestones of "real-world usability." I will leave it to others to comment on that.

Thanks,
Ian

[1] https://ibis-project.org
[2] https://substrait.io
[3] https://github.com/ibis-project/ibis-substrait
[4] https://voltrondata.com/news/introducing-substrait-an-interoperable-data-to-engine-connector/
[5] https://github.com/voltrondata/substrait-r
[6] https://github.com/datafusion-contrib/datafusion-substrait
[7] https://github.com/duckdblabs/duckdb-substrait-demo

On Thu, Mar 31, 2022 at 7:25 AM Suresh V <[email protected]> wrote:
>
> Thanks Aldrin for the response. My goal is to have something like a
> numexpr/pandas equivalent where I can simply pass strings to get the
> results, and to find out whether the Arrow gods have any plans to
> create something like that in the near future :)
>
> On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:
>>
>> I think everything Weston said is totally good.
>>
>> I just wanted to add that there may be a couple of "simpler" options
>> available without using SQL-like approaches:
>> (1) just use the compute API (at least for examples such as
>> "sum(...) / sum(...)")
>> (2) when converting strings to expressions, do it in 2 passes: (a)
>> put expressions supported by the dataset API into projections and
>> scans on the source table, then (b) use the compute API for any other
>> expressions and apply them to the final relation from (a). So, for
>> one of the given examples, "sum(a*b) / sum(a)", this ends up being 2
>> passes of applying expressions: once via the dataset API, and once
>> via the compute API.
>>
>> Note that (1) seems simple in my naive opinion, but (2) may actually
>> be quite complicated. How much extra work it is probably depends on
>> how you were planning on converting strings to expressions anyway.
>> And note that you can do many iterations of (2) to use aggregates in
>> filters or projections, etc.
>>
>> ----
>>
>> The expression:
>>
>> sum(a * b) / sum(a)
>>
>> would be:
>>
>> ```
>> import pyarrow
>> from pyarrow import dataset
>> from pyarrow import compute
>>
>> tab = pyarrow.Table.from_arrays(
>>     [
>>          pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>         ,pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
>>     ]
>>     ,names=['a', 'b']
>> )
>>
>> # some chain of scans and projections via the dataset API
>> relational_result = dataset.Scanner.from_batches(
>>     tab.to_batches()
>>     ,schema=tab.schema
>>     ,columns={
>>          'a': dataset.field('a')
>>         ,'product': dataset.field('a') * dataset.field('b')
>>     }
>> ).to_table()
>>
>> # a single "expression" that uses the compute API and the final
>> # relation from the previous "phase" (note: pyarrow scalars do not
>> # support the "/" operator, so we call compute.divide explicitly)
>> expr_result = compute.divide(
>>      compute.sum(relational_result.column('product'))
>>     ,compute.sum(relational_result.column('a'))
>> )
>> ```
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:
>>>
>>> Yes and no :) Disclaimer: this answer is a little out of my
>>> wheelhouse, as I've learned relational algebra through doing, so my
>>> formal theory may be off. Anyone is welcome to correct me. Also,
>>> this answer turned into a bit of a ramble and is a bit scattershot.
>>> You may already be very familiar with some of these concepts. I'm
>>> not trying to be patronizing, but I do tend to ramble.
>>>
>>> TL;DR: Pyarrow has some "relational algebra interfaces" today but
>>> not a lot of "dataframe interfaces". What you're asking for is a
>>> little bit more of a "dataframe" type question, and you will
>>> probably need to go beyond pyarrow to get exactly what you are
>>> asking for. That being said, pyarrow has a lot of the primitives
>>> that can solve parts and pieces of your problem.
>>>
>>> In relational algebra there is a special class of functions called
>>> "scalar functions". These are functions that create a single value
>>> for each row. For example, "less than", "addition", and "to upper
>>> case". A scalar expression is then an expression that consists only
>>> of scalar functions. Projections in dataset scanners can only
>>> contain scalar expressions. In pyarrow you can scan an in-memory
>>> table and apply a scalar expression:
>>>
>>> ```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>>
>>> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> tab = pa.Table.from_arrays([arr], names=["x"])
>>> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
>>> # This is a little bit unwieldy; we could maybe investigate a
>>> # better API for this kind of thing
>>> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>>>                         columns={'y': expr}).to_table()
>>> # pyarrow.Table
>>> # y: bool
>>> # ----
>>> # y: [[false,false,false,false,false,true,true,false,false,false]]
>>> ```
>>>
>>> However, the functions that you are listing (sum, avg) are not
>>> scalar functions. They do, however, fit a special class of functions
>>> known as "aggregates" (functions that consume values from multiple
>>> rows to create a single output). Aggregates in relational algebra
>>> are used in a "group by" node. In pyarrow 7 we added the ability to
>>> run group_by functions very easily on in-memory tables, but it looks
>>> like it requires at least one key column, so that isn't going to
>>> help us here.
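>>>
>>> (For illustration, here is roughly what that table group_by API
>>> looks like. Since it needs at least one key column, this sketch
>>> fakes one with a constant "key" column; that dummy column is my own
>>> invention for the example, not something the API requires.)
>>>
>>> ```
>>> # Continuing from the example above: add a constant key column,
>>> # then aggregate the "x" column grouped by that single key
>>> keyed = tab.append_column("key", pa.array([1] * tab.num_rows))
>>> keyed.group_by("key").aggregate([("x", "sum"), ("x", "mean")])
>>> # pyarrow.Table
>>> # x_sum: int64
>>> # x_mean: double
>>> # key: int64
>>> ```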
>>>
>>> Both of these features are powered by the pyarrow compute functions.
>>> These are slightly lower level compute primitives. We can use these
>>> here to get some of the values you want:
>>>
>>> ```
>>> # Continuing from above example
>>> import pyarrow.compute as pc
>>>
>>> pc.sum(tab.column("x"))
>>> # <pyarrow.Int64Scalar: 55>
>>> pc.mean(tab.column("x"))
>>> # <pyarrow.DoubleScalar: 5.5>
>>> ```
>>>
>>> But...while this does give you the capability to run functions, this
>>> doesn't give you the capability to run "expressions".
>>>
>>> Running expressions that contain both aggregate and scalar functions
>>> is a little trickier than it may seem. This is often done by
>>> creating a relational algebra query from the expression. For
>>> example, consider the expression `sum(a * b)/sum(a)`.
>>>
>>> We can create the query (this is totally shorthand pseudocode; the
>>> Substrait text format isn't ready yet):
>>>
>>> SCAN_TABLE(table) ->
>>>   PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
>>>   GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
>>>   PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})
>>>
>>> So if you wanted to create such a query plan, you could express it
>>> in Substrait, which is a spec for expressing query plans, and use
>>> our "very new and not quite ready yet" Substrait consumer API to
>>> process that query plan. So if your goal is purely "how do I express
>>> a series of compute operations as a string", then I will point out
>>> that SQL is a very standard answer for that question, and the
>>> Substrait text format will be a new way to answer that question.
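>>>
>>> (To make that concrete: the consumer surface is still settling, but
>>> it will look roughly like the sketch below. Here plan_bytes stands
>>> in for a serialized Substrait plan produced elsewhere, e.g. by a
>>> tool like ibis-substrait; the module path and signature may well
>>> change before this is ready for real use.)
>>>
>>> ```
>>> import pyarrow as pa
>>> import pyarrow.substrait as pa_substrait
>>>
>>> # plan_bytes: a serialized Substrait plan (protobuf), produced by
>>> # some Substrait producer -- hypothetical input for this sketch
>>> reader = pa_substrait.run_query(pa.py_buffer(plan_bytes))
>>> result_table = reader.read_all()
>>> ```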
>>>
>>> ---
>>>
>>> Ok, now everything I've written has been exclusive to pyarrow. If I
>>> step back and look at your original question, "how can I evaluate a
>>> dataframe-like expression against an Arrow table", I think the
>>> answer is going to lie outside of pyarrow (some cool functions like
>>> table group by are being added, but I'm not aware of any developers
>>> whose goal is to make a full-fledged dataframe API inside of
>>> pyarrow). Fortunately, the whole point of Arrow is to make it super
>>> easy for libraries to add new functionality to your data. There may
>>> be some library out there that does this already, there are
>>> libraries out there that solve similar problems like "how do I run
>>> this SQL query against Arrow data", and there are probably libraries
>>> in development to tackle this. Unfortunately, I'm not knowledgeable
>>> nor qualified enough to speak on any of these.
>>>
>>> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Is there a way to evaluate mathematical expressions against
>>> > columns of a pyarrow table which is in memory already, similar to
>>> > how projections work for dataset scanners?
>>> >
>>> > The goal is to be able to specify a bunch of strings like
>>> > sum(a * b)/sum(a), or avg(a[:10]), etc., convert these into
>>> > expressions, and run them against the table.
>>> >
>>> > Thanks
>>> >