I have been playing around with polars for a few weeks. It has an expression system that is different from pandas, but it feels really clean, and it is built on Arrow in Rust. If you need something now, it would be a good option. It does everything mentioned here, adds a lazy evaluation system, and is very fast.

pola-rs/polars: Fast multi-threaded DataFrame library in Rust | Python | Node.js (github.com) <https://github.com/pola-rs/polars>
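For a concrete sense of that expression system, here is a minimal sketch of the sum(a * b) / sum(a) computation discussed later in this thread, written against polars' lazy API; the column names and data are placeholders:

```
import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 2.0, 2.0]})

# Build the query lazily as expressions, then execute it with collect()
result = (
    df.lazy()
    .select([((pl.col("a") * pl.col("b")).sum() / pl.col("a").sum()).alias("ratio")])
    .collect()
)
```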
I just thought I would mention it.

Thanks
Matt

On Thu, Mar 31, 2022 at 8:54 AM Ian Cook <[email protected]> wrote:

There is work ongoing to implement a pandas-like interface to the Arrow C++ compute engine using Ibis [1] and Substrait [2].

The gist of this is: Python users write Ibis code; the ibis-substrait submodule of Ibis [3] compiles this into a Substrait plan; and the Arrow C++ compute engine parses the Substrait plan and executes it, returning an Arrow table.

There is a recent Voltron Data blog post co-authored by Phillip Cloud (the lead developer of Ibis) and Jacques Nadeau (the originator of Substrait) with some additional details [4].

Since Substrait is an open standard, this can all be implemented without introducing tight coupling between the interface and the engine, at least in theory. Other Substrait-compatible engines will be able to consume and execute Substrait plans, and other Substrait-compatible interfaces will be able to produce Substrait plans. There is already some work ongoing to support Substrait in other interfaces and engines, including the Arrow R interface [5], the DataFusion engine [6], and the DuckDB engine [7].

I am not sure when the Ibis-Substrait-Arrow stack is expected to reach milestones of "real-world usability." I will leave it to others to comment on that.

Thanks,
Ian

[1] https://ibis-project.org
[2] https://substrait.io
[3] https://github.com/ibis-project/ibis-substrait
[4] https://voltrondata.com/news/introducing-substrait-an-interoperable-data-to-engine-connector/
[5] https://github.com/voltrondata/substrait-r
[6] https://github.com/datafusion-contrib/datafusion-substrait
[7] https://github.com/duckdblabs/duckdb-substrait-demo
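For a rough picture of what that flow looks like from the user's side, here is a minimal sketch, assuming the SubstraitCompiler entry point shown in the ibis-substrait README and a hypothetical unbound table t:

```
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# A hypothetical unbound Ibis table (schema only, no data attached)
t = ibis.table([("a", "float64"), ("b", "float64")], name="t")

# The expression discussed below in this thread: sum(a * b) / sum(a)
expr = t.aggregate([((t.a * t.b).sum() / t.a.sum()).name("ratio")])

# Compile the Ibis expression into a Substrait plan (a protobuf message)
# that a Substrait-compatible engine can consume
compiler = SubstraitCompiler()
plan = compiler.compile(expr)
```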
On Thu, Mar 31, 2022 at 7:25 AM Suresh V <[email protected]> wrote:

Thanks Aldrin for the response. My goal is to have something like a numexpr/pandas equivalent where I can simply pass strings to get the results, and to ask whether the Arrow gods have any plans to create something like that in the near future :)

On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:

I think everything Weston said is totally good.

I just wanted to add that there may be a couple of "simpler" options available without using SQL-like approaches:
(1) just use the compute API (at least for examples such as "sum(...) / sum(...)")
(2) when converting strings to expressions, do it in 2 passes: (a) put expressions supported by the dataset API into projections and scans on the source table, then (b) use the compute API for any other expressions and apply them to the final relation from (a). So, for one of the given examples, "sum(a*b) / sum(a)", this ends up being 2 passes of applying expressions: once via the dataset API, and once via the compute API.

Note that (1) seems simple in my naive opinion, but (2) may actually be quite complicated. How much extra work it is probably depends on how you were planning on converting strings to expressions anyway. And note that you can do many iterations of (2) for using aggregates in filters or projections, etc.

----

The expression:

sum(a * b) / sum(a)

would be:

```
import pyarrow
from pyarrow import dataset
from pyarrow import compute

tab = pyarrow.Table.from_arrays(
    [
        pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
        pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
    ],
    names=['a', 'b'],
)

# pass 1: some chain of scans and projections via the dataset API
relational_result = dataset.Scanner.from_batches(
    tab.to_batches(),
    schema=tab.schema,
    columns={
        'a': dataset.field('a'),
        'product': dataset.field('a') * dataset.field('b'),
    },
).to_table()

# pass 2: a single "expression" that uses the compute API and the final
# relation from the previous "phase" (compute.divide is used because
# pyarrow scalars do not support the / operator directly)
expr_result = compute.divide(
    compute.sum(relational_result.column('product')),
    compute.sum(relational_result.column('a')),
)
```

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:

Yes and no :) Disclaimer: this answer is a little out of my wheelhouse, as I've learned relational algebra through doing, so my formal theory may be off. Anyone is welcome to correct me. Also, this answer turned into a bit of a ramble and is a bit scattershot. You may already be very familiar with some of these concepts; I'm not trying to be patronizing, but I do tend to ramble.

TL;DR: Pyarrow has some "relational algebra interfaces" today but not a lot of "dataframe interfaces". What you're asking for is a little bit more of a "dataframe" type question, and you will probably need to go beyond pyarrow to get exactly what you are asking for. That being said, pyarrow has a lot of the primitives that can solve parts and pieces of your problem.

In relational algebra there is a special class of functions called "scalar functions". These are functions that create a single value for each row, for example "less than", "addition", and "to upper case". A scalar expression is then an expression that consists only of scalar functions. Projections in dataset scanners can only contain scalar expressions. In pyarrow you can scan an in-memory table and apply a scalar expression:

```
import pyarrow as pa
import pyarrow.dataset as ds

arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
tab = pa.Table.from_arrays([arr], names=["x"])
expr = (ds.field("x") > 5) & (ds.field("x") < 8)

# This is a little bit unwieldy; we could maybe investigate a better
# API for this kind of thing
ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
                        columns={'y': expr}).to_table()
# pyarrow.Table
# y: bool
# ----
# y: [[false,false,false,false,false,true,true,false,false,false]]
```

However, the functions that you are listing (sum, avg) are not scalar functions. They do, however, fit a special class of functions known as "aggregates" (functions that consume values from multiple rows to create a single output). Aggregates in relational algebra are used in a "group by" node. In pyarrow 7 we added the ability to run group_by functions very easily on in-memory tables, but it looks like it requires at least one key column, so that isn't going to help us here.
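As a side note on that last point, the pyarrow 7 group_by API looks roughly like the following minimal sketch; the constant key column is an assumption used only to satisfy the at-least-one-key requirement:

```
import pyarrow as pa

tab = pa.table({
    "key": ["k", "k", "k", "k"],  # constant key: every row lands in one group
    "x": [1, 2, 3, 4],
})

# Aggregate over the single group; the result has columns x_sum and x_mean
tab.group_by("key").aggregate([("x", "sum"), ("x", "mean")])
```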
Both of these features are powered by the pyarrow compute functions. These are slightly lower-level compute primitives. We can use them here to get some of the values you want:

```
# Continuing from the above example
import pyarrow.compute as pc

pc.sum(tab.column("x"))
# <pyarrow.Int64Scalar: 55>
pc.mean(tab.column("x"))
# <pyarrow.DoubleScalar: 5.5>
```

But... while this does give you the capability to run functions, it doesn't give you the capability to run "expressions".

Running expressions that contain both aggregate and scalar functions is a little trickier than it may seem. This is often done by creating a relational algebra query from the expression. For example, consider the expression `sum(a * b)/sum(a)`.

We can create the query (this is totally shorthand pseudocode; the Substrait text format isn't ready yet):

    SCAN_TABLE(table) ->
    PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
    GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
    PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})

So if you wanted to create such a query plan, then you could express it in Substrait, which is a spec for expressing query plans, and use our "very new and not quite ready yet" Substrait consumer API to process that query plan. So if your goal is purely "how do I express a series of compute operations as a string", then I will point out that SQL is a very standard answer for that question, and the Substrait text format will be a new way to answer it.

---

Ok, now everything I've written has been exclusive to pyarrow. If I step back and look at your original question, "how can I evaluate a dataframe-like expression against an Arrow table", I think the answer is going to lie outside of pyarrow (some cool functions like table group by are being added, but I'm not aware of any developers whose goal is to make a full-fledged dataframe API inside of pyarrow). Fortunately, the whole point of Arrow is to make it super easy for libraries to add new functionality to your data. There may be some library out there that does this already; there are libraries out there that solve similar problems, like "how do I run this SQL query against Arrow data"; and there are probably libraries in development to tackle this. Unfortunately, I'm not knowledgeable nor qualified enough to speak on any of these.

On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:

Hi,

Is there a way to evaluate mathematical expressions against columns of a pyarrow table which is in memory already, similar to how projections work for dataset scanners?

The goal is to be able to specify a bunch of strings like sum(a * b)/sum(a), or avg(a[:10]), etc., convert these into expressions, and run them against the table.

Thanks

--
-------------------------------------------------------
Matthew Peters
[email protected]
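As a closing illustration of the original ask ("simply pass strings to get the results"), here is a minimal, deliberately naive sketch of mapping restricted strings onto pyarrow compute calls; the evaluate helper and its function-call syntax (divide(sum(...), ...) instead of infix operators) are hypothetical, not an existing API:

```
import pyarrow as pa
import pyarrow.compute as pc

def evaluate(expr_str, table):
    # Hypothetical helper: only column names and a small whitelist of
    # compute functions are visible to the expression string
    namespace = {name: table.column(name) for name in table.column_names}
    namespace.update({"sum": pc.sum, "mean": pc.mean,
                      "multiply": pc.multiply, "divide": pc.divide})
    # eval is used here for brevity; a real implementation should parse
    # the string (e.g. with the ast module) instead of trusting it
    return eval(expr_str, {"__builtins__": {}}, namespace)

tab = pa.table({"a": [1.0, 2.0, 3.0], "b": [2.0, 2.0, 2.0]})
evaluate("divide(sum(multiply(a, b)), sum(a))", tab)
# <pyarrow.DoubleScalar: 2.0>
```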
