Thanks Aldrin for the response. My goal is to have something like a numexpr/pandas-eval equivalent, where I can simply pass strings and get results back, and I'm curious whether the arrow gods have any plans to build something like that in the near future :)
On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:

> I think everything Weston said is totally good.
>
> I just wanted to add that there may be a couple of "simpler" options
> available without using SQL-like approaches:
> (1) just use the compute API (at least for examples such as "sum(...) /
> sum(...)")
> (2) when converting strings to expressions, do it in two passes: (a) put
> expressions supported by the dataset API into projections and scans on the
> source table, then (b) use the compute API for any other expressions and
> apply them to the final relation from (a). So, for one of the given
> examples, "sum(a*b) / sum(a)", this ends up being two passes of applying
> expressions: once via the dataset API and once via the compute API.
>
> Note that (1) seems simple in my naive opinion, but (2) may actually be
> quite complicated. How much extra work it is probably depends on how you
> were planning on converting strings to expressions anyway. And note that
> you can do many iterations of (2) to use aggregates in filters or
> projections, etc.
> ----
>
> The expression:
>
>     sum(a * b) / sum(a)
>
> would be:
>
> ```
> import pyarrow
> from pyarrow import dataset
> from pyarrow import compute
>
> tab = pyarrow.Table.from_arrays(
>     [
>         pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
>         pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
>     ],
>     names=['a', 'b'],
> )
>
> # some chain of scans and projections via the dataset API
> relational_result = dataset.Scanner.from_batches(
>     tab.to_batches(),
>     schema=tab.schema,
>     columns={
>         'a': dataset.field('a'),
>         'product': dataset.field('a') * dataset.field('b'),
>     },
> ).to_table()
>
> # a single "expression" that uses the compute API and the final
> # relation from the previous "phase" (pyarrow scalars don't define
> # '/', so use compute.divide rather than the operator)
> expr_result = compute.divide(
>     compute.sum(relational_result.column('product')),
>     compute.sum(relational_result.column('a')),
> )
> ```
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:
>
>> Yes and no :) Disclaimer: this answer is a little out of my wheelhouse,
>> as I've learned relational algebra by doing, so my formal theory may be
>> off. Anyone is welcome to correct me. Also, this answer turned into a
>> bit of a ramble and is a bit scattershot. You may already be very
>> familiar with some of these concepts; I'm not trying to be patronizing,
>> but I do tend to ramble.
>>
>> TL;DR: Pyarrow has some "relational algebra interfaces" today but not
>> many "dataframe interfaces". What you're asking for is more of a
>> "dataframe" type question, and you will probably need to go beyond
>> pyarrow to get exactly what you are asking for. That being said,
>> pyarrow has a lot of the primitives that can solve parts and pieces of
>> your problem.
>>
>> In relational algebra there is a special class of functions called
>> "scalar functions". These are functions that produce a single value
>> for each row, for example "less than", "addition", and "to upper case".
>> A scalar expression is then an expression that consists only of scalar
>> functions. Projections in dataset scanners can only contain scalar
>> expressions. In pyarrow you can scan an in-memory table and apply a
>> scalar expression:
>>
>> ```
>> import pyarrow as pa
>> import pyarrow.dataset as ds
>>
>> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>> tab = pa.Table.from_arrays([arr], names=["x"])
>> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
>> # This is a little unwieldy; we could maybe investigate a better API
>> # for this kind of thing
>> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>>                         columns={'y': expr}).to_table()
>> # pyarrow.Table
>> # y: bool
>> # ----
>> # y: [[false,false,false,false,false,true,true,false,false,false]]
>> ```
>>
>> However, the functions you are listing (sum, avg) are not scalar
>> functions. They do, however, fit a special class of functions known as
>> "aggregates" (functions that consume values from multiple rows to
>> produce a single output). Aggregates in relational algebra are used in
>> a "group by" node. In pyarrow 7 we added the ability to run group_by
>> functions very easily on in-memory tables, but it looks like that
>> requires at least one key column, so it isn't going to help us here.
>>
>> Both of these features are powered by the pyarrow compute functions,
>> which are slightly lower-level compute primitives. We can use these
>> here to get some of the values you want:
>>
>> ```
>> # Continuing from the above example
>> import pyarrow.compute as pc
>>
>> pc.sum(tab.column("x"))
>> # <pyarrow.Int64Scalar: 55>
>> pc.mean(tab.column("x"))
>> # <pyarrow.DoubleScalar: 5.5>
>> ```
>>
>> But while this does give you the capability to run functions, it
>> doesn't give you the capability to run "expressions".
>>
>> Running expressions that contain both aggregate and scalar functions
>> is a little trickier than it may seem. This is often done by creating
>> a relational algebra query from the expression.
>> For example, consider the expression `sum(a * b)/sum(a)`.
>>
>> We can create the query (this is totally shorthand pseudocode; the
>> Substrait text format isn't ready yet):
>>
>> SCAN_TABLE(table) ->
>>   PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
>>   GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
>>   PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})
>>
>> So if you wanted to create such a query plan, you could express it in
>> Substrait, which is a spec for expressing query plans, and use our
>> "very new and not quite ready yet" Substrait consumer API to process
>> that query plan. So if your goal is purely "how do I express a series
>> of compute operations as a string", then I will point out that SQL is
>> a very standard answer to that question, and the Substrait text format
>> will be a new way to answer it.
>>
>> ---
>>
>> Ok, now everything I've written has been exclusive to pyarrow. If I
>> step back and look at your original question, "how can I evaluate a
>> dataframe-like expression against an Arrow table", I think the answer
>> is going to lie outside of pyarrow (some cool functions like table
>> group_by are being added, but I'm not aware of any developers whose
>> goal is to make a full-fledged dataframe API inside of pyarrow).
>> Fortunately, the whole point of Arrow is to make it super easy for
>> libraries to add new functionality to your data. There may be some
>> library out there that does this already, there are libraries out
>> there that solve similar problems like "how do I run this SQL query
>> against Arrow data", and there are probably libraries in development
>> to tackle this. Unfortunately, I'm not knowledgeable nor qualified
>> enough to speak on any of these.
>>
>> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Is there a way to evaluate mathematical expressions against columns
>> > of a pyarrow table which is already in memory, similar to how
>> > projections work for dataset scanners?
>> >
>> > The goal is to specify a bunch of strings like sum(a * b)/sum(a) or
>> > avg(a[:10]) etc., convert these into expressions, and run them
>> > against the table.
>> >
>> > Thanks
