Thanks Aldrin for the response. My goal is to have something like a numexpr/pandas-eval equivalent, where I can simply pass strings and get results back, and I'm curious whether the arrow gods have any plans to build something like that in the near future :)
On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:

> I think everything Weston said is totally good.
>
> I just wanted to add that there may be a couple of "simpler" options
> available without using SQL-like approaches:
> (1) just use the compute API (at least for examples such as "sum(...) /
> sum(...)")
> (2) when converting strings to expressions, do it in two passes: (a) put
> expressions supported by the dataset API into projections and scans on the
> source table, then (b) use the compute API for any other expressions and
> apply them to the final relation from (a). So, for one of the given
> examples, "sum(a*b) / sum(a)", this ends up being two passes of applying
> expressions: once via the dataset API and once via the compute API.
>
> Note that (1) seems simple in my naive opinion, but (2) may actually be
> quite complicated. How much extra work it is probably depends on how you
> were planning on converting strings to expressions anyway. And note that
> you can do many iterations of (2) to use aggregates in filters or
> projections, etc.
> ----
>
> The expression:
>
>     sum(a * b) / sum(a)
>
> would be:
>
> ```
> import pyarrow
> from pyarrow import dataset
> from pyarrow import compute
>
> tab = pyarrow.Table.from_arrays(
>     [
>         pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
>         pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
>     ],
>     names=['a', 'b'],
> )
>
> # some chain of scans and projections via the dataset API
> relational_result = dataset.Scanner.from_batches(
>     tab.to_batches(),
>     schema=tab.schema,
>     columns={
>         'a': dataset.field('a'),
>         'product': dataset.field('a') * dataset.field('b'),
>     },
> ).to_table()
>
> # a single "expression" that uses the compute API and the final
> # relation from the previous "phase" (pyarrow scalars don't define
> # '/', so use compute.divide rather than the operator)
> expr_result = compute.divide(
>     compute.sum(relational_result.column('product')),
>     compute.sum(relational_result.column('a')),
> )
> ```
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:
>
>> Yes and no :) Disclaimer: this answer is a little out of my wheelhouse,
>> as I've learned relational algebra by doing, so my formal theory may be
>> off. Anyone is welcome to correct me. Also, this answer turned into a
>> bit of a ramble and is a bit scattershot. You may already be very
>> familiar with some of these concepts; I'm not trying to be patronizing,
>> but I do tend to ramble.
>>
>> TL;DR: Pyarrow has some "relational algebra interfaces" today but not
>> many "dataframe interfaces". What you're asking for is more of a
>> "dataframe" type question, and you will probably need to go beyond
>> pyarrow to get exactly what you are asking for. That being said,
>> pyarrow has a lot of the primitives that can solve parts and pieces of
>> your problem.
>>
>> In relational algebra there is a special class of functions called
>> "scalar functions". These are functions that produce a single value
>> for each row, for example "less than", "addition", and "to upper case".
>> A scalar expression is then an expression that consists only of scalar
>> functions. Projections in dataset scanners can only contain scalar
>> expressions. In pyarrow you can scan an in-memory table and apply a
>> scalar expression:
>>
>> ```
>> import pyarrow as pa
>> import pyarrow.dataset as ds
>>
>> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>> tab = pa.Table.from_arrays([arr], names=["x"])
>> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
>> # This is a little unwieldy; we could maybe investigate a better API
>> # for this kind of thing
>> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>>                         columns={'y': expr}).to_table()
>> # pyarrow.Table
>> # y: bool
>> # ----
>> # y: [[false,false,false,false,false,true,true,false,false,false]]
>> ```
>>
>> However, the functions you are listing (sum, avg) are not scalar
>> functions. They do, however, fit a special class of functions known as
>> "aggregates" (functions that consume values from multiple rows to
>> produce a single output). Aggregates in relational algebra are used in
>> a "group by" node. In pyarrow 7 we added the ability to run group_by
>> functions very easily on in-memory tables, but it looks like that
>> requires at least one key column, so it isn't going to help us here.
>>
>> Both of these features are powered by the pyarrow compute functions,
>> which are slightly lower-level compute primitives. We can use these
>> here to get some of the values you want:
>>
>> ```
>> # Continuing from the above example
>> import pyarrow.compute as pc
>>
>> pc.sum(tab.column("x"))
>> # <pyarrow.Int64Scalar: 55>
>> pc.mean(tab.column("x"))
>> # <pyarrow.DoubleScalar: 5.5>
>> ```
>>
>> But while this does give you the capability to run functions, it
>> doesn't give you the capability to run "expressions".
>>
>> Running expressions that contain both aggregate and scalar functions
>> is a little trickier than it may seem. This is often done by creating
>> a relational algebra query from the expression.
>> For example, consider the expression `sum(a * b)/sum(a)`.
>>
>> We can create the query (this is totally shorthand pseudocode; the
>> Substrait text format isn't ready yet):
>>
>> SCAN_TABLE(table) ->
>>   PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
>>   GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
>>   PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})
>>
>> So if you wanted to create such a query plan, you could express it in
>> Substrait, which is a spec for expressing query plans, and use our
>> "very new and not quite ready yet" Substrait consumer API to process
>> that query plan. So if your goal is purely "how do I express a series
>> of compute operations as a string", then I will point out that SQL is
>> a very standard answer to that question, and the Substrait text format
>> will be a new way to answer it.
>>
>> ---
>>
>> Ok, now everything I've written has been exclusive to pyarrow. If I
>> step back and look at your original question, "how can I evaluate a
>> dataframe-like expression against an Arrow table", I think the answer
>> is going to lie outside of pyarrow (some cool functions like table
>> group_by are being added, but I'm not aware of any developers whose
>> goal is to make a full-fledged dataframe API inside of pyarrow).
>> Fortunately, the whole point of Arrow is to make it super easy for
>> libraries to add new functionality to your data. There may be some
>> library out there that does this already, there are libraries out
>> there that solve similar problems like "how do I run this SQL query
>> against Arrow data", and there are probably libraries in development
>> to tackle this. Unfortunately, I'm not knowledgeable nor qualified
>> enough to speak on any of these.
>>
>> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Is there a way to evaluate mathematical expressions against columns
>> > of a pyarrow table which is already in memory, similar to how
>> > projections work for dataset scanners?
>> >
>> > The goal is to specify a bunch of strings like sum(a * b)/sum(a) or
>> > avg(a[:10]) etc., convert these into expressions, and run them
>> > against the table.
>> >
>> > Thanks
