RE: Producing combinations of array elements in SQL++

Müller Ingo Tue, 03 Aug 2021 10:20:11 -0700

Quick update: The "AT posVar" construct seems to produce the correct result in 
some simple test cases and are much cleaner that my previous construct (see 
[6])! In particular, the fact that I don't even need UNNEST makes some queries 
really concise :)


Small follow-up question: Is there an issue tracking the status of this 
feature? I couldn't find it in JIRA...

Cheers,
Ingo


[6] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/commit/1023d97


> -----Original Message-----
> From: Müller Ingo <[email protected]>
> Sent: Tuesday, August 3, 2021 6:03 PM
> To: [email protected]
> Subject: RE: Producing combinations of array elements in SQL++
> 
> Dear Mike, dear Dmitry,
> 
> Thanks a lot for the quick and helpful answers! I think what I was looking 
> for is
> the "AT posVar". We are using similar constructs in several other SQL
> dialects/query languages. I will try it out right away, but I am not sure yet 
> how
> we will deal with such a "beta" feature in our study.
> 
> @Mike: I indeed "just" need all combinations and the position of array 
> elements
> is just a means to get that -- their order isn't important. (In fact, queries 
> typically
> compute some metric for each combination and then the combination that
> minimizes or maximizes that metric.) If you know a better way to express 
> that, I
> am very interested!
> 
> All the best,
> Ingo
> 
> 
> > -----Original Message-----
> > From: Mike Carey <[email protected]>
> > Sent: Tuesday, August 3, 2021 5:48 PM
> > To: [email protected]
> > Subject: Re: Producing combinations of array elements in SQL++
> >
> > Ingo,
> >
> > ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array
> > has [1, 2, 2, 3] as its contents.  What would the desired result be?
> > The best approach to this will depend on what you mean by that.
> >
> >
> > Cheers,
> >
> > Mike
> >
> >
> > On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
> >
> >
> >     Hi Ingo,
> >
> >
> >
> >     1) SQL++ supports positional variables in the FROM clause, as follows:
> >
> >     FROM dataset AS ds, ds.array_field AS elementVar AT posVar
> >
> >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178
> > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>   --
> FromTerm)
> >
> >     or using UNNEST instead of “,”
> >
> >     FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
> >
> >     (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181
> > <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181>  –
> > UnnestClause)
> >
> >
> >
> >     Positional variable is bound to a position of an element inside an
> > array which is being unnested.
> >
> >
> >
> >     Here’s how your query could look like using these variables:
> >
> >     SELECT …
> >
> >     FROM base_table, base_table.array_field AS  array_element1 AT pos1,
> > base_table.array_field AS  array_element2 AT pos2
> >
> >     WHERE pos1 < pos2
> >
> >
> >
> >     2) As for window function calls. I can confirm that if there is no
> > ORDER BY sub-clause inside OVER clause then the order of the elements
> > processed by window functions (row_number(), rank(), etc) is not guaranteed.
> >
> >
> >
> >     Thanks,
> >
> >     -- Dmitry
> >
> >
> >
> >
> >
> >     From: Müller Ingo <[email protected]>
> > <mailto:[email protected]>
> >     Reply-To: "[email protected]"
> > <mailto:[email protected]>  <[email protected]>
> > <mailto:[email protected]>
> >     Date: Tuesday, August 3, 2021 at 1:39 AM
> >     To: "[email protected]"
> > <mailto:[email protected]>  <[email protected]>
> > <mailto:[email protected]>
> >     Subject: Producing combinations of array elements in SQL++
> >
> >
> >
> >  EXTERNAL EMAIL:  Use caution when opening attachments or clicking on
> > links
> >
> >
> >
> >
> >
> >     Dear SQL++ experts,
> >
> >
> >
> >     I am working on an academic study [1] comparing the expressiveness of
> > various query languages for nested data in a high-energy physics use case.
> > Among other systems, I have implemented the queries of the use case in
> > SQL++ [2]. In general, SQL++ seems to be quite well suited for those
> > queries, but one frequent pattern seems to have only cumbersome
> formulations.
> >
> >
> >
> >     In short, I am missing the functionality to produce (deterministic)
> > element identifiers for array elements. In BigQuery’s SQL dialect [3],
> > this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For
> > now, I am using a construct similar to this: “FROM (SELECT
> > array_field.*, row_number() OVER () AS idx FROM
> > base_table.array_field) AS array_element” (see full query here [4]). I
> > have two issues with this: First, it needs undebatably more characters
> > than the BigQuery version and it is arguably also more cumbersome.
> > Second, I don’t think that it is even correct since the element
> > numbers are not guaranteed to be deterministic (i.e., if I use that
> > construct in several places in the query, the same array element may
> > get different values for row_number()). The documentation on
> > row_number says “If the window order clause is omitted, the return
> > values may be unpredictable.“ However, introducing an order clause not
> > only makes the pattern even more cumbersome (the elements are objects
> > with at least four fields), it also does not guarantee a deterministic
> > order still. The documentation of the Window Order Clause says: “The
> > row_number() function returns a distinct number for each tuple. If
> > tuples are tied, the results may be unpredictable.” (To be fair, if
> > two array elements agree on all fields, I can probably treat them
> > interchangeably in my use case.)
> >
> >
> >
> >     With that introduction, two questions: (1) Can you confirm that there
> > is really no language construct better than what I currently use (such
> > as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is
> > really undeterministic in the way I use it?
> >
> >
> >
> >     I need these element identifiers for producing combinations (in the
> > mathematical sense [5]) of the elements in an array field, but maybe
> > there is a better way. In the simple case of producing all (unique)
> > pairs of elements of a particular array, I use a construct like the 
> > following:
> >
> >
> >
> >           FROM (SELECT array_field.*, row_number() OVER () AS idx FROM
> > base_table.array_field) AS array_element1,
> >
> >                (SELECT array_field.*, row_number() OVER () AS idx FROM
> > base_table. array_field) AS array_element2
> >
> >           WHERE
> >
> >             array_element1.idx < array_element2.idx
> >
> >
> >
> >     So a last question: (3) Can you think of a better way to compute
> > combinations?
> >
> >
> >
> >     Thanks a lot in advance and best regards,
> >
> >     Ingo
> >
> >
> >
> >
> >
> >     [1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason
> > Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for
> > High- Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021.
> > https://arxiv.org/abs/2104.12615
> >
> >     [2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
> > <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
> >
> >     [3] https://cloud.google.com/bigquery/docs/reference/standard-
> > sql/arrays#flattening_arrays
> > <https://cloud.google.com/bigquery/docs/reference/standard-
> > sql/arrays#flattening_arrays>
> >
> >     [4] https://github.com/RumbleDB/iris-hep-benchmark-
> > sqlpp/blob/master/queries/query-5/query.sqlpp
> > <https://github.com/RumbleDB/iris-hep-benchmark-
> > sqlpp/blob/master/queries/query-5/query.sqlpp>
> >
> >     [5] https://en.wikipedia.org/wiki/Combination
> >
> >
> >
> >

RE: Producing combinations of array elements in SQL++

Reply via email to