RE: Producing combinations of array elements in SQL++

Müller Ingo Tue, 03 Aug 2021 09:02:57 -0700

Dear Mike, dear Dmitry,

Thanks a lot for the quick and helpful answers! I think what I was looking for 
is the "AT posVar". We are using similar constructs in several other SQL 
dialects/query languages. I will try it out right away, but I am not sure yet 
how we will deal with such a "beta" feature in our study.


@Mike: I indeed "just" need all combinations and the position of array elements 
is just a means to get that -- their order isn't important. (In fact, queries 
typically compute some metric for each combination and then the combination 
that minimizes or maximizes that metric.) If you know a better way to express 
that, I am very interested!

All the best,
Ingo


> -----Original Message-----
> From: Mike Carey <[email protected]>
> Sent: Tuesday, August 3, 2021 5:48 PM
> To: [email protected]
> Subject: Re: Producing combinations of array elements in SQL++
> 
> Ingo,
> 
> ALSO:  Perhaps clarify what you mean by "unique"?  Suppose an array has [1, 2,
> 2, 3] as its contents.  What would the desired result be?  The best approach 
> to
> this will depend on what you mean by that.
> 
> 
> Cheers,
> 
> Mike
> 
> 
> On 8/3/21 8:38 AM, Dmitry Lychagin wrote:
> 
> 
>       Hi Ingo,
> 
> 
> 
>       1) SQL++ supports positional variables in the FROM clause, as follows:
> 
>       FROM dataset AS ds, ds.array_field AS elementVar AT posVar
> 
>       (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178
> <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod178>   -- FromTerm)
> 
>       or using UNNEST instead of “,”
> 
>       FROM dataset AS ds UNNEST ds.array_field AS elementVar AT posVar
> 
>       (see https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181
> <https://asterixdb.apache.org/docs/0.9.7/SQLPP.html#prod181>  –
> UnnestClause)
> 
> 
> 
>       Positional variable is bound to a position of an element inside an array
> which is being unnested.
> 
> 
> 
>       Here’s how your query could look like using these variables:
> 
>       SELECT …
> 
>       FROM base_table, base_table.array_field AS  array_element1 AT pos1,
> base_table.array_field AS  array_element2 AT pos2
> 
>       WHERE pos1 < pos2
> 
> 
> 
>       2) As for window function calls. I can confirm that if there is no ORDER
> BY sub-clause inside OVER clause then the order of the elements processed by
> window functions (row_number(), rank(), etc) is not guaranteed.
> 
> 
> 
>       Thanks,
> 
>       -- Dmitry
> 
> 
> 
> 
> 
>       From: Müller Ingo <[email protected]>
> <mailto:[email protected]>
>       Reply-To: "[email protected]"
> <mailto:[email protected]>  <[email protected]>
> <mailto:[email protected]>
>       Date: Tuesday, August 3, 2021 at 1:39 AM
>       To: "[email protected]"
> <mailto:[email protected]>  <[email protected]>
> <mailto:[email protected]>
>       Subject: Producing combinations of array elements in SQL++
> 
> 
> 
>  EXTERNAL EMAIL:  Use caution when opening attachments or clicking on links
> 
> 
> 
> 
> 
>       Dear SQL++ experts,
> 
> 
> 
>       I am working on an academic study [1] comparing the expressiveness of
> various query languages for nested data in a high-energy physics use case.
> Among other systems, I have implemented the queries of the use case in SQL++
> [2]. In general, SQL++ seems to be quite well suited for those queries, but 
> one
> frequent pattern seems to have only cumbersome formulations.
> 
> 
> 
>       In short, I am missing the functionality to produce (deterministic)
> element identifiers for array elements. In BigQuery’s SQL dialect [3], this 
> can be
> done with “UNNEST(array_field) WITH OFFSET AS idx”. For now, I am using a
> construct similar to this: “FROM (SELECT array_field.*, row_number() OVER () 
> AS
> idx FROM base_table.array_field) AS array_element” (see full query here [4]). 
> I
> have two issues with this: First, it needs undebatably more characters than 
> the
> BigQuery version and it is arguably also more cumbersome. Second, I don’t 
> think
> that it is even correct since the element numbers are not guaranteed to be
> deterministic (i.e., if I use that construct in several places in the query, 
> the same
> array element may get different values for row_number()). The documentation
> on row_number says “If the window order clause is omitted, the return values
> may be unpredictable.“ However, introducing an order clause not only makes
> the pattern even more cumbersome (the elements are objects with at least four
> fields), it also does not guarantee a deterministic order still. The 
> documentation
> of the Window Order Clause says: “The row_number() function returns a distinct
> number for each tuple. If tuples are tied, the results may be unpredictable.” 
> (To
> be fair, if two array elements agree on all fields, I can probably treat them
> interchangeably in my use case.)
> 
> 
> 
>       With that introduction, two questions: (1) Can you confirm that there is
> really no language construct better than what I currently use (such as 
> BigQuery’s
> “OFFSET”)? (2) Can you confirm that row_number() is really undeterministic in
> the way I use it?
> 
> 
> 
>       I need these element identifiers for producing combinations (in the
> mathematical sense [5]) of the elements in an array field, but maybe there is 
> a
> better way. In the simple case of producing all (unique) pairs of elements of 
> a
> particular array, I use a construct like the following:
> 
> 
> 
>             FROM (SELECT array_field.*, row_number() OVER () AS idx FROM
> base_table.array_field) AS array_element1,
> 
>                  (SELECT array_field.*, row_number() OVER () AS idx FROM
> base_table. array_field) AS array_element2
> 
>             WHERE
> 
>               array_element1.idx < array_element2.idx
> 
> 
> 
>       So a last question: (3) Can you think of a better way to compute
> combinations?
> 
> 
> 
>       Thanks a lot in advance and best regards,
> 
>       Ingo
> 
> 
> 
> 
> 
>       [1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason
> Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for High-
> Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021.
> https://arxiv.org/abs/2104.12615
> 
>       [2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp
> <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>
> 
>       [3] https://cloud.google.com/bigquery/docs/reference/standard-
> sql/arrays#flattening_arrays
> <https://cloud.google.com/bigquery/docs/reference/standard-
> sql/arrays#flattening_arrays>
> 
>       [4] https://github.com/RumbleDB/iris-hep-benchmark-
> sqlpp/blob/master/queries/query-5/query.sqlpp
> <https://github.com/RumbleDB/iris-hep-benchmark-
> sqlpp/blob/master/queries/query-5/query.sqlpp>
> 
>       [5] https://en.wikipedia.org/wiki/Combination
> 
> 
> 
>

RE: Producing combinations of array elements in SQL++

Reply via email to