Ingo,

Very interesting!

QUESTION:  Is order important in your use case, or do you "just" need all unique pairs of elements from the array (in nor particular order).  If order is just a means to that end in your use case, yes, there are easier/lower-cost ways.  :-)

BTW, SQL++ has an unpublished AT extension to the FROM clause that is very much like what BigQuery has - inherited from its AQL (XQuery-based) roots - but it is buggy in the parallel case (clusters) so it's not advertised at the moment. There is a fix on a side branch that a UC Riverside student has, but it's not checked in or well-tested yet.  (What's there might actually work for your limited case though - hmm - but if order isn't needed, there are better ways.)

Cheers,

Mike

On 8/3/21 1:39 AM, Müller Ingo wrote:

Dear SQL++ experts,

I am working on an academic study [1] comparing the expressiveness of various query languages for nested data in a high-energy physics use case. Among other systems, I have implemented the queries of the use case in SQL++ [2]. In general, SQL++ seems to be quite well suited for those queries, but one frequent pattern seems to have only cumbersome formulations.

In short, I am missing the functionality to produce (deterministic) element identifiers for array elements. In BigQuery’s SQL dialect [3], this can be done with “UNNEST(array_field) WITH OFFSET AS idx”. For now, I am using a construct similar to this: “FROM (SELECT array_field.*, row_number() OVER () AS idx FROM base_table.array_field) AS array_element” (see full query here [4]). I have two issues with this: First, it needs undebatably more characters than the BigQuery version and it is arguably also more cumbersome. Second, I don’t think that it is even correct since the element numbers are not guaranteed to be deterministic (i.e., if I use that construct in several places in the query, the same array element may get different values for row_number()). The documentation on row_number says “If the window order clause is omitted, the return values may be unpredictable.“ However, introducing an order clause not only makes the pattern even more cumbersome (the elements are objects with at least four fields), it also does not guarantee a deterministic order still. The documentation of the Window Order Clause says: “The row_number() function returns a distinct number for each tuple. If tuples are tied, the results may be unpredictable.” (To be fair, if two array elements agree on all fields, I can probably treat them interchangeably in my use case.)

With that introduction, two questions: (1) Can you confirm that there is really no language construct better than what I currently use (such as BigQuery’s “OFFSET”)? (2) Can you confirm that row_number() is really undeterministic in the way I use it?

I need these element identifiers for producing combinations (in the mathematical sense [5]) of the elements in an array field, but maybe there is a better way. In the simple case of producing all (unique) pairs of elements of a particular array, I use a construct like the following:

      FROM (SELECT array_field.*, row_number() OVER () AS idx FROM base_table.array_field) AS array_element1,

           (SELECT array_field.*, row_number() OVER () AS idx FROM base_table. array_field) AS array_element2

      WHERE

        array_element1.idx < array_element2.idx

So a last question: (3) Can you think of a better way to compute combinations?

Thanks a lot in advance and best regards,

Ingo

[1] Dan Graur, Ingo Müller, Ghislain Fourny, Gordon T. Watts, Mason Proffitt, Gustavo Alonso. "Evaluating Query Languages and Systems for High-​Energy Physics Data." arXiv: 2104.12615 [cs.DB], 2021. https://arxiv.org/abs/2104.12615 <https://arxiv.org/abs/2104.12615>

[2] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp>

[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays <https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays>

[4] https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp <https://github.com/RumbleDB/iris-hep-benchmark-sqlpp/blob/master/queries/query-5/query.sqlpp>

[5] https://en.wikipedia.org/wiki/Combination <https://en.wikipedia.org/wiki/Combination>

Reply via email to