On Thursday, 7 November, 2019 04:55, Merijn Verstraaten <mer...@inconsistent.nl> wrote:
>I'm trying sample a (deterministically) random subset of a SELECT query, You cannot have something that is both RANDOM and DETERMINISTIC at the same time. >the most common solution on the internet to get random samples seems to >be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has >some question marks, since it relies on seeding RANDOM and knowing the >RANDOM function is always evaluated in the same order every query run), This must be from StackOverflow because it is a terrible idea. >but looking at the query plans this materialises the entire result set in >memory for the sort (no surprise, I can't think of anyway that could work >otherwise) which is rather undesirable if the sample size becomes large >(i.e. several million rows). Exactly. >Now, I already know different ways to implement a predicate function that >can deterministically keep elements from a stream, however that relies on >having a deterministic order for the stream. Which brings us to SQLite. I >can easily write something like: >SELECT * >FROM (...) >WHERE my_predicate_fun() >ORDER BY column1, column2,... >And this *seems* to evaluate the where clause for each row in the order >determined by ORDER BY, but this doesn't seem at all guaranteed by the >SQL spec. That is correct. ORDER BY is executed as the LAST STEP in returning the query results. However, based on availability of indexes, the optimizer my decide to utilize those indexes in order to obtain results in already sorted or partially sorted order. The various WHERE clause terms will be checked at the appropriate place to minimize the number of candidates at each step. Since my_predicate_fun() is not dependent on any item it will be executed at the last step on each candidate (that has passed all the other where clause constraints) before that candidate is passed to the sorter. >So, is this behaviour documented/guaranteed somewhere? If not, >is there some way to guarantee my where clause is evaluated for each row >in a deterministic order? Yes, and no. For a given (constant) query of a given set of (constant) data with a given set of (constant) indexes on that data, the (same version) optimizer will always arrive at the same method of answering your query. If you change the query or the data or the indexes available or the version of the optimizer, then the most efficient plan for obtaining the results for which you asked will be used, which will thereafter be constant. That is, if the input is constant, and the software is constant, then the output and the how the software arrives at that output, will be constant. See https://www.sqlite.org/queryplanner-ng.html#qpstab You might want to read the whole page. >In the simple case like above I could always just evaluate the query >without the ORDER BY, step through the entire query, and evaluate the >predicate in the application, but if I want to use this random selection >as a subquery, then that doesn't work. You have changed the query. Each query is optimized individually. >And while I'm asking questions: What if I want to do the above, but >selecting groups of rows? So, sort of like: >SELECT * >FROM (...) >GROUP BY groupColumn >HAVING my_predicate_fun(); >But where I want to return all rows in the group, rather than an >aggregate. Then you must use the group result to select the data. Use of GROUP BY/HAVING implies returning GROUPs, not rows. You can of course use the selected groups to find the matching members of the group and return those. Example: select <stuff> from <places> join (select a, b from <places> where <conditions> group by a, b having my_predicate_function()) as g on <place>.a = g.a AND <place>.b = g.b where <conditions> ; If you wish to generate a "static random sample" of the data, then why do you not just do it once and record the result, thereafter using that keyset to re-retrieve the same data whenever you need it? Eg: create table KeySet as select a.RowID as aRowID, b.RowID as bRowID, c.RowID as cRowID from a, b, c where <conditions>; and thereafter select <stuff> from KeySet join a on a.rowid == KeySet.aRowID join b on b.rowid == KeySet.bRowID join c on c.rowid == KeySet.cRowID -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users