Re: [sqlite] Deterministic random sampling via SELECT

Keith Medcalf Thu, 07 Nov 2019 07:43:47 -0800

On Thursday, 7 November, 2019 04:55, Merijn Verstraaten 
<mer...@inconsistent.nl> wrote:


>I'm trying sample a (deterministically) random subset of a SELECT query,

You cannot have something that is both RANDOM and DETERMINISTIC at the same 
time.

>the most common solution on the internet to get random samples seems to
>be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has
>some question marks, since it relies on seeding RANDOM and knowing the
>RANDOM function is always evaluated in the same order every query run),

This must be from StackOverflow because it is a terrible idea.

>but looking at the query plans this materialises the entire result set in
>memory for the sort (no surprise, I can't think of anyway that could work
>otherwise) which is rather undesirable if the sample size becomes large
>(i.e. several million rows).

Exactly.

>Now, I already know different ways to implement a predicate function that
>can deterministically keep elements from a stream, however that relies on
>having a deterministic order for the stream. Which brings us to SQLite. I
>can easily write something like:

>SELECT *
>FROM (...)
>WHERE my_predicate_fun()
>ORDER BY column1, column2,...

>And this *seems* to evaluate the where clause for each row in the order
>determined by ORDER BY, but this doesn't seem at all guaranteed by the
>SQL spec. 

That is correct.  ORDER BY is executed as the LAST STEP in returning the query 
results.  However, based on availability of indexes, the optimizer my decide to 
utilize those indexes in order to obtain results in already sorted or partially 
sorted order.  The various WHERE clause terms will be checked at the 
appropriate place to minimize the number of candidates at each step.  Since 
my_predicate_fun() is not dependent on any item it will be executed at the last 
step on each candidate (that has passed all the other where clause constraints) 
before that candidate is passed to the sorter.

>So, is this behaviour documented/guaranteed somewhere? If not,
>is there some way to guarantee my where clause is evaluated for each row
>in a deterministic order?

Yes, and no.  For a given (constant) query of a given set of (constant) data 
with a given set of (constant) indexes on that data, the (same version) 
optimizer will always arrive at the same method of answering your query.  If 
you change the query or the data or the indexes available or the version of the 
optimizer, then the most efficient plan for obtaining the results for which you 
asked will be used, which will thereafter be constant.  That is, if the input 
is constant, and the software is constant, then the output and the how the 
software arrives at that output, will be constant.

See https://www.sqlite.org/queryplanner-ng.html#qpstab

You might want to read the whole page.

>In the simple case like above I could always just evaluate the query
>without the ORDER BY, step through the entire query, and evaluate the
>predicate in the application, but if I want to use this random selection
>as a subquery, then that doesn't work.

You have changed the query.  Each query is optimized individually.

>And while I'm asking questions: What if I want to do the above, but
>selecting groups of rows? So, sort of like:

>SELECT *
>FROM (...)
>GROUP BY groupColumn
>HAVING my_predicate_fun();

>But where I want to return all rows in the group, rather than an
>aggregate.

Then you must use the group result to select the data.  Use of GROUP BY/HAVING 
implies returning GROUPs, not rows.  You can of course use the selected groups 
to find the matching members of the group and return those.  Example:

select <stuff>
  from <places>
  join (select a, b 
          from <places> 
         where <conditions> 
      group by a, b 
        having my_predicate_function()) as g
    on <place>.a = g.a AND <place>.b = g.b
 where <conditions>
;

If you wish to generate a "static random sample" of the data, then why do you 
not just do it once and record the result, thereafter using that keyset to 
re-retrieve the same data whenever you need it?  Eg:

create table KeySet as
select a.RowID as aRowID, b.RowID as bRowID, c.RowID as cRowID
  from a, b, c
 where <conditions>;

and thereafter

select <stuff>
  from KeySet
  join a on a.rowid == KeySet.aRowID
  join b on b.rowid == KeySet.bRowID
  join c on c.rowid == KeySet.cRowID

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Deterministic random sampling via SELECT

Reply via email to