Re: [sqlite] Deterministic random sampling via SELECT
On 7 Nov 2019, at 20:47, Chris Peachment wrote: > 1. generate a list of pseudo-random numbers, using a pre-defined > seed value, over the range 1 .. count(*) of records in table, > > 2. use that list as record id values to select the desired subset > of the data in the table. > > This would be done in two separate operations, possibly with a > storage of the generated numbers in a separate table which could > be used in the query of the main data. Yeah, this and some of the other ideas were some things I considered as fallback ideas if things weren't possible within the query, although it does complicate things a bit. I actually just had another idea: How is the behaviour of window functions defined? i.e. if there is an ORDER BY clause are rows added/removed from the window in the order of the order by clause, or is the behaviour of row_number/rank/dense_rank special cased and only those functions guarantee the order? Because if window functions do follow there own order by, you could easily recover a deterministic evaluation order via the window specification. If not, I guess I'll have to give up and do it the "hard way". - Merijn ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
On 11/7/19 5:13 PM, Doug Currie wrote: > On Thu, Nov 7, 2019 at 4:23 PM Richard Damon > wrote: > >> One thought would be to generate a ‘hash’ from part of the record, maybe >> the record ID, and select records based on that value. The simplest would >> be something like id%100 == 0 would get you 1% of the records. That >> admittedly isn’t that random. >> >> Put the ID through a linear congruential generator, something like >> >> mod(a * Id + b, c) % 100 == 0 >> >> And you will pretty well scramble the selection >> >> > Yes, and if a, b, and c come from a randomization table, they can be > modified to obtain a different pseudo-random set. > > e a, b, and c should be chosen to give a reasonable random number generator (there are tables of good values), not be arbitrary values. The 100 and the 0 can be changed (or use some other test on the random number) to get different selections. -- Richard Damon ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
On Thu, Nov 7, 2019 at 4:23 PM Richard Damon wrote: > > One thought would be to generate a ‘hash’ from part of the record, maybe > the record ID, and select records based on that value. The simplest would > be something like id%100 == 0 would get you 1% of the records. That > admittedly isn’t that random. > > Put the ID through a linear congruential generator, something like > > mod(a * Id + b, c) % 100 == 0 > > And you will pretty well scramble the selection > > Yes, and if a, b, and c come from a randomization table, they can be modified to obtain a different pseudo-random set. e ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
> On Nov 7, 2019, at 2:15 PM, Merijn Verstraaten wrote: > > >> On 7 Nov 2019, at 19:16, David Raymond wrote: >> >> Along those lines SQLite includes the reverse_unordered_selects pragma >> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects >> which will flip the order it sends rows in queries that don't explicitly >> specify an ordering. It's there to assist you in finding spots in your code >> where you might be relying on implicit ordering when you really shouldn't be. > > Like the rest of this threads, this is just pointing out why the things in my > initial email don't work, but I already knew that. Which is why I asked for > help to see if there is a way to do what I want that *does* work. I don't > care particularly about the details of "can I control the order the condition > is evaluated", it's just that all reasonable ways to sample large streams > that I know would require a deterministic order. > > If someone has a different/better idea on how to return just a random sample > from a query in a repeatable way, I'm all ears. > > So far the only suggestion was "use some non-deterministic random sampling > method and store the result", but since my samples are large and I have lots > of them, this would balloon my storage by >100x and I don't have the > available storage to make that work. > > - Merijn > One thought would be to generate a ‘hash’ from part of the record, maybe the record ID, and select records based on that value. The simplest would be something like id%100 == 0 would get you 1% of the records. That admittedly isn’t that random. Put the ID through a linear congruential generator, something like mod(a * Id + b, c) % 100 == 0 And you will pretty well scramble the selection ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
We went off on a tangent, apologies. If you have contiguous integer primary keys, you could randomly sample that range of integers, then pull the records with those keys. Or in your external language of choice, sample the integers from 1 to the record count deterministically, select ordered by the primary key, and take the ones with the sampled offsets. (Stepping through 1 query and NOT doing a bunch of ...order by pk limit 1 offset n... queries) Making something quick in Python I might do something like: import random import sqlite3 conn = sqlite3.connect(dbFile, isolation_level = None) cur = conn.cursor() cur.execute("select count(*) from foo;") numRecords = cur.fetchone()[0] sampleSize = 10 random.seed(5) #Your deterministic seed here SampleOffsets = random.sample(range(1, numRecords + 1), sampleSize) SampleOffsets.sort() cur.execute("select * from foo order by primary_key;") currentOffset = 0 for selectedOffset in SampleOffsets: for _ in range(selectedOffset - currentOffset - 1): cur.fetchone() nextSampleRecord = cur.fetchone() currentOffset = selectedOffset doSomethingWithSample(nextSampleRecord) -Original Message- From: sqlite-users On Behalf Of Merijn Verstraaten Sent: Thursday, November 7, 2019 2:16 PM To: SQLite mailing list Subject: Re: [sqlite] Deterministic random sampling via SELECT > On 7 Nov 2019, at 19:16, David Raymond wrote: > > Along those lines SQLite includes the reverse_unordered_selects pragma > https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects > which will flip the order it sends rows in queries that don't explicitly > specify an ordering. It's there to assist you in finding spots in your code > where you might be relying on implicit ordering when you really shouldn't be. Like the rest of this threads, this is just pointing out why the things in my initial email don't work, but I already knew that. Which is why I asked for help to see if there is a way to do what I want that *does* work. I don't care particularly about the details of "can I control the order the condition is evaluated", it's just that all reasonable ways to sample large streams that I know would require a deterministic order. If someone has a different/better idea on how to return just a random sample from a query in a repeatable way, I'm all ears. So far the only suggestion was "use some non-deterministic random sampling method and store the result", but since my samples are large and I have lots of them, this would balloon my storage by >100x and I don't have the available storage to make that work. - Merijn ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
In the very old days before computers were common, a random number table appeared at the back of many statistical texts. This was used to select a series of random numbers which would then be used as look-up indices into some other data set. You could do the same: 1. generate a list of pseudo-random numbers, using a pre-defined seed value, over the range 1 .. count(*) of records in table, 2. use that list as record id values to select the desired subset of the data in the table. This would be done in two separate operations, possibly with a storage of the generated numbers in a separate table which could be used in the query of the main data. Since it is a pseudo-random number series, you can repeat it as often as needed using the same seed value. Chris On Thu, 7 Nov 2019, at 15:15, Merijn Verstraaten wrote: > > > On 7 Nov 2019, at 19:16, David Raymond wrote: > > > > Along those lines SQLite includes the reverse_unordered_selects pragma > > https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects > > which will flip the order it sends rows in queries that don't explicitly > > specify an ordering. It's there to assist you in finding spots in your code > > where you might be relying on implicit ordering when you really shouldn't > > be. > > Like the rest of this threads, this is just pointing out why the things > in my initial email don't work, but I already knew that. Which is why I > asked for help to see if there is a way to do what I want that *does* > work. I don't care particularly about the details of "can I control the > order the condition is evaluated", it's just that all reasonable ways > to sample large streams that I know would require a deterministic order. > > If someone has a different/better idea on how to return just a random > sample from a query in a repeatable way, I'm all ears. > > So far the only suggestion was "use some non-deterministic random > sampling method and store the result", but since my samples are large > and I have lots of them, this would balloon my storage by >100x and I > don't have the available storage to make that work. > > - Merijn > > ___ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > > Attachments: > * signature.asc ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
> > Regarding: "So far the only suggestion was "use some non-deterministic > random sampling method and store the result", but since my samples are > large and I have lots of them, this would balloon my storage by >100x and I > don't have the available storage to make that work." But Keith Medcalf was suggesting your table KeySet store only the *keys*, not the data itself, right? Are you saying storing the ROWIDs would still be prohibitively expensive? > ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
> On 7 Nov 2019, at 19:16, David Raymond wrote: > > Along those lines SQLite includes the reverse_unordered_selects pragma > https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects > which will flip the order it sends rows in queries that don't explicitly > specify an ordering. It's there to assist you in finding spots in your code > where you might be relying on implicit ordering when you really shouldn't be. Like the rest of this threads, this is just pointing out why the things in my initial email don't work, but I already knew that. Which is why I asked for help to see if there is a way to do what I want that *does* work. I don't care particularly about the details of "can I control the order the condition is evaluated", it's just that all reasonable ways to sample large streams that I know would require a deterministic order. If someone has a different/better idea on how to return just a random sample from a query in a repeatable way, I'm all ears. So far the only suggestion was "use some non-deterministic random sampling method and store the result", but since my samples are large and I have lots of them, this would balloon my storage by >100x and I don't have the available storage to make that work. - Merijn signature.asc Description: Message signed with OpenPGP ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
Along those lines SQLite includes the reverse_unordered_selects pragma https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects which will flip the order it sends rows in queries that don't explicitly specify an ordering. It's there to assist you in finding spots in your code where you might be relying on implicit ordering when you really shouldn't be. Also available as a compile time option: SQLITE_REVERSE_UNORDERED_SELECTS -Original Message- From: sqlite-users On Behalf Of Simon Slavin Sent: Thursday, November 7, 2019 12:16 PM To: SQLite mailing list Subject: Re: [sqlite] Deterministic random sampling via SELECT On 7 Nov 2019, at 1:56pm, David Raymond wrote: > Others will correct me if I'm wrong on that. No correction, but I wanted to add something. According to the theory of how SQL (not just SQLite, SQL) works, tables have no order. You can, in theory, query a table of 100 rows with SELECT a,b FROM MyTable LIMIT 5 ten times and get ten different answers, including different rows and/or the same rows in different orders. Given some of the text in the post that started this thread, I just wanted to make sure this was understood. In practise I have never seen a SQL engine which does this. Each SQL implementation seems to return the same result every time you repeat the same query. Though different SQL implementations can return different results. You'd have thought that at least one server/client system would return five rows which happened to be in the cache, but I've never seen this. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
On 7 Nov 2019, at 1:56pm, David Raymond wrote: > Others will correct me if I'm wrong on that. No correction, but I wanted to add something. According to the theory of how SQL (not just SQLite, SQL) works, tables have no order. You can, in theory, query a table of 100 rows with SELECT a,b FROM MyTable LIMIT 5 ten times and get ten different answers, including different rows and/or the same rows in different orders. Given some of the text in the post that started this thread, I just wanted to make sure this was understood. In practise I have never seen a SQL engine which does this. Each SQL implementation seems to return the same result every time you repeat the same query. Though different SQL implementations can return different results. You'd have thought that at least one server/client system would return five rows which happened to be in the cache, but I've never seen this. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
On Thursday, 7 November, 2019 04:55, Merijn Verstraaten wrote: >I'm trying sample a (deterministically) random subset of a SELECT query, You cannot have something that is both RANDOM and DETERMINISTIC at the same time. >the most common solution on the internet to get random samples seems to >be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has >some question marks, since it relies on seeding RANDOM and knowing the >RANDOM function is always evaluated in the same order every query run), This must be from StackOverflow because it is a terrible idea. >but looking at the query plans this materialises the entire result set in >memory for the sort (no surprise, I can't think of anyway that could work >otherwise) which is rather undesirable if the sample size becomes large >(i.e. several million rows). Exactly. >Now, I already know different ways to implement a predicate function that >can deterministically keep elements from a stream, however that relies on >having a deterministic order for the stream. Which brings us to SQLite. I >can easily write something like: >SELECT * >FROM (...) >WHERE my_predicate_fun() >ORDER BY column1, column2,... >And this *seems* to evaluate the where clause for each row in the order >determined by ORDER BY, but this doesn't seem at all guaranteed by the >SQL spec. That is correct. ORDER BY is executed as the LAST STEP in returning the query results. However, based on availability of indexes, the optimizer my decide to utilize those indexes in order to obtain results in already sorted or partially sorted order. The various WHERE clause terms will be checked at the appropriate place to minimize the number of candidates at each step. Since my_predicate_fun() is not dependent on any item it will be executed at the last step on each candidate (that has passed all the other where clause constraints) before that candidate is passed to the sorter. >So, is this behaviour documented/guaranteed somewhere? If not, >is there some way to guarantee my where clause is evaluated for each row >in a deterministic order? Yes, and no. For a given (constant) query of a given set of (constant) data with a given set of (constant) indexes on that data, the (same version) optimizer will always arrive at the same method of answering your query. If you change the query or the data or the indexes available or the version of the optimizer, then the most efficient plan for obtaining the results for which you asked will be used, which will thereafter be constant. That is, if the input is constant, and the software is constant, then the output and the how the software arrives at that output, will be constant. See https://www.sqlite.org/queryplanner-ng.html#qpstab You might want to read the whole page. >In the simple case like above I could always just evaluate the query >without the ORDER BY, step through the entire query, and evaluate the >predicate in the application, but if I want to use this random selection >as a subquery, then that doesn't work. You have changed the query. Each query is optimized individually. >And while I'm asking questions: What if I want to do the above, but >selecting groups of rows? So, sort of like: >SELECT * >FROM (...) >GROUP BY groupColumn >HAVING my_predicate_fun(); >But where I want to return all rows in the group, rather than an >aggregate. Then you must use the group result to select the data. Use of GROUP BY/HAVING implies returning GROUPs, not rows. You can of course use the selected groups to find the matching members of the group and return those. Example: select from join (select a, b from where group by a, b having my_predicate_function()) as g on .a = g.a AND .b = g.b where ; If you wish to generate a "static random sample" of the data, then why do you not just do it once and record the result, thereafter using that keyset to re-retrieve the same data whenever you need it? Eg: create table KeySet as select a.RowID as aRowID, b.RowID as bRowID, c.RowID as cRowID from a, b, c where ; and thereafter select from KeySet join a on a.rowid == KeySet.aRowID join b on b.rowid == KeySet.bRowID join c on c.rowid == KeySet.cRowID -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Deterministic random sampling via SELECT
"So, is this behaviour documented/guaranteed somewhere?" Short version is: Nope. The engine is free to do whatever it wants as long as it gives the correct result in the end. Consider a simple select * from foo where predicate order by non_indexed_field; Since there is no nice ordering of the data already it's going to have to sort it. In which case it's probably going to check the predicate against records on the way _in_ to the sorter rather than _after_ sorting. Think "I might only have to sort 5 things instead of 5 million, so let's filter out as much as we can as soon as possible." And since the same data could be on the disk with its pages in any order you could conceive a situation where the same data could be processed differently depending on the specific file that it's in. The same query could process the data in a different order before and after a vacuum for example. Or maybe it does sort first then check. But that's an internal detail which could change every version. And all that is all considered fine, as the end result of the query will still be correct and in the order specified. The closest thing you can do is limit a query to using a specific index during a query, but even then you're basically relying on implementation details, and not a guarantee. Others will correct me if I'm wrong on that. -Original Message- From: sqlite-users On Behalf Of Merijn Verstraaten Sent: Thursday, November 7, 2019 6:55 AM To: sqlite-users@mailinglists.sqlite.org Subject: [sqlite] Deterministic random sampling via SELECT I'm trying sample a (deterministically) random subset of a SELECT query, the most common solution on the internet to get random samples seems to be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, since it relies on seeding RANDOM and knowing the RANDOM function is always evaluated in the same order every query run), but looking at the query plans this materialises the entire result set in memory for the sort (no surprise, I can't think of anyway that could work otherwise) which is rather undesirable if the sample size becomes large (i.e. several million rows). Now, I already know different ways to implement a predicate function that can deterministically keep elements from a stream, however that relies on having a deterministic order for the stream. Which brings us to SQLite. I can easily write something like: SELECT * FROM (...) WHERE my_predicate_fun() ORDER BY column1, column2,... And this *seems* to evaluate the where clause for each row in the order determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL spec. So, is this behaviour documented/guaranteed somewhere? If not, is there some way to guarantee my where clause is evaluated for each row in a deterministic order? In the simple case like above I could always just evaluate the query without the ORDER BY, step through the entire query, and evaluate the predicate in the application, but if I want to use this random selection as a subquery, then that doesn't work. And while I'm asking questions: What if I want to do the above, but selecting groups of rows? So, sort of like: SELECT * FROM (...) GROUP BY groupColumn HAVING my_predicate_fun(); But where I want to return all rows in the group, rather than an aggregate. Thanks in advance, Merijn ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Deterministic random sampling via SELECT
I'm trying sample a (deterministically) random subset of a SELECT query, the most common solution on the internet to get random samples seems to be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, since it relies on seeding RANDOM and knowing the RANDOM function is always evaluated in the same order every query run), but looking at the query plans this materialises the entire result set in memory for the sort (no surprise, I can't think of anyway that could work otherwise) which is rather undesirable if the sample size becomes large (i.e. several million rows). Now, I already know different ways to implement a predicate function that can deterministically keep elements from a stream, however that relies on having a deterministic order for the stream. Which brings us to SQLite. I can easily write something like: SELECT * FROM (...) WHERE my_predicate_fun() ORDER BY column1, column2,... And this *seems* to evaluate the where clause for each row in the order determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL spec. So, is this behaviour documented/guaranteed somewhere? If not, is there some way to guarantee my where clause is evaluated for each row in a deterministic order? In the simple case like above I could always just evaluate the query without the ORDER BY, step through the entire query, and evaluate the predicate in the application, but if I want to use this random selection as a subquery, then that doesn't work. And while I'm asking questions: What if I want to do the above, but selecting groups of rows? So, sort of like: SELECT * FROM (...) GROUP BY groupColumn HAVING my_predicate_fun(); But where I want to return all rows in the group, rather than an aggregate. Thanks in advance, Merijn ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users