Re: [sqlite] Deterministic random sampling via SELECT

2019-11-08 Thread Merijn Verstraaten
On 7 Nov 2019, at 20:47, Chris Peachment  wrote:
>  1. generate a list of pseudo-random numbers, using a pre-defined
> seed value, over the range 1 .. count(*) of records in table,
> 
>  2. use that list as record id values to select the desired subset
> of the data in the table.
> 
> This would be done in two separate operations, possibly with a
> storage of the generated numbers in a separate table which could
> be used in the query of the main data.

Yeah, this and some of the other ideas were some things I considered as 
fallback ideas if things weren't possible within the query, although it does 
complicate things a bit.

I actually just had another idea: How is the behaviour of window functions 
defined? i.e. if there is an ORDER BY clause are rows added/removed from the 
window in the order of the order by clause, or is the behaviour of 
row_number/rank/dense_rank special cased and only those functions guarantee the 
order?

Because if window functions do follow there own order by, you could easily 
recover a deterministic evaluation order via the window specification.

If not, I guess I'll have to give up and do it the "hard way".

- Merijn
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Richard Damon
On 11/7/19 5:13 PM, Doug Currie wrote:
> On Thu, Nov 7, 2019 at 4:23 PM Richard Damon 
> wrote:
>
>> One thought would be to generate a ‘hash’ from part of the record, maybe
>> the record ID, and select records based on that value. The simplest would
>> be something like id%100 == 0 would get you 1% of the records. That
>> admittedly isn’t that random.
>>
>> Put the ID through a linear congruential generator, something like
>>
>> mod(a * Id + b, c) % 100 == 0
>>
>> And you will pretty well scramble the selection
>>
>>
> Yes, and if a, b, and c come from a randomization table, they can be
> modified to obtain a different pseudo-random set.
>
> e
a, b, and c should be chosen to give a reasonable random number
generator (there are tables of good values), not be arbitrary values.
The 100 and the 0 can be changed (or use some other test on the random
number) to get different selections.

-- 
Richard Damon

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Doug Currie
On Thu, Nov 7, 2019 at 4:23 PM Richard Damon 
wrote:

>
> One thought would be to generate a ‘hash’ from part of the record, maybe
> the record ID, and select records based on that value. The simplest would
> be something like id%100 == 0 would get you 1% of the records. That
> admittedly isn’t that random.
>
> Put the ID through a linear congruential generator, something like
>
> mod(a * Id + b, c) % 100 == 0
>
> And you will pretty well scramble the selection
>
>
Yes, and if a, b, and c come from a randomization table, they can be
modified to obtain a different pseudo-random set.

e
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Richard Damon
> On Nov 7, 2019, at 2:15 PM, Merijn Verstraaten  wrote:
> 
> 
>> On 7 Nov 2019, at 19:16, David Raymond  wrote:
>> 
>> Along those lines SQLite includes the reverse_unordered_selects pragma
>> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
>> which will flip the order it sends rows in queries that don't explicitly 
>> specify an ordering. It's there to assist you in finding spots in your code 
>> where you might be relying on implicit ordering when you really shouldn't be.
> 
> Like the rest of this threads, this is just pointing out why the things in my 
> initial email don't work, but I already knew that. Which is why I asked for 
> help to see if there is a way to do what I want that *does* work. I don't 
> care particularly about the details of "can I control the order the condition 
> is evaluated", it's just that all reasonable ways to sample large streams 
> that I know would require a deterministic order.
> 
> If someone has a different/better idea on how to return just a random sample 
> from a query in a repeatable way, I'm all ears.
> 
> So far the only suggestion was "use some non-deterministic random sampling 
> method and store the result", but since my samples are large and I have lots 
> of them, this would balloon my storage by >100x and I don't have the 
> available storage to make that work.
> 
> - Merijn
> 

One thought would be to generate a ‘hash’ from part of the record, maybe the 
record ID, and select records based on that value. The simplest would be 
something like id%100 == 0 would get you 1% of the records. That admittedly 
isn’t that random.

Put the ID through a linear congruential generator, something like

mod(a * Id + b, c) % 100 == 0

And you will pretty well scramble the selection
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread David Raymond
We went off on a tangent, apologies.

If you have contiguous integer primary keys, you could randomly sample that 
range of integers, then pull the records with those keys.

Or in your external language of choice, sample the integers from 1 to the 
record count deterministically, select ordered by the primary key, and take the 
ones with the sampled offsets. (Stepping through 1 query and NOT doing a bunch 
of ...order by pk limit 1 offset n... queries)

Making something quick in Python I might do something like:


import random
import sqlite3

conn = sqlite3.connect(dbFile, isolation_level = None)
cur = conn.cursor()
cur.execute("select count(*) from foo;")
numRecords = cur.fetchone()[0]
sampleSize = 10
random.seed(5) #Your deterministic seed here
SampleOffsets = random.sample(range(1, numRecords + 1), sampleSize)
SampleOffsets.sort()
cur.execute("select * from foo order by primary_key;")
currentOffset = 0
for selectedOffset in SampleOffsets:
for _ in range(selectedOffset - currentOffset - 1):
cur.fetchone()
nextSampleRecord = cur.fetchone()
currentOffset = selectedOffset
doSomethingWithSample(nextSampleRecord)



-Original Message-
From: sqlite-users  On Behalf Of 
Merijn Verstraaten
Sent: Thursday, November 7, 2019 2:16 PM
To: SQLite mailing list 
Subject: Re: [sqlite] Deterministic random sampling via SELECT


> On 7 Nov 2019, at 19:16, David Raymond  wrote:
> 
> Along those lines SQLite includes the reverse_unordered_selects pragma
> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
> which will flip the order it sends rows in queries that don't explicitly 
> specify an ordering. It's there to assist you in finding spots in your code 
> where you might be relying on implicit ordering when you really shouldn't be.

Like the rest of this threads, this is just pointing out why the things in my 
initial email don't work, but I already knew that. Which is why I asked for 
help to see if there is a way to do what I want that *does* work. I don't care 
particularly about the details of "can I control the order the condition is 
evaluated", it's just that all reasonable ways to sample large streams that I 
know would require a deterministic order.

If someone has a different/better idea on how to return just a random sample 
from a query in a repeatable way, I'm all ears.

So far the only suggestion was "use some non-deterministic random sampling 
method and store the result", but since my samples are large and I have lots of 
them, this would balloon my storage by >100x and I don't have the available 
storage to make that work.

- Merijn
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Chris Peachment
In the very old days before computers were common, a random number
table appeared at the back of many statistical texts. This was used
to select a series of random numbers which would then be used as
look-up indices into some other data set.

You could do the same:

  1. generate a list of pseudo-random numbers, using a pre-defined
 seed value, over the range 1 .. count(*) of records in table,

  2. use that list as record id values to select the desired subset
 of the data in the table.

This would be done in two separate operations, possibly with a
storage of the generated numbers in a separate table which could
be used in the query of the main data.

Since it is a pseudo-random number series, you can repeat it as
often as needed using the same seed value.

Chris


On Thu, 7 Nov 2019, at 15:15, Merijn Verstraaten wrote:
> 
> > On 7 Nov 2019, at 19:16, David Raymond  wrote:
> > 
> > Along those lines SQLite includes the reverse_unordered_selects pragma
> > https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
> > which will flip the order it sends rows in queries that don't explicitly 
> > specify an ordering. It's there to assist you in finding spots in your code 
> > where you might be relying on implicit ordering when you really shouldn't 
> > be.
> 
> Like the rest of this threads, this is just pointing out why the things 
> in my initial email don't work, but I already knew that. Which is why I 
> asked for help to see if there is a way to do what I want that *does* 
> work. I don't care particularly about the details of "can I control the 
> order the condition is evaluated", it's just that all reasonable ways 
> to sample large streams that I know would require a deterministic order.
> 
> If someone has a different/better idea on how to return just a random 
> sample from a query in a repeatable way, I'm all ears.
> 
> So far the only suggestion was "use some non-deterministic random 
> sampling method and store the result", but since my samples are large 
> and I have lots of them, this would balloon my storage by >100x and I 
> don't have the available storage to make that work.
> 
> - Merijn
> 
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> 
> Attachments:
> * signature.asc
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Donald Griggs
>
> Regarding: "So far the only suggestion was "use some non-deterministic
> random sampling method and store the result", but since my samples are
> large and I have lots of them, this would balloon my storage by >100x and I
> don't have the available storage to make that work."


But Keith Medcalf was suggesting your table KeySet store only the *keys*,
not the data itself, right?
Are you saying storing the ROWIDs would still be prohibitively expensive?

>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Merijn Verstraaten

> On 7 Nov 2019, at 19:16, David Raymond  wrote:
> 
> Along those lines SQLite includes the reverse_unordered_selects pragma
> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
> which will flip the order it sends rows in queries that don't explicitly 
> specify an ordering. It's there to assist you in finding spots in your code 
> where you might be relying on implicit ordering when you really shouldn't be.

Like the rest of this threads, this is just pointing out why the things in my 
initial email don't work, but I already knew that. Which is why I asked for 
help to see if there is a way to do what I want that *does* work. I don't care 
particularly about the details of "can I control the order the condition is 
evaluated", it's just that all reasonable ways to sample large streams that I 
know would require a deterministic order.

If someone has a different/better idea on how to return just a random sample 
from a query in a repeatable way, I'm all ears.

So far the only suggestion was "use some non-deterministic random sampling 
method and store the result", but since my samples are large and I have lots of 
them, this would balloon my storage by >100x and I don't have the available 
storage to make that work.

- Merijn


signature.asc
Description: Message signed with OpenPGP
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread David Raymond
Along those lines SQLite includes the reverse_unordered_selects pragma
https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
which will flip the order it sends rows in queries that don't explicitly 
specify an ordering. It's there to assist you in finding spots in your code 
where you might be relying on implicit ordering when you really shouldn't be.

Also available as a compile time option: SQLITE_REVERSE_UNORDERED_SELECTS


-Original Message-
From: sqlite-users  On Behalf Of 
Simon Slavin
Sent: Thursday, November 7, 2019 12:16 PM
To: SQLite mailing list 
Subject: Re: [sqlite] Deterministic random sampling via SELECT

On 7 Nov 2019, at 1:56pm, David Raymond  wrote:

> Others will correct me if I'm wrong on that.

No correction, but I wanted to add something.

According to the theory of how SQL (not just SQLite, SQL) works, tables have no 
order.  You can, in theory, query a table of 100 rows with

SELECT a,b FROM MyTable LIMIT 5

ten times and get ten different answers, including different rows and/or the 
same rows in different orders.  Given some of the text in the post that started 
this thread, I just wanted to make sure this was understood.

In practise I have never seen a SQL engine which does this.  Each SQL 
implementation seems to return the same result every time you repeat the same 
query.  Though different SQL implementations can return different results.

You'd have thought that at least one server/client system would return five 
rows which happened to be in the cache, but I've never seen this.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Simon Slavin
On 7 Nov 2019, at 1:56pm, David Raymond  wrote:

> Others will correct me if I'm wrong on that.

No correction, but I wanted to add something.

According to the theory of how SQL (not just SQLite, SQL) works, tables have no 
order.  You can, in theory, query a table of 100 rows with

SELECT a,b FROM MyTable LIMIT 5

ten times and get ten different answers, including different rows and/or the 
same rows in different orders.  Given some of the text in the post that started 
this thread, I just wanted to make sure this was understood.

In practise I have never seen a SQL engine which does this.  Each SQL 
implementation seems to return the same result every time you repeat the same 
query.  Though different SQL implementations can return different results.

You'd have thought that at least one server/client system would return five 
rows which happened to be in the cache, but I've never seen this.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Keith Medcalf
On Thursday, 7 November, 2019 04:55, Merijn Verstraaten 
 wrote:

>I'm trying sample a (deterministically) random subset of a SELECT query,

You cannot have something that is both RANDOM and DETERMINISTIC at the same 
time.

>the most common solution on the internet to get random samples seems to
>be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has
>some question marks, since it relies on seeding RANDOM and knowing the
>RANDOM function is always evaluated in the same order every query run),

This must be from StackOverflow because it is a terrible idea.

>but looking at the query plans this materialises the entire result set in
>memory for the sort (no surprise, I can't think of anyway that could work
>otherwise) which is rather undesirable if the sample size becomes large
>(i.e. several million rows).

Exactly.

>Now, I already know different ways to implement a predicate function that
>can deterministically keep elements from a stream, however that relies on
>having a deterministic order for the stream. Which brings us to SQLite. I
>can easily write something like:

>SELECT *
>FROM (...)
>WHERE my_predicate_fun()
>ORDER BY column1, column2,...

>And this *seems* to evaluate the where clause for each row in the order
>determined by ORDER BY, but this doesn't seem at all guaranteed by the
>SQL spec. 

That is correct.  ORDER BY is executed as the LAST STEP in returning the query 
results.  However, based on availability of indexes, the optimizer my decide to 
utilize those indexes in order to obtain results in already sorted or partially 
sorted order.  The various WHERE clause terms will be checked at the 
appropriate place to minimize the number of candidates at each step.  Since 
my_predicate_fun() is not dependent on any item it will be executed at the last 
step on each candidate (that has passed all the other where clause constraints) 
before that candidate is passed to the sorter.

>So, is this behaviour documented/guaranteed somewhere? If not,
>is there some way to guarantee my where clause is evaluated for each row
>in a deterministic order?

Yes, and no.  For a given (constant) query of a given set of (constant) data 
with a given set of (constant) indexes on that data, the (same version) 
optimizer will always arrive at the same method of answering your query.  If 
you change the query or the data or the indexes available or the version of the 
optimizer, then the most efficient plan for obtaining the results for which you 
asked will be used, which will thereafter be constant.  That is, if the input 
is constant, and the software is constant, then the output and the how the 
software arrives at that output, will be constant.

See https://www.sqlite.org/queryplanner-ng.html#qpstab

You might want to read the whole page.

>In the simple case like above I could always just evaluate the query
>without the ORDER BY, step through the entire query, and evaluate the
>predicate in the application, but if I want to use this random selection
>as a subquery, then that doesn't work.

You have changed the query.  Each query is optimized individually.

>And while I'm asking questions: What if I want to do the above, but
>selecting groups of rows? So, sort of like:

>SELECT *
>FROM (...)
>GROUP BY groupColumn
>HAVING my_predicate_fun();

>But where I want to return all rows in the group, rather than an
>aggregate.

Then you must use the group result to select the data.  Use of GROUP BY/HAVING 
implies returning GROUPs, not rows.  You can of course use the selected groups 
to find the matching members of the group and return those.  Example:

select 
  from 
  join (select a, b 
  from  
 where  
  group by a, b 
having my_predicate_function()) as g
on .a = g.a AND .b = g.b
 where 
;

If you wish to generate a "static random sample" of the data, then why do you 
not just do it once and record the result, thereafter using that keyset to 
re-retrieve the same data whenever you need it?  Eg:

create table KeySet as
select a.RowID as aRowID, b.RowID as bRowID, c.RowID as cRowID
  from a, b, c
 where ;

and thereafter

select 
  from KeySet
  join a on a.rowid == KeySet.aRowID
  join b on b.rowid == KeySet.bRowID
  join c on c.rowid == KeySet.cRowID

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread David Raymond
"So, is this behaviour documented/guaranteed somewhere?"
Short version is: Nope. The engine is free to do whatever it wants as long as 
it gives the correct result in the end.

Consider a simple
select * from foo where predicate order by non_indexed_field;
Since there is no nice ordering of the data already it's going to have to sort 
it. In which case it's probably going to check the predicate against records on 
the way _in_ to the sorter rather than _after_ sorting. Think "I might only 
have to sort 5 things instead of 5 million, so let's filter out as much as we 
can as soon as possible." And since the same data could be on the disk with its 
pages in any order you could conceive a situation where the same data could be 
processed differently depending on the specific file that it's in. The same 
query could process the data in a different order before and after a vacuum for 
example.

Or maybe it does sort first then check. But that's an internal detail which 
could change every version. And all that is all considered fine, as the end 
result of the query will still be correct and in the order specified.

The closest thing you can do is limit a query to using a specific index during 
a query, but even then you're basically relying on implementation details, and 
not a guarantee.

Others will correct me if I'm wrong on that.



-Original Message-
From: sqlite-users  On Behalf Of 
Merijn Verstraaten
Sent: Thursday, November 7, 2019 6:55 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: [sqlite] Deterministic random sampling via SELECT

I'm trying sample a (deterministically) random subset of a SELECT query, the 
most common solution on the internet to get random samples seems to be "SELECT 
* FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, 
since it relies on seeding RANDOM and knowing the RANDOM function is always 
evaluated in the same order every query run), but looking at the query plans 
this materialises the entire result set in memory for the sort (no surprise, I 
can't think of anyway that could work otherwise) which is rather undesirable if 
the sample size becomes large (i.e. several million rows).

Now, I already know different ways to implement a predicate function that can 
deterministically keep elements from a stream, however that relies on having a 
deterministic order for the stream. Which brings us to SQLite. I can easily 
write something like:

SELECT *
FROM (...)
WHERE my_predicate_fun()
ORDER BY column1, column2,...

And this *seems* to evaluate the where clause for each row in the order 
determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL 
spec. So, is this behaviour documented/guaranteed somewhere? If not, is there 
some way to guarantee my where clause is evaluated for each row in a 
deterministic order?

In the simple case like above I could always just evaluate the query without 
the ORDER BY, step through the entire query, and evaluate the predicate in the 
application, but if I want to use this random selection as a subquery, then 
that doesn't work.

And while I'm asking questions: What if I want to do the above, but selecting 
groups of rows? So, sort of like:

SELECT *
FROM (...)
GROUP BY groupColumn
HAVING my_predicate_fun();

But where I want to return all rows in the group, rather than an aggregate.

Thanks in advance,
Merijn
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Deterministic random sampling via SELECT

2019-11-07 Thread Merijn Verstraaten
I'm trying sample a (deterministically) random subset of a SELECT query, the 
most common solution on the internet to get random samples seems to be "SELECT 
* FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, 
since it relies on seeding RANDOM and knowing the RANDOM function is always 
evaluated in the same order every query run), but looking at the query plans 
this materialises the entire result set in memory for the sort (no surprise, I 
can't think of anyway that could work otherwise) which is rather undesirable if 
the sample size becomes large (i.e. several million rows).

Now, I already know different ways to implement a predicate function that can 
deterministically keep elements from a stream, however that relies on having a 
deterministic order for the stream. Which brings us to SQLite. I can easily 
write something like:

SELECT *
FROM (...)
WHERE my_predicate_fun()
ORDER BY column1, column2,...

And this *seems* to evaluate the where clause for each row in the order 
determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL 
spec. So, is this behaviour documented/guaranteed somewhere? If not, is there 
some way to guarantee my where clause is evaluated for each row in a 
deterministic order?

In the simple case like above I could always just evaluate the query without 
the ORDER BY, step through the entire query, and evaluate the predicate in the 
application, but if I want to use this random selection as a subquery, then 
that doesn't work.

And while I'm asking questions: What if I want to do the above, but selecting 
groups of rows? So, sort of like:

SELECT *
FROM (...)
GROUP BY groupColumn
HAVING my_predicate_fun();

But where I want to return all rows in the group, rather than an aggregate.

Thanks in advance,
Merijn
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users