[sqlite] Efficient random sampling in a large table using builtin functions.

Benoit Mortgat Thu, 08 Mar 2012 02:10:48 -0800

I have a table with millions of records.

When I run a query with


    ORDER BY random() LIMIT N;

the RANDOM() function is evaluated against all rows of my result set,
then sorting occurs, and as a result the query is slow.

In this case the query could be rewritten as:

 * Generate N, random, row numbers between 1 and (SELECT COUNT(*) FROM
   the_table_name). Maybe using remainder operator % and builtin ABS()
   and RANDOM() functions can help (see below)
 * SELECT FROM the_table WHERE rowid IN (those random numbers)

For the moment the most simple query I can think of is:

   SELECT *
     FROM my_table
    WHERE rowid IN
        ( SELECT 1 + (ABS(r.x) % c.num_rows)
            FROM (SELECT COUNT(*) AS num_rows
                    FROM my_table
                 ) AS c
           CROSS JOIN
                 (SELECT random() x
                    FROM my_table
                   LIMIT N
                 ) AS r
        );

This can however return less than N rows if by chance two random numbers
have the same remainder modulo COUNT(*) FROM my_table.

Note that the generation of N random numbers is quick because there is
no ORDER BY involved that would require computation of as many random
numbers as there are rows in the original table.

This could maybe be optimized inside SQLite if those conditions are met:

 * random() hasn't been overridden with sqlite_create_function() nor
   sqlite_create_function_v2()
 * the user SELECTs FROM a table without joins and w/o WHERE conditions.

-- 
Benoit Mortgat
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Efficient random sampling in a large table using builtin functions.

Reply via email to