[EMAIL PROTECTED] writes:

> Thomas Chust <[EMAIL PROTECTED]> wrote:
>> Hello,
>> 
>> I have a table of strings and integer primary keys from which I would like 
>> to retrieve a string at random. The best solution I could think of was to 
>> first do a
>>     SELECT count(id) FROM strings;
>> and then a
>>     SELECT string FROM strings LIMIT 1 OFFSET ?;
>> where the parameter is set to an integer in the range [0, count(id)[. This 
>> is not exactly very fast, though. I guess because the second statement 
>> steps through all the rows up to the one that is to be returned.
>> 
>> I can't really use a random primary key to look up a value, because not 
>> every key in the interval [0, count(id)[ is necessarily in use.
>> 
>> And all the constructions I tried that involved random() or a randomly 
>> filled column in an ORDER BY clause yield even worse performance than the 
>> above approach.
>> 
>> Does anybody have a good idea how this could be implemented more 
>> efficiently?
>> 
>
> I cannot think of a more efficient way to do this if you
> require each output to have equal probability.  If you do
> not need each string to have exactly the same probability,
> however, you could do this:
>
>    SELECT string FROM strings 
>    WHERE rowid>=random() % (SELECT max(rowid) FROM strings)
>    LIMIT 1;

If the ROWID values (your INTEGER PRIMARY KEY) are always auto-generated,
i.e. if you never set them explicitely and don't care about their actual
value, then the only time you'll have holes in the ROWID sequence is due to
deletes.  You can solve that by adding an ON DELETE trigger that replaces
MAX(ROWID) with old.ROWID.  This keeps the ROWID sequence sequential so a
simple "WHERE ROWID=random() % (SELECT MAX(ROWID) FROM strings)" clause will
have equal probability.

Note that you do NOT want to use AUTOINCREMENT in creating your table in this
case, because you want new rows to have the next available ROWID value, vs the
next never-used ROWID value.

    sqlite> CREATE TABLE strings
       ...> (
       ...>   pk INTEGER PRIMARY KEY,
       ...>   string TEXT
       ...> );
    sqlite> CREATE TRIGGER strings_delete_tr
       ...>   AFTER DELETE
       ...>   ON strings
       ...>   FOR EACH ROW
       ...>   BEGIN
       ...>     UPDATE strings
       ...>       SET pk = old.pk
       ...>       WHERE pk = (SELECT MAX(pk) FROM strings);
       ...>   END;
    sqlite> INSERT INTO strings (pk, string) VALUES (NULL, 'first string');
    sqlite> INSERT INTO strings (pk, string) VALUES (NULL, 'second string');
    sqlite> INSERT INTO strings (pk, string) VALUES (NULL, 'third string');
    sqlite> INSERT INTO strings (pk, string) VALUES (NULL, 'fourth string');
    sqlite> .mode line
    sqlite> SELECT * FROM strings;
        pk = 1
    string = first string

        pk = 2
    string = second string

        pk = 3
    string = third string

        pk = 4
    string = fourth string
    sqlite> DELETE FROM strings WHERE pk = 2;
    sqlite> SELECT * FROM strings;
        pk = 1
    string = first string

        pk = 2
    string = fourth string

        pk = 3
    string = third string
    sqlite>

I'm not sure of the efficiency of MAX(ROWID) in the trigger.  If you have lots
of deletes, it's possible that you can get more speed out of the trigger by
selecting the first row in descending order:

    sqlite> CREATE TRIGGER strings_delete_tr
       ...>   AFTER DELETE
       ...>   ON strings
       ...>   FOR EACH ROW
       ...>   BEGIN
       ...>   UPDATE strings
       ...>     SET pk = old.pk
       ...>     WHERE pk = (SELECT pk FROM strings ORDER BY pk DESC LIMIT 1);
       ...>   END;

I'm also not sure if a primary key automatic index is efficient for descending
look-ups.  If not, then you can create an explicit index for that:

    sqlite> CREATE INDEX strings_desc_pk_idx
       ...>   ON strings(pk DESC);

If using the descending order select really is more efficient, then the
request for a random row could also be more efficiently written as:

  WHERE ROWID=random() % (SELECT pk FROM strings ORDER BY pk DESC LIMIT 1)

Hopefully Dr. Hipp will comment on these three efficiency questions.

Cheers,

Derrell

Reply via email to