ORDER BY RAND() performance

2004-03-08 Thread Neil Gunton
Hi all,

I am using MySQL 4.0.x to run a community website which has (among other
things) over 19,000 pictures. There is a page that selects 30 random
thumbnails, and I have noticed that ORDER BY RAND() on this table has a
significant impact on performance. I have all the relevant indexes
defined, and I have researched this issue on the Web. It seems that
other people have also encountered a performance hit when using ORDER BY
RAND(). The reason appears to be that when you run EXPLAIN on such a
query, MySQL reports "Using temporary; Using filesort", which is the
worst possible result. Also, the number of rows reported is pretty much
the entire set. So, presumably, the current implementation of ORDER BY
RAND() means that MySQL has to traverse the entire table, regardless of
other indexes.

There are, of course, ways to work around this, but they are all more
complex than simply using ORDER BY RAND(). I think that selecting a
random set of records from a table is something that a lot of websites
would like to be able to do, so as datasets get larger it would be nice
to see this function scale well. For anyone who has a website with a
large archive of data, the ability to present a random selection of that
data is very useful.

I would like to know whether the MySQL team is aware of this problem,
and if so, whether they are planning to improve it at some point. I ask
mainly because if I am told that yes, it'll be much better in version X,
then I can live with the couple of seconds it takes currently, knowing
that this will be better down the line. However, if I am advised that
this is a fundamentally hard problem for whatever reason, then I will
put the effort into reworking my tables to use an alternative solution.

The only real solution that I can see which is fast is to make another
table which contains just the unique IDs of the pictures that are
visible (there are others which are not publicly visible, and which
shouldn't be included in the random query, so making a separate table
with the appropriate subset makes sense for performance). This new table
will have a primary key which is a numeric sequence field. Every
record will have its own sequence number, going from 1 up to the number
of records. Then, instead of doing one query with ORDER BY RAND() LIMIT
30, I can do 30 queries, each with a different random sequence number
(generated from Perl), looking the record up by that unique number.
Since this is a primary key, the lookups will be very fast, so doing 30
queries will not have a big performance impact. However, this scheme
requires that the sequences in the new table be kept very consistent -
for example, if a picture is removed from the sequence then the sequence
numbers above that record have to be updated. This introduces potential
for error, but it is a possible solution. I don't want to implement it,
obviously, if ORDER BY RAND() is slated for improvement.
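For illustration, the sequence-table scheme above might look something like the following sketch, with Python and SQLite standing in for the Perl/MySQL setup described in the thread; the seq_pics table and its columns are hypothetical names:

```python
import random
import sqlite3

# Hypothetical lookup table: seq is a dense 1..N sequence, pic_id the real key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq_pics (seq INTEGER PRIMARY KEY, pic_id INTEGER)")
conn.executemany("INSERT INTO seq_pics VALUES (?, ?)",
                 [(i, 1000 + i) for i in range(1, 19001)])

def random_pics(conn, n=30):
    total = conn.execute("SELECT COUNT(*) FROM seq_pics").fetchone()[0]
    # n fast primary-key lookups instead of one ORDER BY RAND() table scan
    seqs = random.sample(range(1, total + 1), n)
    return [conn.execute("SELECT pic_id FROM seq_pics WHERE seq = ?",
                         (s,)).fetchone()[0] for s in seqs]

pics = random_pics(conn)
print(len(pics))  # → 30
```

Note that random.sample draws the 30 sequence numbers without replacement, so the result contains 30 distinct pictures, which is what ORDER BY RAND() gives for free.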

Thanks for any ideas or insights...

-Neil Gunton

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]



Re: ORDER BY RAND() performance

2004-03-08 Thread colbey

If you're in fact (it sounds like) storing both the pictures' meta-data
(name, size, owner, etc.) and the data itself (a blob of some kind), I
would definitely break the design up into 2 tables. That way, when
dealing with the meta-data table (your RAND() query), there is much less
data that needs to be traversed to get your answer, which should result
in a faster query.
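As a sketch of that split (SQLite standing in for MySQL here; the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pic_meta (          -- narrow table: scanned by the RAND() query
    pic_id INTEGER PRIMARY KEY,
    name   TEXT,
    size   INTEGER,
    owner  TEXT
);
CREATE TABLE pic_data (          -- wide table: blobs fetched by pic_id only
    pic_id INTEGER PRIMARY KEY REFERENCES pic_meta(pic_id),
    data   BLOB
);
""")
conn.execute("INSERT INTO pic_meta VALUES (1, 'cat.jpg', 12345, 'neil')")
conn.execute("INSERT INTO pic_data VALUES (1, ?)", (b"\xff\xd8...",))

# The randomized query now traverses only the small meta rows.
row = conn.execute(
    "SELECT pic_id FROM pic_meta ORDER BY RANDOM() LIMIT 1").fetchone()
print(row[0])  # → 1 (the only picture in this toy data)
```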




On Mon, 8 Mar 2004, Neil Gunton wrote:

 [original message snipped]





Re: ORDER BY RAND() performance

2004-03-08 Thread Neil Gunton
[EMAIL PROTECTED] wrote:
 
 If you're in fact (it sounds like) storing both the pictures' meta-data
 (name, size, owner, etc.) and the data itself (a blob of some kind), I
 would definitely break the design up into 2 tables. That way, when
 dealing with the meta-data table (your RAND() query), there is much less
 data that needs to be traversed to get your answer, which should result
 in a faster query.

Thanks! This is definitely good advice, but unfortunately it doesn't
solve the RAND() slowness. I have been testing with a separate table
that ONLY contains the id of the pics, and as it grows toward 100,000
records this simple query does get noticeably slower:

SELECT * FROM visible_pics ORDER BY RAND() LIMIT 30;

Where visible_pics just has two numeric ID fields (pic_id and doc_id).
It doesn't seem to matter if I make pic_id a primary key or not. I think
I've reduced it to pretty much the minimal case, given that I want a
random selection of ALL the records.

I don't know the internals of how MySQL could optimize this sort of
thing, but I was thinking that perhaps there was some kind of internal
trickery it could do to select random record positions and then get
those very quickly, without having to traverse the entire table. I think
if the table has no varchar fields then it should be easy (at least in
MyISAM) to calculate the record position based on the record number. So
I think it *should* in theory be possible to optimize this, but I just
don't know if anyone has realized that it's an issue, or if they are
planning on doing anything about it. Any insights from MySQL internal
developers? Or should I be posting this to the internals list?
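The fixed-width-row intuition above can be shown with back-of-envelope arithmetic; the row size here is an assumption for illustration, not the actual MyISAM on-disk format:

```python
# Assumed fixed-width layout: 1 status byte + two 4-byte unsigned ints.
ROW_SIZE = 9

def row_offset(row_number):
    """Byte offset of a fixed-width row in the data file (rows counted from 0)."""
    return row_number * ROW_SIZE

# Jumping straight to a random record is O(1) if rows never vary in width.
print(row_offset(19633))  # → 176697
```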

Thanks again,

-Neil




RE: ORDER BY RAND() performance

2004-03-08 Thread Donny Simonton
ORDER BY RAND() just sucks, in my opinion. We have created our own internal
randomization system, because pretty much every time you use ORDER BY RAND()
it will show up in the slow query log, because of the "Using temporary;
Using filesort" it does. Splitting your data into a hundred tables will
still leave it using temporary, using filesort.

I just did a little test where I only had 5 entries in a table, and I still
got using temporary, using filesort.

Will it ever be improved? Probably around the same time ORDER BY ... DESC is
improved.

Donny



 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 08, 2004 2:05 PM
 To: Neil Gunton
 Cc: MySQL
 Subject: Re: ORDER BY RAND() performance


 If you're in fact (it sounds like) storing both the pictures' meta-data
 (name, size, owner, etc.) and the data itself (a blob of some kind), I
 would definitely break the design up into 2 tables. That way, when
 dealing with the meta-data table (your RAND() query), there is much less
 data that needs to be traversed to get your answer, which should result
 in a faster query.




 [original message snipped]






RE: ORDER BY RAND() performance

2004-03-08 Thread colbey

Donny, what do you do? Throw all the values into an array or something
on the client side, and use a random number generator to pull out the
array elements?

I suppose (depending on result set size) pulling that many rows from server
to client and handling them on the client side could be faster...




On Mon, 8 Mar 2004, Donny Simonton wrote:

 ORDER BY RAND() just sucks, in my opinion. We have created our own internal
 randomization system, because pretty much every time you use ORDER BY RAND()
 it will show up in the slow query log, because of the "Using temporary;
 Using filesort" it does. Splitting your data into a hundred tables will
 still leave it using temporary, using filesort.

 I just did a little test where I only had 5 entries in a table, and I still
 got using temporary, using filesort.

 Will it ever be improved? Probably around the same time ORDER BY ... DESC is
 improved.

 Donny



 [earlier quoted messages snipped]





Re: ORDER BY RAND() performance

2004-03-08 Thread Ray
On Monday 08 March 2004 14:14, Neil Gunton wrote:
 [quoted message snipped]

an alternative to the order by rand() with large record sets is to 
pick a random starting point: LIMIT $randPoint, 30. I don't know if 
it's a viable solution to your situation, but it limits you to 2 
queries (row count, fetch) rather than 30 (fetch 1 x 30)
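Ray's two-query idea might be sketched like this (SQLite standing in for MySQL; the visible_pics name is from the thread, the data is invented). Note the 30 rows come back consecutive from the random offset, not independently random:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visible_pics (pic_id INTEGER PRIMARY KEY, doc_id INTEGER)")
conn.executemany("INSERT INTO visible_pics VALUES (?, ?)",
                 [(i, i % 400) for i in range(1, 1001)])

# Query 1: count the rows.  Query 2: fetch 30 rows from a random offset.
total = conn.execute("SELECT COUNT(*) FROM visible_pics").fetchone()[0]
offset = random.randint(0, max(0, total - 30))
rows = conn.execute("SELECT pic_id, doc_id FROM visible_pics LIMIT 30 OFFSET ?",
                    (offset,)).fetchall()
print(len(rows))  # → 30
```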




RE: ORDER BY RAND() performance

2004-03-08 Thread Donny Simonton
Exactly - it's faster for us to pull all of the data and then randomize it
locally. We have benchmarked it both ways, and the local randomization was
faster. That's if you want multiple records returned.

If you only want one record, what we do is create a random number and then
just do a LIMIT 19345, 1 or something like that.

We have tried another option, which we stopped using, which was creating 30
random numbers and then doing 30 select statements. But that was slower
overall than 1 select with ORDER BY RAND(). One other option that we use
sometimes: say you need 30 results randomized, and you have an
auto-increment in your table. Create 30 random numbers, then do a select
with something like this:

Select * from blabla where lkajsdlkjas IN (10, 43, 22, 8981, etc...)

This works fairly well, but then again, I haven't benchmarked it in a while
and don't really remember how well it works. Actually, I just tried this on
a table with 43 million entries and it took 0.0004 seconds.
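A sketch of that IN-list approach (SQLite standing in for MySQL, with invented names); it assumes a dense auto-increment id, so deleted rows would silently shrink the result:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pics (id INTEGER PRIMARY KEY, doc_id INTEGER)")
conn.executemany("INSERT INTO pics VALUES (?, ?)",
                 [(i, i % 50) for i in range(1, 10001)])

ids = random.sample(range(1, 10001), 30)       # 30 distinct random ids
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    "SELECT id FROM pics WHERE id IN (%s)" % placeholders, ids).fetchall()
print(len(rows))  # → 30 (every sampled id exists in this gap-free table)
```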

Just some ideas.

Donny

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 08, 2004 2:36 PM
 To: Donny Simonton
 Cc: 'Neil Gunton'; 'MySQL'
 Subject: RE: ORDER BY RAND() performance


 Donny, what do you do? Throw all the values into an array or something
 on the client side, and use a random number generator to pull out the
 array elements?

 I suppose (depending on result set size) pulling that many rows from server
 to client and handling them on the client side could be faster...




 [earlier quoted messages snipped]

Re: ORDER BY RAND() performance

2004-03-08 Thread Neil Gunton
Ray wrote:
 an alternative to the order by rand() with large record sets is to
 pick a random starting point limit $randPoint, 30  don't know if
 its a viable solution to your situation, but it limits you to 2
 querys (row count, fetch) rather then the 30 (fetch 1 x 30)

Thanks! I did see this suggested on another forum. However when I tried
it, I found that EXPLAIN wasn't very encouraging. Using this minimal
table:

CREATE TABLE visible_pics (
  pic_id int(10) unsigned NOT NULL default '0',
  doc_id int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (pic_id),
  KEY doc_id (doc_id)
) TYPE=MyISAM;

mysql> explain select * from visible_pics limit 1,1;
+--------------+------+---------------+------+---------+------+-------+-------+
| table        | type | possible_keys | key  | key_len | ref  | rows  | Extra |
+--------------+------+---------------+------+---------+------+-------+-------+
| visible_pics | ALL  | NULL          | NULL |    NULL | NULL | 19633 |       |
+--------------+------+---------------+------+---------+------+-------+-------+
1 row in set (0.00 sec)

mysql> explain select * from visible_pics order by pic_id limit 1,1;
+--------------+-------+---------------+---------+---------+------+-------+-------+
| table        | type  | possible_keys | key     | key_len | ref  | rows  | Extra |
+--------------+-------+---------------+---------+---------+------+-------+-------+
| visible_pics | index | NULL          | PRIMARY |       4 | NULL | 19633 |       |
+--------------+-------+---------------+---------+---------+------+-------+-------+
1 row in set (0.00 sec)

In both cases, the number of rows which will be scanned is close to the
total number of rows. I included the second EXPLAIN to see if using
pic_id (the primary key) would make any difference. It actually seems to
be faster without the index, in my trivial tests:

mysql> select * from visible_pics order by pic_id limit 1,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  11669 |    258 |
+--------+--------+
1 row in set (0.09 sec)

mysql> select * from visible_pics order by pic_id limit 10100,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  11771 |    258 |
+--------+--------+
1 row in set (0.08 sec)

mysql> select * from visible_pics limit 10100,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  11750 |    258 |
+--------+--------+
1 row in set (0.02 sec)

mysql> select * from visible_pics limit 12100,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  14085 |    269 |
+--------+--------+
1 row in set (0.02 sec)

mysql> select * from visible_pics limit 900,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|   1100 |     53 |
+--------+--------+
1 row in set (0.01 sec)

mysql> select * from visible_pics limit 18000,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  20343 |    387 |
+--------+--------+
1 row in set (0.03 sec)

mysql> select * from visible_pics order by pic_id limit 12000,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  13857 |    325 |
+--------+--------+
1 row in set (0.10 sec)

The last one was just to confirm that there wasn't some kind of disk
caching going on that affected the results. The query without the index
was definitely faster. If the average query takes about 0.05 seconds,
and you do 30 of them, that comes to about 1.5 seconds for the whole
thing - which is in fact worse than just doing the ORDER BY RAND()
LIMIT 30 on the same table:

mysql> select * from visible_pics order by rand() limit 30;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|   4149 |     98 |
|   5030 |    148 |
|   1911 |     69 |
|   4258 |    105 |
|  14131 |    170 |
|  17047 |    165 |
|  12643 |    319 |
|  14271 |    180 |
|   1815 |     69 |
|  12768 |    260 |
|   8118 |    164 |
|   2339 |     87 |
|   3058 |     63 |
|   2573 |     46 |
|  11511 |    230 |
|  16939 |    335 |
|   7749 |    113 |
|   6921 |    164 |
|   2106 |     79 |
|   3609 |     91 |
|  12513 |    259 |
|  18169 |    234 |
|  19173 |    372 |
|  11912 |    305 |
|   2026 |     69 |
|   7697 |    222 |
|  20834 |    447 |
|    977 |     53 |
|   1638 |     24 |
|  13986 |    308 |
+--------+--------+
30 rows in set (0.22 sec)

This isn't as simple as it appears at first, however - this is merely
the query to get 30 random pic_id's. I then have to do 30 more queries
to get the real records in the separate table, so that I can build the
HTML page with filenames, captions etc.

Thanks again - this is a good one to know about (for anyone else out
there who is encountering the same issues). But the above tests were on
a very minimal table, with no where clause, because the table was
specially prepared to only contain the relevant records in the first
place.

I am still wondering if I should post my original question about whether
ORDER BY RAND() will be optimized anytime soon to the internals
list... I don't want to 

Re: ORDER BY RAND() performance

2004-03-08 Thread Neil Gunton
Donny Simonton wrote:
 One other option that we use
 sometimes is say you need 30 results randomized, and you have an
 auto-increment in your table.  Create 30 random numbers, then do a select
 with something like this:
 
 Select * from blabla where lkajsdlkjas IN (10, 43, 22, 8981, etc...)
 
 This works fairly well, but then again, I haven't benchmarked it in a while
 and don't really remember how well it works.  Actually, I just tried this on
 a table with 43 million entries and it took 0.0004 seconds.

I was thinking about something similar, but how do you handle cases
where you might have gaps in the auto-increment sequence? For example,
if you delete record 100, and then one of the random numbers you
generate happens to be 100, you will be short 1 record because it
doesn't exist. If records never get deleted from the table then there's
no issue, but in my application it does happen, so gaps will occur. I
have looked around for an easy way to maintain a table with a key that
acts like a position marker, but it doesn't seem to be out there. In
other words, if you had a table with n records, then each record would
have a field which has a value corresponding to the record's position in
the table, from 1 to n. This position can be simply the order the
records were inserted or the order that they exist on the disk - it
doesn't really matter, since this position field would only be used for
quick lookups in random selects anyway. Then, if record 6 is removed,
record 7 would become record 6, record 8 would now be 7 and so on. I
know you can maintain this sort of thing yourself, but it takes work to
maintain consistency, and it would be a nice feature to have. If this
were available then ORDER BY RAND() optimization would be easy, since
you could make the sequence field a primary key and then just do
SELECT ... WHERE sequence IN (...), and it would be very fast. This
could be done internally for ORDER BY RAND(), or you could do the select
yourself, using a better random number generator if you so wish.
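The close-the-gap bookkeeping described above can be sketched as follows (SQLite stands in for MySQL; here seq is kept in a plain index rather than as the primary key, purely to keep the single-statement decrement simple, and all names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq_pics (pic_id INTEGER PRIMARY KEY, seq INTEGER)")
conn.execute("CREATE INDEX seq_idx ON seq_pics(seq)")
conn.executemany("INSERT INTO seq_pics VALUES (?, ?)",
                 [(100 + i, i) for i in range(1, 11)])

def delete_pic(conn, seq):
    conn.execute("DELETE FROM seq_pics WHERE seq = ?", (seq,))
    # Shift every higher sequence number down so seq stays a dense 1..N range.
    conn.execute("UPDATE seq_pics SET seq = seq - 1 WHERE seq > ?", (seq,))

delete_pic(conn, 6)
seqs = [r[0] for r in conn.execute("SELECT seq FROM seq_pics ORDER BY seq")]
print(seqs)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The cost is that deleting one picture rewrites every row above it, which is exactly the maintenance burden the thread is weighing against the ORDER BY RAND() scan.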

Thanks for the suggestions,

-Neil

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]



RE: ORDER BY RAND() performance

2004-03-08 Thread Donny Simonton
Neil,
We never delete from primary tables.  No questions asked!  We just
mark an entry as deleted, and don't select it.

Another option to work around your deletion problem is to select 35 rows,
for example, when you really only want 30.  That way you have extras
if, say, #20 is not available.

There are many options. In some cases we have even created a secondary
table and re-run the ORDER BY RAND() query every 5 minutes to repopulate
it, and then we just select from that secondary table.  So every 5
minutes you have new random items.
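A rough sketch of that oversampling trick in Python/SQLite (the table name, sizes, and the deletion pattern are invented for illustration; the shape is the same against MySQL):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pics (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO pics VALUES (?)", [(i,) for i in range(1, 1001)])
conn.execute("DELETE FROM pics WHERE id % 10 = 0")  # simulate deletion gaps

def pick_random_ids(conn, want, extra=10):
    # Ask for 'want + extra' candidate ids so that candidates lost to
    # deletion still leave enough survivors; retry in the unlucky case.
    max_id = conn.execute("SELECT MAX(id) FROM pics").fetchone()[0]
    while True:
        candidates = random.sample(range(1, max_id + 1), want + extra)
        marks = ",".join("?" * len(candidates))
        cur = conn.execute(
            "SELECT id FROM pics WHERE id IN (%s)" % marks, candidates)
        found = [row[0] for row in cur]
        if len(found) >= want:
            return found[:want]

ids = pick_random_ids(conn, 30)
```

How much `extra` padding you need depends on how many gaps the table has; the retry loop covers the rare shortfall either way.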

Donny




Re: ORDER BY RAND() performance

2004-03-08 Thread Neil Gunton
Donny Simonton wrote:
 
 Neil,
 We never delete from primary tables.  No questions asked!  We just
 mark an entry as deleted, and don't select it.
 
 Another option to work around your deletion problem is to select 35 rows,
 for example, when you really only want 30.  That way you have extras
 if, say, #20 is not available.
 
 There are many options. In some cases we have even created a secondary
 table and re-run the ORDER BY RAND() query every 5 minutes to repopulate
 it, and then we just select from that secondary table.  So every 5
 minutes you have new random items.

Thanks again - lots of great suggestions. I use a caching reverse-proxy
in front of the back-end MySQL/mod_perl Apache processes, which lets me
reduce their load. By setting the expiration time of the web pages
appropriately, I can reduce the number of times the random pics page is
executed, which effectively does what your last suggestion describes, I
think.

I am all in favor of simplifying application code wherever possible, and
I would still love to just use a simple query with ORDER BY RAND() - it
would make things so much more straightforward. So if any of the core
MySQL developers are reading this, please take a look at the original
question and let me know whether there are any plans in the works to make
this more efficient (or whether it's even possible - if it's inherently
difficult then that would be good to know).

Thanks,

-Neil




Re: ORDER BY RAND() performance

2004-03-08 Thread Sasha Pachev
Donny Simonton wrote:
 Neil,
 We never delete from primary tables.  No questions asked!  We just
 mark an entry as deleted, and don't select it.
 Another option to work around your deletion problem is to select 35 rows,
 for example, when you really only want 30.  That way you have extras
 if, say, #20 is not available.
 There are many options. In some cases we have even created a secondary
 table and re-run the ORDER BY RAND() query every 5 minutes to repopulate
 it, and then we just select from that secondary table.  So every 5
 minutes you have new random items.
Another way is to guess a reasonably narrow fixed-width random range over a
column for which you have a key, and do ORDER BY RAND() LIMIT 1 inside it. If
your guess turns out too narrow, double the range and try again until you get
enough records.

The key-range estimation technique is also useful in a number of other
situations, e.g. when paging through search results.
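A sketch of that range-narrowing approach, again in Python/SQLite purely to illustrate (SQLite spells the function RANDOM() where MySQL uses RAND(); the table and the starting width are invented):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pics (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO pics VALUES (?)", [(i,) for i in range(1, 1001)])

def random_row(conn, width=16):
    # Randomize only inside a narrow key range, so the sort touches at
    # most 'width' rows instead of the whole table.  An empty slice
    # (e.g. all its rows deleted) means the guess was too narrow:
    # double the width and try again.
    max_id = conn.execute("SELECT MAX(id) FROM pics").fetchone()[0]
    while True:
        start = random.randint(1, max_id)
        row = conn.execute(
            "SELECT id FROM pics WHERE id >= ? AND id < ? "
            "ORDER BY RANDOM() LIMIT 1", (start, start + width)).fetchone()
        if row is not None:
            return row[0]
        width *= 2

pick = random_row(conn)
```

The range predicate rides the primary-key index, so the filesort that EXPLAIN complains about is confined to a handful of rows.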

--
Sasha Pachev
Create online surveys at http://www.surveyz.com/