ORDER BY RAND() performance
Hi all,

I am using MySQL 4.0.x to run a community website which has (among other things) over 19,000 pictures. There is a page that selects 30 random thumbnails, and I have noticed that ORDER BY RAND() on this table has a significant impact on performance. I have all the relevant indexes defined, and I have researched this issue on the Web; it seems that other people have also encountered a performance hit while using ORDER BY RAND(). The reason appears to be that when you do EXPLAIN on a query using this, MySQL reports "Using temporary; Using filesort", which is the worst possible result. Also, the number of rows reported is pretty much the entire set. So, presumably, the current implementation of ORDER BY RAND() means that MySQL has to traverse the entire table, regardless of other indexes.

There are, of course, other ways to get around this, but they are all more complex than simply using ORDER BY RAND(). I think that selecting a random set of records from a table is something that a lot of websites would like to be able to do, and so as datasets get larger it would be nice to see this function scale well. For anyone who has a website with a large archive of data, the ability to present a random selection of this data is very useful.

I would like to know if anyone knows whether the MySQL team is aware of this problem, and if so whether they are planning on improving it at any point. I ask mainly because if I am told that yes, it'll be much better in version X, then I can live with the couple of seconds that it takes currently, knowing that this will be better down the line. However, if I am advised that this is a fundamentally hard problem for whatever reason, then I will put the effort into reworking my tables to use an alternative solution.
The only real solution that I can see which is fast is to make another table which contains just the unique IDs of the pictures that are visible (there are others which are not publicly visible, and which shouldn't be included in the random query, so making a separate table with the appropriate subset makes sense for performance). This new table will have a numeric sequence field as its primary key: every record gets its own sequence number, running from 1 up to the number of records. Then, instead of doing one query with ORDER BY RAND() LIMIT 30, I can instead do 30 queries, each looking up a different random sequence number (generated from Perl). Since this is a primary-key lookup, each query will be very fast, so doing 30 of them will not have a big performance impact.

However, this scheme requires that the sequence numbers in the new table be kept strictly consistent - for example, if a picture is removed, then the sequence numbers above that record have to be renumbered. This introduces potential for error, but it is a possible solution. I don't want to implement it, obviously, if ORDER BY RAND() is slated for improvement.

Thanks for any ideas or insights...

-Neil Gunton

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
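[Editorial sketch] The dense-sequence-table scheme described above (one COUNT, then 30 primary-key point lookups) can be sketched roughly like this. sqlite3 stands in for MySQL so the snippet is self-contained, and the table and column names are made up for illustration:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visible_pics (seq INTEGER PRIMARY KEY, pic_id INTEGER)")
conn.executemany(
    "INSERT INTO visible_pics VALUES (?, ?)",
    [(i, 1000 + i) for i in range(1, 101)],  # seq runs densely from 1..n
)

(n,) = conn.execute("SELECT COUNT(*) FROM visible_pics").fetchone()
seqs = random.sample(range(1, n + 1), 30)  # 30 distinct random sequence numbers

# Each fetch is a primary-key point lookup, so 30 of them stay cheap.
pics = [
    conn.execute("SELECT pic_id FROM visible_pics WHERE seq = ?", (s,)).fetchone()[0]
    for s in seqs
]
print(len(pics))  # 30
```

The fragile part is exactly the consistency burden noted above: whenever a picture is deleted, every seq value above it has to be renumbered to keep the 1..n range dense.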
Re: ORDER BY RAND() performance
If in fact (it sounds like) you are storing both the pictures' meta-data (name, size, owner, etc.) and the data itself (a blob of some kind), I would definitely break the design up into 2 tables. That way, when dealing with the meta-data table (your RAND() query), there is much less data that needs to be traversed to get your answer, which should result in a faster query.

On Mon, 8 Mar 2004, Neil Gunton wrote:
> [original message snipped]
Re: ORDER BY RAND() performance
[EMAIL PROTECTED] wrote:
> If in fact you are storing the pictures' meta-data (name, size, owner, etc.) and the data (blob of some kind), I would definitely break up the design into 2 tables...

Thanks! This is definitely good advice, but unfortunately it doesn't solve the RAND() slowness. I have been testing with a separate table that ONLY contains the IDs of the pics, and as it grows toward 100,000 records this simple query still gets noticeably slower:

SELECT * FROM visible_pics ORDER BY RAND() LIMIT 30;

where visible_pics has just two numeric ID fields (pic_id and doc_id). It doesn't seem to matter whether I make pic_id a primary key or not. I think I've reduced it to pretty much the minimal case, given that I want a random selection across ALL the records.

I don't know the internals of how MySQL could optimize this sort of thing, but I was thinking that perhaps there is some kind of internal trickery it could do to select random record positions and then fetch those rows very quickly, without having to traverse the entire table. If the table has no varchar fields, then it should be easy (at least in MyISAM) to calculate a record's position from its record number. So I think it *should* in theory be possible to optimize this; I just don't know if anyone has realized that it's an issue, or whether there are plans to do anything about it. Any insights from MySQL internals developers? Or should I be posting this to the internals list?

Thanks again,

-Neil
RE: ORDER BY RAND() performance
ORDER BY RAND() just sucks, in my opinion. We have created our own internal randomization system, because pretty much every time you use it, the query will show up in the slow query log on account of the "using temporary; using filesort" it does. Splitting your data into a hundred tables will still leave it using temporary, using filesort. I just did a little test where I had only 5 entries in a table, and it was still using temporary, using filesort.

Will it ever be improved? Probably around the same time ORDER BY ... DESC is improved.

Donny

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 08, 2004 2:05 PM
To: Neil Gunton
Cc: MySQL
Subject: Re: ORDER BY RAND() performance

> [quoted messages snipped]
RE: ORDER BY RAND() performance
Donny, what do you do? Throw all the values into an array or something on the client side, and use a random number generator to pull out the array elements? I suppose (depending on result-set size) pulling that many rows from server to client and handling them on the client side could be faster...

On Mon, 8 Mar 2004, Donny Simonton wrote:
> [quoted messages snipped]
Re: ORDER BY RAND() performance
On Monday 08 March 2004 14:14, Neil Gunton wrote:
> [quoted message snipped]

An alternative to ORDER BY RAND() with large record sets is to pick a random starting point: LIMIT $randPoint, 30. I don't know if it's a viable solution for your situation, but it limits you to 2 queries (row count, then fetch) rather than 30 (fetch 1 x 30).

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
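[Editorial sketch] Ray's random-offset idea, sketched minimally with sqlite3 standing in for MySQL (MySQL 4.0 would spell the second query LIMIT $randPoint, 30). One caveat: the 30 rows come back contiguous from a random starting point, not independently random:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visible_pics (pic_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO visible_pics VALUES (?)", [(i,) for i in range(1, 201)])

# Query 1: row count.
(total,) = conn.execute("SELECT COUNT(*) FROM visible_pics").fetchone()

# Pick a random starting offset that leaves room for 30 rows.
rand_point = random.randint(0, max(total - 30, 0))

# Query 2: fetch 30 contiguous rows from that offset.
rows = conn.execute(
    "SELECT pic_id FROM visible_pics LIMIT 30 OFFSET ?", (rand_point,)
).fetchall()
print(len(rows))  # 30
```

As the EXPLAIN output later in the thread shows, a large LIMIT offset still makes the server walk past all the skipped rows, so this is cheap only for modest table sizes.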
RE: ORDER BY RAND() performance
Exactly - it's faster for us to pull all of the data and then randomize it locally. We have benchmarked it both ways, and the local randomize was faster. That's if you want multiple records returned. If you only want one record, what we do is create a random number and then just do a LIMIT 19345, 1 or something like that.

We have tried another option, which we stopped using, which was creating 30 random numbers and then doing 30 select statements. But that was slower overall than one select with ORDER BY RAND().

One other option that we use sometimes: say you need 30 results randomized, and you have an auto-increment in your table. Create 30 random numbers, then do a select with something like this:

SELECT * FROM blabla WHERE lkajsdlkjas IN (10, 43, 22, 8981, etc...)

This works fairly well, but then again, I haven't benchmarked it in a while and don't really remember how well it works. Actually, I just tried this on a table with 43 million entries and it took 0.0004 seconds.

Just some ideas.

Donny

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 08, 2004 2:36 PM
To: Donny Simonton
Cc: 'Neil Gunton'; 'MySQL'
Subject: RE: ORDER BY RAND() performance

> [quoted messages snipped]
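[Editorial sketch] Donny's IN-list approach, roughly sketched - sqlite3 stands in for MySQL, and the table and column names are made up (this toy table has no gaps in its ids):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pics (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO pics VALUES (?)", [(i,) for i in range(1, 1001)])

# Generate 30 distinct random ids client-side (Perl or PHP would do the same job).
(max_id,) = conn.execute("SELECT MAX(id) FROM pics").fetchone()
wanted_ids = random.sample(range(1, max_id + 1), 30)

# One indexed lookup per id, all in a single statement.
placeholders = ",".join("?" * len(wanted_ids))
rows = conn.execute(
    "SELECT id FROM pics WHERE id IN (%s)" % placeholders, wanted_ids
).fetchall()
print(len(rows))  # 30 here; fewer if any sampled id had been deleted
```

If the auto-increment sequence has gaps from deleted rows, the IN list can come back short - which is exactly the caveat raised next in the thread.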
Re: ORDER BY RAND() performance
Ray wrote:
> an alternative to the order by rand() with large record sets is to pick a random starting point: LIMIT $randPoint, 30 ... it limits you to 2 queries (row count, fetch) rather than 30

Thanks! I did see this suggested on another forum. However, when I tried it, I found that EXPLAIN wasn't very encouraging. Using this minimal table:

CREATE TABLE visible_pics (
  pic_id int(10) unsigned NOT NULL default '0',
  doc_id int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (pic_id),
  KEY doc_id (doc_id)
) TYPE=MyISAM;

mysql> explain select * from visible_pics limit 1,1;
+--------------+------+---------------+------+---------+------+-------+-------+
| table        | type | possible_keys | key  | key_len | ref  | rows  | Extra |
+--------------+------+---------------+------+---------+------+-------+-------+
| visible_pics | ALL  | NULL          | NULL |    NULL | NULL | 19633 |       |
+--------------+------+---------------+------+---------+------+-------+-------+
1 row in set (0.00 sec)

mysql> explain select * from visible_pics order by pic_id limit 1,1;
+--------------+-------+---------------+---------+---------+------+-------+-------+
| table        | type  | possible_keys | key     | key_len | ref  | rows  | Extra |
+--------------+-------+---------------+---------+---------+------+-------+-------+
| visible_pics | index | NULL          | PRIMARY |       4 | NULL | 19633 |       |
+--------------+-------+---------------+---------+---------+------+-------+-------+
1 row in set (0.00 sec)

In both cases, the number of rows which will be scanned is close to the total number of rows. I included the second EXPLAIN to see if using pic_id (the primary key) would make any difference.
It actually seems to be faster without using the index, in my trivial tests:

mysql> select * from visible_pics order by pic_id limit 1,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  11669 |    258 |
+--------+--------+
1 row in set (0.09 sec)

mysql> select * from visible_pics order by pic_id limit 10100,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  11771 |    258 |
+--------+--------+
1 row in set (0.08 sec)

mysql> select * from visible_pics limit 10100,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  11750 |    258 |
+--------+--------+
1 row in set (0.02 sec)

mysql> select * from visible_pics limit 12100,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  14085 |    269 |
+--------+--------+
1 row in set (0.02 sec)

mysql> select * from visible_pics limit 900,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|   1100 |     53 |
+--------+--------+
1 row in set (0.01 sec)

mysql> select * from visible_pics limit 18000,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  20343 |    387 |
+--------+--------+
1 row in set (0.03 sec)

mysql> select * from visible_pics order by pic_id limit 12000,1;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|  13857 |    325 |
+--------+--------+
1 row in set (0.10 sec)

The last one was just to confirm that there wasn't some kind of disk caching going on that affected the results. The query without the index was definitely faster. If the average query takes about 0.05 seconds and you do 30 of them, that gives about 1.5 seconds for the whole thing.
This is in fact worse than just doing the ORDER BY RAND() LIMIT 30 on the same table:

mysql> select * from visible_pics order by rand() limit 30;
+--------+--------+
| pic_id | doc_id |
+--------+--------+
|   4149 |     98 |
|   5030 |    148 |
|   1911 |     69 |
|   4258 |    105 |
|  14131 |    170 |
|  17047 |    165 |
|  12643 |    319 |
|  14271 |    180 |
|   1815 |     69 |
|  12768 |    260 |
|   8118 |    164 |
|   2339 |     87 |
|   3058 |     63 |
|   2573 |     46 |
|  11511 |    230 |
|  16939 |    335 |
|   7749 |    113 |
|   6921 |    164 |
|   2106 |     79 |
|   3609 |     91 |
|  12513 |    259 |
|  18169 |    234 |
|  19173 |    372 |
|  11912 |    305 |
|   2026 |     69 |
|   7697 |    222 |
|  20834 |    447 |
|    977 |     53 |
|   1638 |     24 |
|  13986 |    308 |
+--------+--------+
30 rows in set (0.22 sec)

This isn't as simple as it appears at first, however - this is merely the query to get 30 random pic_ids. I then have to do 30 more queries to get the real records in the separate table, so that I can build the HTML page with filenames, captions etc.

Thanks again - this is a good one to know about (for anyone else out there who is encountering the same issues). But the above tests were on a very minimal table, with no WHERE clause, because the table was specially prepared to only contain the relevant records in the first place. I am still wondering if I should post my original question about whether ORDER BY RAND() will be optimized anytime soon to the internals list... I don't want to
Re: ORDER BY RAND() performance
Donny Simonton wrote:
> One other option that we use sometimes is say you need 30 results randomized, and you have an auto-increment in your table. Create 30 random numbers, then do a select with something like this: SELECT * FROM blabla WHERE lkajsdlkjas IN (10, 43, 22, 8981, etc...)

I was thinking about something similar, but how do you handle cases where there might be gaps in the auto-increment sequence? For example, if you delete record 100, and one of the random numbers you generate happens to be 100, you will be one record short because it doesn't exist. If records never get deleted from the table, then there's no issue, but in my application deletion does happen, so gaps will occur.

I have looked around for an easy way to maintain a table with a key that acts like a position marker, but it doesn't seem to be out there. In other words, if you had a table with n records, each record would have a field whose value corresponds to the record's position in the table, from 1 to n. This position can simply be the order the records were inserted, or the order they exist on disk - it doesn't really matter, since this position field would only be used for quick lookups in random selects anyway. Then, if record 6 is removed, record 7 would become record 6, record 8 would become 7, and so on. I know you can maintain this sort of thing yourself, but it takes work to keep it consistent, and it would be a nice feature to have. If this were available, then ORDER BY RAND() optimization would be easy, since you could make the sequence field a primary key and just do a select where sequence in (...), and it would be very fast.

This could be done internally for ORDER BY RAND(), or you could do the select yourself, using a better random number generator if you so wish.

Thanks for the suggestions,

-Neil
RE: ORDER BY RAND() performance
Neil, We never delete from primary tables. No questions asked! We would just mark a entry as deleted, and not select from it. Another option you can do to solve your deletion problem is, select 35 rows for example, when you really only want 30. That way, you can have extras, if say #20 is not available. There are many options, we have even in some cases, created a table and run the order by rand query every 5 minutes and just have it update a table. And then we just do a select from that secondary table. So every 5 minutes you have new random items. Donny -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Neil Gunton Sent: Monday, March 08, 2004 3:11 PM To: Donny Simonton Cc: [EMAIL PROTECTED]; 'MySQL' Subject: Re: ORDER BY RAND() performance Donny Simonton wrote: One other option that we use sometimes is say you need 30 results randomized, and you have an auto-increment in your table. Create 30 random numbers, then do a select with something like this: Select * from blabla where lkajsdlkjas IN (10, 43, 22, 8981, etc...) This works fairly well, but then again, I haven't benchmarked it in a while and don't really remember how well it works. Actually, I just tried this on a table with 43 million entries and it took 0.0004 seconds. I was thinking about something similar, but how do you handle cases where you might have gaps in the auto-increment sequence? For example, if you delete record 100, and then one of the random numbers you generate happens to be 100, you will be short 1 record because it doesn't exist. If records never get deleted from the table then there's no issue, but in my application it does happen, so gaps will occur. I have looked around for an easy way to maintain a table with a key that acts like a position marker, but it doesn't seem to be out there. In other words, if you had a table with n records, then each record would have a field which has a value corresponding to the record's position in the table, from 1 to n. 
This position could simply be the order the records were inserted, or the order they exist on disk - it doesn't really matter, since the position field would only be used for quick lookups in random selects anyway. Then, if record 6 is removed, record 7 would become record 6, record 8 would become record 7, and so on. I know you can maintain this sort of thing yourself, but it takes work to keep it consistent, and it would be a nice feature to have. If it were available, then ORDER BY RAND() optimization would be easy: you could make the sequence field a primary key and just do SELECT ... WHERE sequence IN (...), which would be very fast. This could be done internally for ORDER BY RAND(), or you could do the select yourself, using a better random number generator if you wish.

Thanks for the suggestions,

-Neil

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
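The position-marker table Neil describes can be maintained in application code. A common variant avoids the consistency problem he worries about (renumbering every higher record after a delete) by moving the *last* entry into the vacated slot instead, which is O(1) per delete. A minimal in-memory sketch of that bookkeeping (class and method names are hypothetical, not from the thread):

```python
import random

class SequenceIndex:
    """Keeps picture ids densely packed at positions 0..n-1, so a
    uniformly random position always hits a live record (no gaps)."""

    def __init__(self):
        self._ids = []   # position -> picture id
        self._pos = {}   # picture id -> position

    def add(self, pic_id):
        self._pos[pic_id] = len(self._ids)
        self._ids.append(pic_id)

    def remove(self, pic_id):
        # Move the last id into the vacated slot instead of shifting
        # every higher position down by one (O(1) instead of O(n)).
        i = self._pos.pop(pic_id)
        last = self._ids.pop()
        if last != pic_id:
            self._ids[i] = last
            self._pos[last] = i

    def sample(self, k):
        # Uniform sample without replacement; each pick is an O(1)
        # lookup, analogous to 30 primary-key SELECTs on the
        # sequence table.
        return random.sample(self._ids, k)
```

In SQL terms, `remove` would be an UPDATE moving the highest sequence number into the hole, followed by deleting the now-empty last row - two cheap primary-key operations rather than a mass renumbering.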
Re: ORDER BY RAND() performance
Donny Simonton wrote:
> Neil, We never delete from primary tables. No questions asked! We just
> mark an entry as deleted and don't select it. Another option to solve
> your deletion problem is to select, say, 35 rows when you really only
> want 30. That way you have extras if, say, #20 is not available. There
> are many options; in some cases we have even created a secondary table,
> run the ORDER BY RAND() query every 5 minutes, and had it update that
> table. Then we just select from the secondary table, so every 5 minutes
> you get new random items.

Thanks again - lots of great suggestions. I use a caching reverse-proxy front-end web server, which lets me reduce load on the back-end MySQL/mod_perl Apache processes. By setting the expiration time of the web pages appropriately, I can reduce the number of times the random-pics page is executed, which kinda/sorta does what your last suggestion describes, I think. I am all in favor of simplifying application code wherever possible, and I would still love to just use a simple query with ORDER BY RAND() - it would make things SOOO much more straightforward. So if any of the core MySQL developers are reading this, please take a look at the original question and let me know whether there are any plans to make this more efficient (or whether it's even possible - if it's inherently difficult, then that would be good to know).

Thanks,

-Neil
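Donny's over-selection idea quoted above combines naturally with the random-IN trick from earlier in the thread: ask for more candidate ids than you need, so a few hitting deleted rows still leaves enough survivors, and retry if not. A sketch using SQLite in place of MySQL 4.0 (the `pics` table, column names, and the gap pattern are all invented for illustration):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pics (id INTEGER PRIMARY KEY)")
# Simulate gaps left by deleted pictures: every 7th id is missing.
conn.executemany("INSERT INTO pics VALUES (?)",
                 [(i,) for i in range(1, 1001) if i % 7 != 0])

def random_pics(conn, want=30, extra=10):
    """Return `want` distinct live ids, tolerating auto-increment gaps."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM pics").fetchone()
    picked = set()
    while len(picked) < want:
        # Over-select: want+extra candidates per round, so misses on
        # deleted ids usually still leave enough rows in one pass.
        candidates = random.sample(range(1, max_id + 1), want + extra)
        marks = ",".join("?" * len(candidates))
        rows = conn.execute(
            f"SELECT id FROM pics WHERE id IN ({marks})", candidates)
        picked.update(r[0] for r in rows)
    return random.sample(sorted(picked), want)
```

Each round is an indexed IN-list lookup rather than a full-table filesort, which is why Donny's 43-million-row test came back in fractions of a millisecond.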
Re: ORDER BY RAND() performance
Donny Simonton wrote:
> Neil, We never delete from primary tables. No questions asked! We just
> mark an entry as deleted and don't select it. Another option to solve
> your deletion problem is to select, say, 35 rows when you really only
> want 30. That way you have extras if, say, #20 is not available. There
> are many options; in some cases we have even created a secondary table,
> run the ORDER BY RAND() query every 5 minutes, and had it update that
> table. Then we just select from the secondary table, so every 5 minutes
> you get new random items.

Another way is to guess a reasonably narrow fixed-width random range over a column for which you have a key, and do ORDER BY RAND() LIMIT 1 inside it. If the guess is too narrow, double the range and try again until you get enough records. This key-range estimation technique is also useful in a number of other situations, e.g. when paging through search results.

--
Sasha Pachev
Create online surveys at http://www.surveyz.com/
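Sasha's range-guessing approach can be sketched as follows, again with SQLite standing in for MySQL and RANDOM() in place of RAND() (table and function names are hypothetical). The BETWEEN clause confines the filesort to a handful of index entries instead of the whole table, and the window doubles whenever it lands entirely on a run of gaps:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pics (id INTEGER PRIMARY KEY)")
# Simulate gaps left by deleted pictures: every 7th id is missing.
conn.executemany("INSERT INTO pics VALUES (?)",
                 [(i,) for i in range(1, 1001) if i % 7 != 0])

def one_random_pic(conn, width=8):
    """Pick one live id via a narrow keyed range, doubling on a miss."""
    (max_id,) = conn.execute("SELECT MAX(id) FROM pics").fetchone()
    while True:
        start = random.randint(1, max_id)
        # ORDER BY RANDOM() is cheap here: the BETWEEN clause limits
        # the sort to at most width + 1 index entries.
        row = conn.execute(
            "SELECT id FROM pics WHERE id BETWEEN ? AND ? "
            "ORDER BY RANDOM() LIMIT 1", (start, start + width)).fetchone()
        if row:
            return row[0]
        width *= 2  # the window was empty: widen and try again
```

Calling this 30 times gives Neil's thumbnail page; it trades the single full-table filesort for a few tiny indexed range scans, at the cost of sampling that is only approximately uniform when gaps are clustered.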