Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
On 12 Jan 2018, at 10:31am, Eric Grange wrote:

>> This should not be true. You should not be using OFFSET. Your queries
>> should be something like
>
> That was with only the "value" field being indexed (so without a rank
> field), is there a better way than OFFSET in that case ?

Can't think of one. OFFSET is the easiest way to get what you want if you're not storing the values you're searching/sorting on. But as you've seen, SQLite handles OFFSET by computing the whole result list from the first row onward and simply not reporting the first entries. So it's slow.

Simon.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
> But if you have a million rows this could involve a lot of
> number-shuffling and I can see that it might not work out in the real
> world.

Yes. Also, while the value field can change several times per second (thousands of times per second in some cases), it is acceptable to have the ranking updated at a lower frequency and to display somewhat "stale" rankings & values.

> This should not be true. You should not be using OFFSET. Your queries
> should be something like

That was with only the "value" field being indexed (so without a rank field), is there a better way than OFFSET in that case ?

On Tue, Jan 9, 2018 at 6:21 PM, Simon Slavin wrote:

> On 9 Jan 2018, at 11:35am, Eric Grange wrote:
>
> > In both cases, since things are constantly in flux, the absolute rank
> > and neighbors do not really matter (outside the top ranks), but changes
> > in rank are being looked at.
> > i.e. having rank 155k or 154k is not really meaningful in itself, but
> > on the other hand gaining 1000 "ranks" by going from 155k to 154k is
> > what users are after.
>
> Okay, I see what you mean. You have a special, unusual, case where the
> ranks change frequently and you have a genuine need to do things like
> "give me all the ranks from 154k to 155k".
>
> That's a very difficult thing to do quickly. The fastest way to do it
> would be different depending on which you do more: change ranks or query
> ranks.
>
> The canonical SQL way would be that every time a figure changes you would
> change the ranks of all the rows between the two positions. This would
> take only two UPDATE commands each time. The data would be up-to-date all
> the time and therefore queries would be fast. But if you have a million
> rows this could involve a lot of number-shuffling and I can see that it
> might not work out in the real world.
>
> >> But then, if your range queries are based on a rank derived from
> >> value, why not index value directly? You'd still get fast range
> >> queries based on values, no?
> >
> > You get fast value range queries, but rank range queries become slower
> > and slower the farther away you get from the top rank.
>
> This should not be true. You should not be using OFFSET. Your queries
> should be something like
>
>     SELECT * FROM ranked WHERE rank BETWEEN 154000 AND 155000 ORDER BY rank
>
> which should be lightning fast because you have an index on "rank".
>
> Simon.
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
On 9 Jan 2018, at 11:35am, Eric Grange wrote:

> In both cases, since things are constantly in flux, the absolute rank and
> neighbors do not really matter (outside the top ranks), but changes in
> rank are being looked at.
> i.e. having rank 155k or 154k is not really meaningful in itself, but on
> the other hand gaining 1000 "ranks" by going from 155k to 154k is what
> users are after.

Okay, I see what you mean. You have a special, unusual, case where the ranks change frequently and you have a genuine need to do things like "give me all the ranks from 154k to 155k".

That's a very difficult thing to do quickly. The fastest way to do it would be different depending on which you do more: change ranks or query ranks.

The canonical SQL way would be that every time a figure changes you would change the ranks of all the rows between the two positions. This would take only two UPDATE commands each time. The data would be up-to-date all the time and therefore queries would be fast. But if you have a million rows this could involve a lot of number-shuffling and I can see that it might not work out in the real world.

>> But then, if your range queries are based on a rank derived from value,
>> why not index value directly? You'd still get fast range queries based
>> on values, no?
>
> You get fast value range queries, but rank range queries become slower
> and slower the farther away you get from the top rank.

This should not be true. You should not be using OFFSET. Your queries should be something like

    SELECT * FROM ranked WHERE rank BETWEEN 154000 AND 155000 ORDER BY rank

which should be lightning fast because you have an index on "rank".

Simon.
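[Editor's sketch, not part of the original thread.] Simon's rank-range query can be checked with a few lines of Python/sqlite3; the table layout follows the thread's RANKED example, and the data is made up. A BETWEEN on the INTEGER PRIMARY KEY is an index seek, regardless of how deep the range sits:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ranked (rank INTEGER PRIMARY KEY, key INTEGER, value FLOAT)")
con.executemany("INSERT INTO ranked VALUES (?, ?, ?)",
                ((i, i * 7, i * 0.5) for i in range(1, 200001)))

# The rank-range query is a direct seek on the INTEGER PRIMARY KEY:
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM ranked WHERE rank BETWEEN 154000 AND 155000 ORDER BY rank"
).fetchall()
print(plan)  # shows a SEARCH (index seek), not a full-table SCAN

rows = con.execute(
    "SELECT * FROM ranked WHERE rank BETWEEN 154000 AND 155000 ORDER BY rank"
).fetchall()
print(len(rows))  # → 1001
```

The equivalent `LIMIT 10 OFFSET 154000` on the value index walks 154000 index entries before producing anything, which is Eric's complaint earlier in the thread.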
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
If you were going to do this entirely in memory (perhaps in C, or some similar language), you would likely use some tree structure where each node keeps track of the number of descendants (direct and indirect) of that node. That allows the operations you describe to occur in O(log(N)) time. Single-record insert/delete/update has the same time complexity.

It is likely that for your tree, every node would have the same structure (struct in C), or else every internal node would have one structure and every leaf node would have another. Now given a bunch of objects with the same structure, you can easily store them in a relational database, rather than in memory, and perform similar operations on them. A collection of struct instances turns into a table, and a C pointer turns into a row-id (or similar).

This isn't entirely free, of course. In C we think of a pointer dereference as occurring in constant time, while in a database a key lookup is typically log(N) time; but still, your log(N) in-memory solution becomes a log-squared(N) database solution, and that is usually fast enough.

Of course you lose some of the database "convenience": you're essentially implementing trees close to those that come already "free" in sqlite. Likewise, some simple SQL queries turn into something more complex (since you need to maintain your tree). At least you still get the ACID benefits.

If I google "counting tree in sqlite" I see some hits that, perhaps, already do this kind of thing.

Regards,
Bill

-----Original Message-----
From: Eric Grange [mailto:zar...@gmail.com]
Sent: Tuesday, January 9, 2018 3:51
To: General Discussion of SQLite Database
Subject: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
Hi,

I have a problem where I have a large set of (key, value) pairs which I want to sort by value, and then store in a table having (rank, key, value) fields, so that for a given key I can quickly find the rank, or for a given rank range I can quickly list the keys & values.

Since there is no ROW_NUMBER() function, but there is an autoincrement feature, and the ranks are numbered 1, 2, 3 etc., the strategy I have been using is to create the ranked table like

    CREATE TABLE RANKED (
        RANK INTEGER PRIMARY KEY AUTOINCREMENT,
        KEY INTEGER,
        VALUE FLOAT
    )

(+ an index for the key) and then I fill that table with something like

    INSERT INTO RANKED (KEY, VALUE)
        SELECT key, value
        FROM ...something rather complex and big...
        ORDER BY value DESC

This works well enough, but as the amount of values to be ranked increases, it feels wasteful to delete everything and then re-insert everything just to adjust the RANK column. I am also running into memory issues, as the ORDER BY requires a temporary b-tree which runs into the gigabyte range in some instances.

I have ways to maintain the KEY and VALUE fields individually and incrementally, but the approaches I have tried to maintain the RANK with UPDATE queries ran much slower than deleting and recreating everything, though this could just be bad implementations on my part.

Are there any other strategies I could use that could update just the RANK field and mitigate the temporary B-tree size?

Eric
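[Editor's sketch, not from the thread.] One concrete realization of the counting tree Bill describes is a Fenwick (binary indexed) tree over a bounded value domain: O(log N) to insert or remove a value, O(log N) to ask "how many stored values are below v", which is exactly a rank query. The class and names below are illustrative:

```python
# Counting structure: a Fenwick (binary indexed) tree over values
# 1..max_value. add()/count_below() both run in O(log N).

class RankCounter:
    def __init__(self, max_value):
        self.size = max_value
        self.tree = [0] * (max_value + 1)

    def add(self, value, delta=1):
        # delta=+1 inserts an occurrence of `value`, delta=-1 removes one
        while value <= self.size:
            self.tree[value] += delta
            value += value & (-value)

    def count_below(self, value):
        # number of stored values strictly less than `value`
        total, i = 0, value - 1
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total

rc = RankCounter(1000)
for v in (5, 42, 42, 900):
    rc.add(v)
print(rc.count_below(43))  # → 3  (5, 42, 42)
rc.add(42, -1)
print(rc.count_below(43))  # → 2
```

Persisting the `tree` array as a table keyed by index gives the "tree in a database" variant Bill outlines, at the cost of one key lookup per tree step.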
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
Analogous to the percentile solution (it's actually the same thing), you can use a checkpointing table. This has roughly SQRT(n) complexity for both read and write.

I.e. say you expect to have 1M records and define order based on value then id. You then make a checkpoint table (first_rank, value, rec_id) holding every 1000th record in sorted order. So row 1 of the checkpoint table corresponds to the 1000th sorted record. When you insert/delete a row, you only need to update the checkpoints that come after said row.

When you are searching for row 4521, you do something like:

    SELECT table.* FROM table JOIN checkpoint
    WHERE ( (table.value = checkpoint.value AND table.id >= checkpoint.id)
            OR table.value > checkpoint.value )
      AND checkpoint.first_rank = 4500
    ORDER BY table.value ASC, table.id ASC
    LIMIT 21, 1

--
Sent from: http://sqlite.1065341.n5.nabble.com/
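[Editor's sketch, not from the thread.] The checkpoint scheme runs end-to-end in a few lines of Python/sqlite3. Table names, a checkpoint spacing of 100, 0-based ranks, and the synthetic data are all mine; the seek query mirrors the one above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, value FLOAT)")
con.execute("CREATE INDEX data_value ON data (value, id)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                ((i, (i * 37) % 997) for i in range(1, 2001)))

# Checkpoint every 100th row of the (value, id) sort order (0-based ranks).
con.execute("CREATE TABLE checkpoint (first_rank INTEGER, value FLOAT, rec_id INTEGER)")
ordered = con.execute("SELECT id, value FROM data ORDER BY value, id").fetchall()
con.executemany("INSERT INTO checkpoint VALUES (?, ?, ?)",
                ((rn, v, rid) for rn, (rid, v) in enumerate(ordered) if rn % 100 == 0))

def row_at_rank(rank):
    # Seek to the nearest preceding checkpoint, then step forward at
    # most 99 rows; never a full-table OFFSET.
    first = (rank // 100) * 100
    return con.execute("""
        SELECT data.id, data.value FROM data JOIN checkpoint
        WHERE ((data.value = checkpoint.value AND data.id >= checkpoint.rec_id)
               OR data.value > checkpoint.value)
          AND checkpoint.first_rank = ?
        ORDER BY data.value, data.id
        LIMIT 1 OFFSET ?""", (first, rank - first)).fetchone()

print(row_at_rank(1234) == ordered[1234])  # → True
```

An insert or delete only has to shift the `first_rank` (or stored row) of the checkpoints after the affected position, which is the SQRT(n)-ish write cost mentioned above.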
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
The only way to do this efficiently would be to have counting (range) index b-trees. Since you don't, you're stuck with an O(n) implementation, either on reading or on writing. So your solution is about as good as it gets, save maybe some implementation particularities.

However, you may consider a shift in perspective: with this kind of data, statisticians use percentiles. That is, instead of querying for ranks 21,000-22,000, you query for "top 1%", "6-8%" etc., based on either value or rank; this way, you can maintain a percentile rank table as granular as you like (i.e. every 1% results in a table with 100 lines). Each line would have a count, a min value and a max value. Such a table is much faster to update, and if you then need to retrieve the actual records, you query by range (value BETWEEN min AND max) joined with the percentile table.
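[Editor's sketch, not from the thread.] The percentile table might look like this in Python/sqlite3; the bucket count (100), names, and data are mine, and the buckets are rebuilt from scratch here rather than maintained incrementally:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (key INTEGER PRIMARY KEY, value FLOAT)")
con.execute("CREATE INDEX data_value ON data (value)")
con.executemany("INSERT INTO data VALUES (?, ?)",
                ((i, (i * 57) % 3001) for i in range(1, 3001)))

# One row per percentile: its row count and its value range.
con.execute("CREATE TABLE pct (bucket INTEGER PRIMARY KEY, cnt INTEGER, vmin FLOAT, vmax FLOAT)")
values = [v for (v,) in con.execute("SELECT value FROM data ORDER BY value DESC")]
n = len(values)
for b in range(100):
    chunk = values[b * n // 100:(b + 1) * n // 100]
    con.execute("INSERT INTO pct VALUES (?, ?, ?, ?)",
                (b, len(chunk), min(chunk), max(chunk)))

# Retrieve the actual records of the top percentile via the value index:
top = con.execute("""
    SELECT data.* FROM data JOIN pct ON data.value BETWEEN pct.vmin AND pct.vmax
    WHERE pct.bucket = 0 ORDER BY data.value DESC""").fetchall()
print(len(top))  # → 30 (no ties in this synthetic data)
```

With ties straddling a bucket boundary, the BETWEEN join can return a few extra rows, which is usually acceptable for "top 1%"-style queries.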
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
On Tue, Jan 9, 2018 at 12:35 PM, Eric Grange wrote:

> > But then, if your range queries are based on a rank derived from value,
> > why not index value directly? You'd still get fast range queries based
> > on values, no?
>
> You get fast value range queries, but rank range queries become slower
> and slower the farther away you get from the top rank.

As I wrote, that can be (greatly IMHO) mitigated by having a partial mapping from values to ranks, across the value range, which allows you to restrict your query to a value range that's larger than the exact rank range you want, but still narrow enough to be fast.

Another thought is that you may be able to derive that partial mapping from the stats ANALYZE [1] extracts and stores in [2] or [3], if you don't want to do it in your app (stats on the value column of your original table, not the ranked "derived" one).

Also, I can't find the name right now, and thus no link, sorry, but SQLite has a way to "mount" an index as a table, if I recall correctly (mostly for troubleshooting). I'm not sure that's super useful here, since you'd probably need access to low-level b-tree info to estimate ranks from it; the stat(1|3|4) table info is a better avenue, I think. --DD

[1] https://www.sqlite.org/lang_analyze.html
[2] https://www.sqlite.org/fileformat2.html#stat3tab
[3] https://www.sqlite.org/fileformat2.html#stat4tab
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
> Do you actually have a need to find the 4512th rank ?

Yes, this is used to display and check on "neighbors", i.e. keys with similar rank. The other very common query is to show the rank for a given key.

In both cases, since things are constantly in flux, the absolute rank and neighbors do not really matter (outside the top ranks), but changes in rank are being looked at.
i.e. having rank 155k or 154k is not really meaningful in itself, but on the other hand gaining 1000 "ranks" by going from 155k to 154k is what users are after.

> 3) In your programming language, number the rows by doing
>
>    SELECT KEY FROM RANKED ORDER BY VALUE
>
>    and for each row returned doing an UPDATE for the RANK value.
>    It'll be slower, but it won't take up the huge chunk of memory needed
>    to keep that index in memory.

Yes, that's what I tried, but it is indeed much slower than deleting the whole table and inserting everything all at once, because as soon as one key moves from the top tier to the bottom (or vice versa), you have to touch basically all records, and doing so with many updates wrecks the performance completely.

> But then, if your range queries are based on a rank derived from value,
> why not index value directly? You'd still get fast range queries based
> on values, no?

You get fast value range queries, but rank range queries become slower and slower the farther away you get from the top rank.

Also, while most end-user queries are for the top 100 / top 1000, those results are cached and only infrequently hit the db (so even if finding the top 1000 were slow and inefficient, it would not really matter). In practice, the queries that really hit the db are for "random" keys far from the top 1000.

Eric

On Tue, Jan 9, 2018 at 11:44 AM, Dominique Devienne wrote:

> On Tue, Jan 9, 2018 at 11:26 AM, Eric Grange wrote:
>
> > So the order by is used to control the insertion order, so that the
> > RANK autoinc primary key ends up with natural rank order
>
> But then, if your range queries are based on a rank derived from value,
> why not index value directly? You'd still get fast range queries based
> on values, no? That probably forces your app to maintain a mapping from
> "rank" to values, but you don't need to know all ranks, just the few
> necessary to know where to "page" from, no? (assuming I'm guessing what
> you use rank for correctly).
>
> SQLite then maintains that index automatically, no? I'm probably missing
> something though. Sounds too simple... --DD
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
On Tue, Jan 9, 2018 at 11:26 AM, Eric Grange wrote:

> So the order by is used to control the insertion order, so that the RANK
> autoinc primary key ends up with natural rank order

But then, if your range queries are based on a rank derived from value, why not index value directly? You'd still get fast range queries based on values, no? That probably forces your app to maintain a mapping from "rank" to values, but you don't need to know all ranks, just the few necessary to know where to "page" from, no? (assuming I'm guessing what you use rank for correctly).

SQLite then maintains that index automatically, no? I'm probably missing something though. Sounds too simple... --DD
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
On 9 Jan 2018, at 10:26am, Eric Grange wrote:

> You mean using limit / offset instead ?
>
> Even with an index on the VALUE column, queries like
>
>    select * from ranked
>    order by value
>    limit 10 offset xxx
>
> become very slow when xxx is great

Yeah, to do OFFSET SQLite has to do the whole SELECT; it just throws away the first few entries without reporting them to you.

Do you actually have a need to find the 4512th rank ? I was expecting you to just need to list the rows in rank order, meaning you would never need OFFSET.

Okay, if you do need to do things like find the 4512th rank, and you want to be able to insert the values without taking the great amount of memory needed for the ORDER BY:

1) declare an index on the VALUE column (fastest to do the INSERT first, then create the index, but it might not matter for your data)

2) do the INSERT without ORDER BY

3) In your programming language, number the rows by doing

   SELECT KEY FROM RANKED ORDER BY VALUE

   and for each row returned doing an UPDATE for the RANK value. It'll be slower, but it won't take up the huge chunk of memory needed to keep that index in memory.

Simon.
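[Editor's sketch, not from the thread.] Simon's three steps, in Python/sqlite3, with the per-row UPDATEs batched in one transaction (which in practice is what keeps them bearable). Names follow the thread's RANKED example; the data is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ranked (rank INTEGER, key INTEGER PRIMARY KEY, value FLOAT)")

# 2) INSERT without ORDER BY (rank left NULL for now)
con.executemany("INSERT INTO ranked (key, value) VALUES (?, ?)",
                ((k, (k * 31) % 101) for k in range(1, 1001)))

# 1) index on VALUE, created after the bulk insert
con.execute("CREATE INDEX ranked_value ON ranked (value)")

# 3) number the rows in value order, one UPDATE per row, inside a
#    single transaction; fetchall() so we don't update while iterating
with con:
    keys = con.execute("SELECT key FROM ranked ORDER BY value DESC, key").fetchall()
    for rank, (key,) in enumerate(keys, start=1):
        con.execute("UPDATE ranked SET rank = ? WHERE key = ?", (rank, key))

print(con.execute("SELECT key FROM ranked WHERE rank = 1").fetchone())  # → (13,)
```

This trades the big ORDER BY b-tree for N indexed UPDATEs, which is exactly the trade-off Eric reports on in his reply.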
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
Isn't this really a repeat of this thread...

http://sqlite.1065341.n5.nabble.com/how-into-insert-row-into-middle-of-table-with-integer-primary-key-td98629.html

The result of which was: don't try to use the table row order to sort your data. Add a column that defines your sort order and do the sorting on output, not input. I rather liked Jens' solution of using a string to define the sort order (top of the second page of the thread).

Andy Ling

-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On Behalf Of Eric Grange
Sent: Tue 09 January 2018 10:26
To: SQLite mailing list
Subject: [External] Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

You mean using limit / offset instead ?

Even with an index on the VALUE column, queries like

    select * from ranked
    order by value
    limit 10 offset xxx

become very slow when xxx is great, while

    select * from ranked
    where rank between xxx and xxx+9
    order by rank

are fast regardless of the value of xxx.

Similarly, finding the rank of a key becomes sluggish for keys that are not in the top.

So the ORDER BY is used to control the insertion order, so that the RANK autoinc primary key ends up in natural rank order.

On Tue, Jan 9, 2018 at 10:59 AM, Simon Slavin <slav...@bigfraud.org> wrote:

> On 9 Jan 2018, at 9:50am, Eric Grange <zar...@gmail.com> wrote:
>
> > then I fill that table with something like
> >
> > INSERT INTO RANKED
> >     SELECT key, value
> >     FROM ...something rather complex and big...
> >     ORDER BY value desc
> >
> > This works well enough, but as the amount of values to be ranked
> > increases, this feels wasteful to delete everything and then re-insert
> > everything just to adjust the RANK column, also I am running into
> > memory issues as the ORDER BY requires a temporary b-tree which runs
> > into the gigabyte range in some instances.
>
> The ORDER BY clause serves no useful value here. Leave it out. Do your
> sorting when you query the table.
>
> Simon.
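[Editor's sketch, not Jens' actual code from the linked thread.] The "string sort order" idea Andy points to works by generating a key that sorts lexicographically between two neighbors, so inserting a row never renumbers any other row. The helper below is a hedged illustration (a production version must also cope with bounds that end in the minimal digit 'a'; see write-ups on "fractional indexing"):

```python
# Place a row between two existing rows by generating a string key
# that sorts strictly between their keys ('' means an open end).

def key_between(lo, hi):
    key = ""
    i = 0
    while True:
        lo_c = ord(lo[i]) if i < len(lo) else ord('a') - 1
        hi_c = ord(hi[i]) if i < len(hi) else ord('z') + 1
        if hi_c - lo_c > 1:                  # room for a midpoint digit
            return key + chr((lo_c + hi_c) // 2)
        # adjacent digits: copy lo's digit and look one position deeper
        key += chr(lo_c) if lo_c >= ord('a') else 'a'
        i += 1

k1 = key_between("", "")      # first key ever
k2 = key_between(k1, "")      # a key after k1
mid = key_between(k1, k2)     # a key between them
print(k1 < mid < k2)  # → True
```

Sorting on such a TEXT column with an ordinary index gives stable ordering on output, which is the thread's recommendation: define order in a column, not in the table's physical row order.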
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
You mean using limit / offset instead ?

Even with an index on the VALUE column, queries like

    select * from ranked
    order by value
    limit 10 offset xxx

become very slow when xxx is great, while

    select * from ranked
    where rank between xxx and xxx+9
    order by rank

are fast regardless of the value of xxx.

Similarly, finding the rank of a key becomes sluggish for keys that are not in the top.

So the ORDER BY is used to control the insertion order, so that the RANK autoinc primary key ends up in natural rank order.

On Tue, Jan 9, 2018 at 10:59 AM, Simon Slavin wrote:

> On 9 Jan 2018, at 9:50am, Eric Grange wrote:
>
> > then I fill that table with something like
> >
> > INSERT INTO RANKED
> >     SELECT key, value
> >     FROM ...something rather complex and big...
> >     ORDER BY value desc
> >
> > This works well enough, but as the amount of values to be ranked
> > increases, this feels wasteful to delete everything and then re-insert
> > everything just to adjust the RANK column, also I am running into
> > memory issues as the ORDER BY requires a temporary b-tree which runs
> > into the gigabyte range in some instances.
>
> The ORDER BY clause serves no useful value here. Leave it out. Do your
> sorting when you query the table.
>
> Simon.
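[Editor's sketch, not from the thread.] Eric's autoincrement trick end-to-end in Python/sqlite3, with made-up data. Note the scheme leans on SQLite honoring the ORDER BY of the INSERT ... SELECT when assigning autoincrement values, which is what "the ORDER BY controls the insertion order" relies on:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (key INTEGER PRIMARY KEY, value FLOAT)")
con.executemany("INSERT INTO source VALUES (?, ?)",
                ((k, (k * 7919) % 1000) for k in range(1, 501)))

con.execute("""CREATE TABLE ranked (
    rank INTEGER PRIMARY KEY AUTOINCREMENT,
    key INTEGER,
    value FLOAT)""")
con.execute("CREATE INDEX ranked_key ON ranked (key)")

# The ORDER BY controls insertion order, so the autoincrement rank
# column ends up numbering rows in descending-value order.
con.execute("""INSERT INTO ranked (key, value)
               SELECT key, value FROM source ORDER BY value DESC, key""")

asc = con.execute("SELECT value FROM ranked ORDER BY rank").fetchall()
print(all(a >= b for (a,), (b,) in zip(asc, asc[1:])))  # → True
```

Rank-range and key-rank lookups are then index seeks, which is why the thread's BETWEEN queries stay fast; the downside, as discussed, is that the whole table must be rebuilt to refresh the ranks.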
Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?
On 9 Jan 2018, at 9:50am, Eric Grange wrote:

> then I fill that table with something like
>
> INSERT INTO RANKED
>     SELECT key, value
>     FROM ...something rather complex and big...
>     ORDER BY value desc
>
> This works well enough, but as the amount of values to be ranked
> increases, this feels wasteful to delete everything and then re-insert
> everything just to adjust the RANK column, also I am running into memory
> issues as the ORDER BY requires a temporary b-tree which runs into the
> gigabyte range in some instances.

The ORDER BY clause serves no useful purpose here. Leave it out. Do your sorting when you query the table.

Simon.