Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-12 Thread Simon Slavin
On 12 Jan 2018, at 10:31am, Eric Grange  wrote:
> 
>> This should not be true.  You should not be using OFFSET.  Your queries
>> should be something like
> 
> That was with only the "value" field being indexed (so without a rank
> field), is there a better way than OFFSET in that case ?

Can’t think of one.  OFFSET is the easiest way to get what you want if you’re 
not storing the values you’re searching/sorting on.  But as you’ve seen, SQLite 
implements OFFSET by computing the whole result list from the first row and 
simply not reporting the first entries.  So it’s slow.

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-12 Thread Eric Grange
> But if you have a million rows this could involve a lot of number-shuffling
> and I can see that it might not work out in the real world.

Yes. Also, while the value field can change several times per second
(thousands of times per second in some cases), it is acceptable to update the
ranking at a lower frequency and display somewhat "stale" ranks & values.

> This should not be true.  You should not be using OFFSET.  Your queries
should be something like

That was with only the "value" field being indexed (so without a rank
field), is there a better way than OFFSET in that case ?


On Tue, Jan 9, 2018 at 6:21 PM, Simon Slavin  wrote:

> On 9 Jan 2018, at 11:35am, Eric Grange  wrote:
>
> > In both cases, since things are constantly in flux, the absolute rank and
> > neighbor do not really matter
> > (outside the top ranks), but changes in rank are being looked at.
> > i.e. having rank 155k or 154k is not really meaningful in itself, but on
> > the other hand
> > gaining 1000 "ranks" by going from 155k to 154k is what users are after.
>
> Okay, I see what you mean.  You have a special, unusual, case where the
> ranks change frequently and you have a genuine need to do things like "give
> me all the ranks from 154k to 155k".
>
> That’s a very difficult thing to do quickly.  The fastest way to do it
> would be different depending on which you do more: change ranks or query
> ranks.
>
> The canonical SQL way would be that every time a figure changes you would
> change the ranks of all the rows between the two positions.  This would
> take only two UPDATE commands each time.  The data would be up-to-date all
> the time and therefore queries would be fast.  But if you have a million
> rows this could involve a lot of number-shuffling and I can see that it
> might not work out in the real world.
>
> >> But then, if your range queries are based on a rank derived from value,
> why
> >> not index value directly? You'd still get fast range queries based on
> values, no?
> >
> > You get fast value range queries, but rank range queries become slower
> and
> > slower the farther away you get from the top rank.
>
> This should not be true.  You should not be using OFFSET.  Your queries
> should be something like
>
> SELECT * FROM ranked WHERE rank BETWEEN 154000 AND 155000 ORDER BY
> rank
>
> which should be lightning fast because you have an index on "rank".
>
> Simon.


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Simon Slavin
On 9 Jan 2018, at 11:35am, Eric Grange  wrote:

> In both cases, since things are constantly in flux, the absolute rank and
> neighbor do not really matter
> (outside the top ranks), but changes in rank are being looked at.
> i.e. having rank 155k or 154k is not really meaningful in itself, but on
> the other hand
> gaining 1000 "ranks" by going from 155k to 154k is what users are after.

Okay, I see what you mean.  You have a special, unusual, case where the ranks 
change frequently and you have a genuine need to do things like "give me all 
the ranks from 154k to 155k".

That’s a very difficult thing to do quickly.  The fastest way to do it would be 
different depending on which you do more: change ranks or query ranks.

The canonical SQL way would be that every time a figure changes you would 
change the ranks of all the rows between the two positions.  This would take 
only two UPDATE commands each time.  The data would be up-to-date all the time 
and therefore queries would be fast.  But if you have a million rows this could 
involve a lot of number-shuffling and I can see that it might not work out in 
the real world.

>> But then, if your range queries are based on a rank derived from value, why
>> not index value directly? You'd still get fast range queries based on 
>> values, no?
> 
> You get fast value range queries, but rank range queries become slower and
> slower the farther away you get from the top rank.

This should not be true.  You should not be using OFFSET.  Your queries should 
be something like

SELECT * FROM ranked WHERE rank BETWEEN 154000 AND 155000 ORDER BY rank

which should be lightning fast because you have an index on "rank".
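A quick sketch of why this is fast, using Python's sqlite3 with invented data
(table and column names follow the thread; the 200k-row fill is arbitrary).
Because RANK is the INTEGER PRIMARY KEY, it is the table's rowid, so a BETWEEN
on it is a direct b-tree seek rather than a scan:

```python
import sqlite3

# Toy recreation of the RANKED table; data is invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE RANKED (RANK INTEGER PRIMARY KEY, KEY INTEGER, VALUE FLOAT)")
db.executemany("INSERT INTO RANKED VALUES (?, ?, ?)",
               [(r, 1000 + r, 1.0 / r) for r in range(1, 200_001)])

# RANK is the rowid, so this range query seeks straight to rank 154000
# in the table b-tree; no scan of the preceding 153999 rows.
rows = db.execute("SELECT KEY, VALUE FROM RANKED"
                  " WHERE RANK BETWEEN 154000 AND 155000 ORDER BY RANK").fetchall()
print(len(rows))  # 1001 rows, ranks 154000..155000 inclusive
```

The same query with `OFFSET 153999` would have to walk every preceding row.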

Simon.


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Wade, William
If you were going to do this entirely in memory (perhaps in C, or some similar 
language), you would likely use some tree structure where each node keeps track 
of the number of descendants (direct and indirect) of that node. That allows 
the operations you describe to occur in O(log(N)) time. Single-record 
insert/delete/update has the same time complexity.

It is likely that for your tree, every node would have the same structure 
(struct in C), or else every internal node would have one structure, and every 
leaf node would have another structure.

Now given a bunch of objects with the same structure, you can easily store them 
in a relational database, rather than in memory, and perform similar operations 
on them. A collection of struct instances turns into a table, and a C pointer 
turns into a row-id (or similar). This isn't entirely free, of course. In C we 
think of a pointer dereference as occurring in constant time, while in a 
database, a key lookup typically takes O(log N) time. Still, your O(log N) 
in-memory solution becomes an O(log^2 N) database solution, and that is 
usually fast enough.
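The counting-tree idea can be sketched as follows. This is a minimal,
unbalanced illustration with made-up names (a real implementation would use a
balanced tree, e.g. AVL or red-black, to actually guarantee O(log N)):

```python
# Each node stores the size of its subtree, so rank queries walk one
# root-to-leaf path instead of counting all smaller elements.

class Node:
    __slots__ = ("value", "left", "right", "size")
    def __init__(self, value):
        self.value, self.left, self.right, self.size = value, None, None, 1

def insert(node, value):
    """Unbalanced BST insert, updating subtree sizes on the way down."""
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    node.size += 1
    return node

def rank(node, value):
    """Number of stored values strictly less than `value`."""
    r = 0
    while node is not None:
        if value <= node.value:
            node = node.left
        else:
            # Everything in the left subtree, plus this node, is smaller.
            r += 1 + (node.left.size if node.left else 0)
            node = node.right
    return r

root = None
for v in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, v)
print(rank(root, 60))  # 4 values are < 60: 20, 30, 40, 50
```

Select-by-rank (the inverse operation) uses the same size fields, descending
left or right depending on how the target rank compares to the left subtree
size.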

Of course you lose some of the database "convenience". You're essentially 
implementing trees which are close to those you already get "free" in sqlite. 
Likewise, some simple SQL queries turn into something more complex (since you 
need to maintain your tree). At least you still get the ACID benefits.

If I google "counting tree in sqlite" I see some hits that, perhaps, already do 
this kind of thing.

Regards,
Bill

-Original Message-
From: Eric Grange [mailto:zar...@gmail.com]
Sent: Tuesday, January 9, 2018 3:51
To: General Discussion of SQLite Database 
Subject: [sqlite] Efficient ways to maintaining order rank / row_number() in a 
rather large set ?

Hi,

I have a problem where I have a large set of (key, value) which I want to sort 
by value, and then store in a table having (rank, key, value) fields, so that 
for a given key I can quickly find the rank, or for a given rank range, I can 
quickly list the keys & values.

Since there is no ROW_NUMBER() function, but there is an autoincrement 
feature, and the ranks are numbered 1, 2, 3 etc., the strategy I have been 
using is to create a RANKED table like

   CREATE TABLE RANKED (
      RANK INTEGER PRIMARY KEY AUTOINCREMENT,
      KEY INTEGER,
      VALUE FLOAT
   )

(+ an index for the key)

and then I fill that table with something like

   INSERT INTO RANKED (KEY, VALUE)
  SELECT key, value
  FROM ...something rather complex and big...
  ORDER BY value desc

This works well enough, but as the amount of values to be ranked increases, 
it feels wasteful to delete everything and then re-insert everything just to 
adjust the RANK column. I am also running into memory issues, as the ORDER BY 
requires a temporary b-tree which runs into the gigabyte range in some 
instances.

I have ways to maintain the KEY and VALUE fields individually and 
incrementally, but the approaches I have tried to maintain the RANK with 
UPDATE queries ran much slower than deleting and recreating everything, 
though this could just be a bad implementation on my part.

Are there any other strategies I could use that could update just the RANK 
field and mitigate the temporary B-tree size?
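For reference, the rebuild strategy described above can be sketched end to end
with Python's sqlite3, using a made-up SOURCE table standing in for the
"rather complex and big" SELECT:

```python
import sqlite3

# Minimal recreation of the delete-and-rebuild approach from the post.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE SOURCE (KEY INTEGER PRIMARY KEY, VALUE FLOAT);
    CREATE TABLE RANKED (
        RANK INTEGER PRIMARY KEY AUTOINCREMENT,
        KEY INTEGER,
        VALUE FLOAT
    );
    CREATE INDEX RANKED_KEY ON RANKED (KEY);
""")
db.executemany("INSERT INTO SOURCE VALUES (?, ?)",
               [(k, (k * 3.7) % 100) for k in range(1, 1001)])

# Full rebuild: the ORDER BY controls insertion order, so AUTOINCREMENT
# assigns RANK = 1, 2, 3, ... in descending value order.
db.executescript("""
    DELETE FROM RANKED;
    DELETE FROM sqlite_sequence WHERE name = 'RANKED';  -- reset AUTOINCREMENT
    INSERT INTO RANKED (KEY, VALUE)
        SELECT KEY, VALUE FROM SOURCE ORDER BY VALUE DESC;
""")
top = db.execute("SELECT RANK, VALUE FROM RANKED ORDER BY RANK LIMIT 3").fetchall()
```

Note the sqlite_sequence reset: without it, AUTOINCREMENT continues from the
previous high-water mark after the DELETE, so the ranks would not restart at 1.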

Eric




Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Dinu
Analogous to the percentile solution (it's actually the same thing), you can
use a checkpointing table. This has roughly O(sqrt(n)) complexity for both
read and write.

E.g., say you expect to have 1M records and define order based on value then
id.

You then make a checkpoint table (first_rank, value, rec_id) holding every
1000th record in sorted order. So row 1 of the checkpoint table corresponds to
the 1000th sorted record.

When you insert/delete a row, you only need to update checkpoints that come
after said row.

When you are searching for row 4521, you do something like:

SELECT table.*
FROM
    table
JOIN
    checkpoint
WHERE
    (
        (table.value = checkpoint.value AND table.id >= checkpoint.rec_id) OR
        table.value > checkpoint.value
    ) AND
    checkpoint.first_rank = 4500
ORDER BY
    table.value ASC, table.id ASC
LIMIT 21,1
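A runnable sketch of the scheme, on a toy table with checkpoints built in one
sorted pass (an incremental version would adjust them on insert/delete). All
names apart from first_rank/value/rec_id are made up:

```python
import sqlite3

# Toy table ordered by (value, id), with a checkpoint every 1000th record.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, value FLOAT);
    CREATE INDEX t_value ON t (value, id);
    CREATE TABLE checkpoint (first_rank INTEGER PRIMARY KEY, value FLOAT, rec_id INTEGER);
""")
db.executemany("INSERT INTO t VALUES (?, ?)",
               [(i, (i * 37) % 10_000) for i in range(1, 10_001)])

# Build checkpoints from one sorted pass: the row at sorted position k
# (k = 1000, 2000, ...) gets first_rank = k.
sorted_rows = db.execute("SELECT id, value FROM t ORDER BY value, id").fetchall()
for k in range(1000, 10_001, 1000):
    rid, val = sorted_rows[k - 1]
    db.execute("INSERT INTO checkpoint VALUES (?, ?, ?)", (k, val, rid))

# Find the record at sorted rank 4521: jump to the checkpoint at rank
# 4000, then scan 521 rows forward instead of 4520 from the start.
row = db.execute("""
    SELECT t.id, t.value
    FROM t JOIN checkpoint ON checkpoint.first_rank = 4000
    WHERE (t.value = checkpoint.value AND t.id >= checkpoint.rec_id)
       OR t.value > checkpoint.value
    ORDER BY t.value, t.id
    LIMIT 1 OFFSET 521
""").fetchone()
```

The OFFSET here is bounded by the checkpoint interval, so lookup cost no
longer grows with the target rank.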



--
Sent from: http://sqlite.1065341.n5.nabble.com/


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Dinu
The only way to do this efficiently would be to have counting (range) index
b-trees. Since you don't, you're stuck with an O(n) implementation, either on
reading or writing. So your solution is about as good as it gets, save maybe
some implementation particularities.

However, you may consider a shift in perspective: with this kind of data,
statisticians use percentiles.
That is, instead of querying for ranks 21,000-22,000, you query for the "top
1%", "6-8%", etc., based on either value or rank; this way, you can maintain a
percentile rank table as granular as you like (i.e. every 1% results in a
table with 100 lines). Each line would have a count, value min, and value max.
Such a table is much faster to update, and if you then need to retrieve the
actual records, you query by range (value BETWEEN min AND max) joined with the
percentile table.
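The percentile table can be sketched like this; bucket granularity, table and
column names are all invented for illustration, and the table is rebuilt from
one sorted pass rather than maintained incrementally:

```python
import sqlite3

# 100 percentile buckets, each storing row count and value range.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, value FLOAT);
    CREATE INDEX t_value ON t (value);
    CREATE TABLE pct (bucket INTEGER PRIMARY KEY, cnt INTEGER, vmin FLOAT, vmax FLOAT);
""")
db.executemany("INSERT INTO t VALUES (?, ?)",
               [(i, (i * 7919) % 100_000) for i in range(1, 10_001)])

values = [v for (v,) in db.execute("SELECT value FROM t ORDER BY value DESC")]
n = len(values)
for b in range(100):                      # bucket 0 = top 1%
    chunk = values[b * n // 100:(b + 1) * n // 100]
    db.execute("INSERT INTO pct VALUES (?, ?, ?, ?)",
               (b, len(chunk), min(chunk), max(chunk)))

# Fetch the actual records of the "top 1%" bucket via the value index.
vmin, vmax = db.execute("SELECT vmin, vmax FROM pct WHERE bucket = 0").fetchone()
top = db.execute("SELECT id, value FROM t WHERE value BETWEEN ? AND ?",
                 (vmin, vmax)).fetchall()
```

The 100-row pct table is cheap to refresh, and every record lookup goes
through the ordinary value index.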



--
Sent from: http://sqlite.1065341.n5.nabble.com/
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Dominique Devienne
On Tue, Jan 9, 2018 at 12:35 PM, Eric Grange  wrote:

> > But then, if your range queries are based on a rank derived from value,
> why
> > not index value directly? You'd still get fast range queries based on
> values, no?
>
> You get fast value range queries, but rank range queries become slower and
> slower the farther away you get from the top rank.
>

As I wrote, that can be (greatly IMHO) mitigated by having a partial
mapping from values to ranks across the value range, which allows you to
restrict your query to a value range that's larger than the exact rank range
you want, but still narrow enough to be fast.

And another thought is that you may be able to derive that partial mapping
from the stats that ANALYZE [1] extracts and stores in [2] or [3], if you
don't want to do it in your app (stats on the value column of your original
table, not the ranked "derived" one).

Also, I can't find the name right now, and thus no link, sorry, but SQLite
has a way to "mount" an index as a table, if I recall correctly (mostly for
troubleshooting). I'm not sure that's super useful here, since you'd probably
need access to low-level b-tree info to estimate ranks from it; the
stat(1|3|4) table info is a better avenue, I think. --DD

[1] https://www.sqlite.org/lang_analyze.html
[2] https://www.sqlite.org/fileformat2.html#stat3tab
[3] https://www.sqlite.org/fileformat2.html#stat4tab


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Eric Grange
> Do you actually have a need to find the 4512nd rank ?

Yes, this is used to display and check on "neighbors", i.e. keys with a
similar rank.
The other very common query is to show the rank for a given key.

In both cases, since things are constantly in flux, the absolute rank and
neighbors do not really matter (outside the top ranks), but changes in rank
are what is being looked at.
I.e. having rank 155k or 154k is not really meaningful in itself, but on the
other hand gaining 1000 "ranks" by going from 155k to 154k is what users are
after.


> 3) In your programming language, number the rows by doing
>
>SELECT KEY FROM RANKED ORDER BY VALUE
>
>and for each row returned doing an UPDATE for the RANK value.
>It’ll be slower, but it won’t take up the huge chunk of memory needed to
keep that index in memory.

Yes, that's what I tried, but it is indeed much slower than deleting the
whole table and inserting everything all at once, because as soon as one key
moves from the top tier to the bottom (or vice versa), you have to touch
basically all records, and doing so with many updates wrecks the performance
completely.

> But then, if your range queries are based on a rank derived from value,
why
> not index value directly? You'd still get fast range queries based on
values, no?

You get fast value range queries, but rank range queries become slower and
slower the farther away you get from the top rank.

Also, while most end-user queries are for the top 100 / top 1000, those
results are cached and only infrequently hit the db (so even if finding the
top 1000 were slow and inefficient, it would not really matter).
In practice, the queries that really hit the db are for "random" keys far
from the top 1000.

Eric


On Tue, Jan 9, 2018 at 11:44 AM, Dominique Devienne 
wrote:

> On Tue, Jan 9, 2018 at 11:26 AM, Eric Grange  wrote:
>
> > So the order by is used to control the insertion order, so that the RANK
> > autoinc primary key ends up with natural rank order
>
>
> But then, if your range queries are based on a rank derived from value, why
> not index value directly?
> You'd still get fast range queries based on values, no? That probably
> forces your app to maintain a
> mapping from "rank" to values, but you don't need to know all ranks, just
> the few necessary to know
> where to "page" from, no? (assuming I'm guessing what you use rank for
> correctly).
>
> SQLite then maintains that index automatically, no? I'm probably missing
> something though. Sounds too simple... --DD


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Dominique Devienne
On Tue, Jan 9, 2018 at 11:26 AM, Eric Grange  wrote:

> So the order by is used to control the insertion order, so that the RANK
> autoinc primary key ends up with natural rank order


But then, if your range queries are based on a rank derived from value, why
not index value directly?
You'd still get fast range queries based on values, no? That probably
forces your app to maintain a
mapping from "rank" to values, but you don't need to know all ranks, just
the few necessary to know
where to "page" from, no? (assuming I'm guessing what you use rank for
correctly).

SQLite then maintains that index automatically, no? I'm probably missing
something though. Sounds too simple... --DD


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Simon Slavin


On 9 Jan 2018, at 10:26am, Eric Grange  wrote:

> You mean using limit / offset instead ?
> 
> Even with an index on the VALUE column, queries like
> 
>select * from ranked
>order by value
>limit 10 offset xxx
> 
> become very slow when xxx is large

Yeah, to do OFFSET SQLite has to do the whole SELECT, it just throws away the 
first few entries without reporting them to you.

Do you actually have a need to find the 4512th rank ?  I was expecting you to 
just need to list the rows in rank order, meaning you would never need OFFSET.

Okay, if you do need to do things like find the 4512th rank, and you want to be 
able to insert the values without taking the large amount of memory needed for 
the ORDER BY,

1) declare an index on the VALUE column (fastest to do the INSERT first then 
create the index, but might not matter for your data)
2) do the INSERT without ORDER BY
3) In your programming language, number the rows by doing

SELECT KEY FROM RANKED ORDER BY VALUE

and for each row returned doing an UPDATE for the RANK value.

It’ll be slower, but it won’t take up the huge chunk of memory needed to keep 
that index in memory.
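Steps 1-3 can be sketched with Python's sqlite3 on toy data; the UPDATE loop
is the slow-but-low-memory part described above (this variant numbers the rows
manually, so RANK is a plain column rather than AUTOINCREMENT):

```python
import sqlite3

# Insert unordered, index VALUE, then assign RANK from a sorted scan.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE RANKED (RANK INTEGER, KEY INTEGER PRIMARY KEY, VALUE FLOAT)")
db.executemany("INSERT INTO RANKED (KEY, VALUE) VALUES (?, ?)",
               [(k, (k * 13) % 1000) for k in range(1, 1001)])
db.execute("CREATE INDEX RANKED_VALUE ON RANKED (VALUE)")

# Number the rows: the sorted read walks the VALUE index, so no big
# temporary ORDER BY b-tree is built; each row then gets one UPDATE.
rank = 0
for (key,) in db.execute("SELECT KEY FROM RANKED ORDER BY VALUE DESC").fetchall():
    rank += 1
    db.execute("UPDATE RANKED SET RANK = ? WHERE KEY = ?", (rank, key))
db.commit()
```

Wrapping the whole loop in a single transaction (as above, via one final
commit) matters: one commit per UPDATE would be far slower.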

Simon.


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Andy Ling
Isn't this really a repeat of this thread...

http://sqlite.1065341.n5.nabble.com/how-into-insert-row-into-middle-of-table-with-integer-primary-key-td98629.html

The result of which was: don't try to use the table row order to sort your 
data. Add a column that defines your sort order and do the sorting on output, 
not input.

I rather liked Jens's solution of using a string to define the sort order (top 
of the second page of the thread).

Andy Ling


-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Eric Grange
Sent: Tue 09 January 2018 10:26
To: SQLite mailing list
Subject: [External] Re: [sqlite] Efficient ways to maintaining order rank / 
row_number() in a rather large set ?

You mean using limit / offset instead ?

Even with an index on the VALUE column, queries like

select * from ranked
order by value
limit 10 offset xxx

become very slow when xxx is large, while

select * from ranked
where rank between xxx and xxx+9
order by rank

are fast regardless of the value of xxx

Similarly, finding the rank of a key becomes sluggish for keys that are not
near the top, without a rank field.

So the order by is used to control the insertion order, so that the RANK
autoinc primary key ends up with natural rank order



On Tue, Jan 9, 2018 at 10:59 AM, Simon Slavin <slav...@bigfraud.org> wrote:

> On 9 Jan 2018, at 9:50am, Eric Grange <zar...@gmail.com> wrote:
>
> > then I fill that table with something like
> >
> >   INSERT INTO RANKED (KEY, VALUE)
> >  SELECT key, value
> >  FROM ...something rather complex and big...
> >  ORDER BY value desc
> >
> > This works well enough, but as the amount of values to be ranked
> increases,
> > this feels wasteful to delete everything and then re-insert
> > everything just to adjust the RANK column, also I am running into memory
> > issues as the ORDER BY requires a temporary b-tree
> > which runs into the gigabyte range in some instances.
>
> The ORDER BY clause serves no useful value here.  Leave it out.  Do your
> sorting when you query the table.
>
> Simon.


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Eric Grange
You mean using limit / offset instead ?

Even with an index on the VALUE column, queries like

select * from ranked
order by value
limit 10 offset xxx

become very slow when xxx is large, while

select * from ranked
where rank between xxx and xxx+9
order by rank

are fast regardless of the value of xxx

Similarly, finding the rank of a key becomes sluggish for keys that are not
near the top, without a rank field.

So the order by is used to control the insertion order, so that the RANK
autoinc primary key ends up with natural rank order



On Tue, Jan 9, 2018 at 10:59 AM, Simon Slavin  wrote:

> On 9 Jan 2018, at 9:50am, Eric Grange  wrote:
>
> > then I fill that table with something like
> >
> >   INSERT INTO RANKED (KEY, VALUE)
> >  SELECT key, value
> >  FROM ...something rather complex and big...
> >  ORDER BY value desc
> >
> > This works well enough, but as the amount of values to be ranked
> increases,
> > this feels wasteful to delete everything and then re-insert
> > everything just to adjust the RANK column, also I am running into memory
> > issues as the ORDER BY requires a temporary b-tree
> > which runs into the gigabyte range in some instances.
>
> The ORDER BY clause serves no useful value here.  Leave it out.  Do your
> sorting when you query the table.
>
> Simon.


Re: [sqlite] Efficient ways to maintaining order rank / row_number() in a rather large set ?

2018-01-09 Thread Simon Slavin
On 9 Jan 2018, at 9:50am, Eric Grange  wrote:

> then I fill that table with something like
> 
>   INSERT INTO RANKED (KEY, VALUE)
>  SELECT key, value
>  FROM ...something rather complex and big...
>  ORDER BY value desc
> 
> This works well enough, but as the amount of values to be ranked increases,
> this feels wasteful to delete everything and then re-insert
> everything just to adjust the RANK column, also I am running into memory
> issues as the ORDER BY requires a temporary b-tree
> which runs into the gigabyte range in some instances.

The ORDER BY clause serves no useful value here.  Leave it out.  Do your 
sorting when you query the table.

Simon.