On 2019/02/06 12:12 AM, Gerlando Falauto wrote:

The use case involves retaining as much data as the storage can possibly
hold (so a bunch of gigabytes).
I could've just used directories and logfiles instead of abusing a
relational database, but I thought it would be more convenient to issue a
query and use a cursor.

There is nothing wrong with using a database to store a list of values; as another poster pointed out, [insert high made-up number]% of schemata out there are basically just that. But I like Keith's suggestion: since you did decide to DB it, why not make it nicely relational too?

the table is often more efficient than threading a lookup via another
index into the query plan. Sometimes crafting a new temp B-Tree index for
a specific field (or fields) on a materialized set of data might also be judged
faster than re-establishing links between said data and its original index.

Do you think restoring the original primary key (instead of ROWID) and
dropping the index would make any difference?

I do think it would make a difference (almost any change would), but I am not sure it would make all the difference. I would, however, suggest at the very least testing this and seeing.
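To make that concrete, here is a minimal sketch of the two layouts (the schema is entirely hypothetical, just to show the shape of the choice):

  -- Variant 1: the natural key *is* the table's B-Tree key, so there is
  -- no separate index to maintain or consult:
  CREATE TABLE samples(
    sensor  INTEGER,
    ts      INTEGER,            -- timestamp
    payload BLOB,
    PRIMARY KEY (sensor, ts)
  ) WITHOUT ROWID;

  -- Variant 2: an implicit-rowid table plus a secondary index; a query
  -- ordered by (sensor, ts) must hop from the index back into the table:
  CREATE TABLE samples2(
    id      INTEGER PRIMARY KEY,
    sensor  INTEGER,
    ts      INTEGER,
    payload BLOB
  );
  CREATE INDEX samples2_sensor_ts ON samples2(sensor, ts);

Whether the WITHOUT ROWID variant wins depends a lot on row size (large payloads bloat the table's B-Tree), which is exactly why testing on your own data is the only real answer.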



I pre-populated the table with a realistic use case scenario and ran
ANALYZE.
I'm not planning on using ANALYZE on the real system -- though I might
indeed pre-populate sqlite_stat1 with typical values as suggested in the
docs.

This is fine. I would ask: did your "test" data include a gigabyte or more of data? The amount, cardinality and shape of the data all matter greatly for ANALYZE to provide good information.
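For reference, a minimal sketch of seeding sqlite_stat1 by hand, per those docs (the table, index and stat values below are made up; the real numbers must come from a representative data set):

  -- sqlite_stat1 only exists once ANALYZE has run at least once:
  ANALYZE;
  -- stat is "<total rows> <avg rows per distinct value of the first
  -- indexed column> [<same for each longer column prefix> ...]":
  DELETE FROM sqlite_stat1 WHERE tbl = 'samples2';
  INSERT INTO sqlite_stat1(tbl, idx, stat)
    VALUES ('samples2', 'samples2_sensor_ts', '50000000 25000 10');
  -- Make the planner reload the modified statistics:
  ANALYZE sqlite_master;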

If you can demonstrate a true degradation //...
Yes, in the worst case, adding the ORDER BY clause (2 vs. 1, 4 vs. 3) leads
to a perceptible degradation in terms of both seek time (several seconds
vs. milliseconds to get the first row) and occupied disk space.

I must have missed this, apologies - that is certainly a very real degradation. Note, though, that the entire query delivery should be taken into consideration. The first row can often be delivered near-instantaneously, with the following rows taking progressively longer. Very often the time it takes to deliver the first row is compensated for by time saved along the subsequent rows. The QP takes this into consideration.

A good example is getting a set of data, say 100 rows, sorting it first, and then just spitting it out from memory. The preparation (i.e. first-row delivery) will take time; all the rest will be instant. Contrast that with a query that needs no sorting: it might produce rows as it scans the table, the first of which might appear instantly (since it sits at the top of the table and satisfies the WHERE clause), but each subsequent qualifying row might take a long while to produce depending on where it falls within the table. In the end, fully traversing the cursor may take a similar amount of time either way.
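You can watch the two shapes directly with EXPLAIN QUERY PLAN (reusing the hypothetical samples2 table from above; the exact output wording varies between SQLite versions):

  -- No index satisfies the ORDER BY, so SQLite sorts first: slow first
  -- row, then the rest stream straight out of the sorter.
  EXPLAIN QUERY PLAN
    SELECT * FROM samples2 WHERE sensor = 7 ORDER BY payload;
  -- ...SEARCH samples2 USING INDEX samples2_sensor_ts (sensor=?)
  -- ...USE TEMP B-TREE FOR ORDER BY

  -- No ORDER BY: rows are delivered as the search finds them, so the
  -- first row can appear immediately.
  EXPLAIN QUERY PLAN
    SELECT * FROM samples2 WHERE sensor = 7;
  -- ...SEARCH samples2 USING INDEX samples2_sensor_ts (sensor=?)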

The QP cannot know beforehand how many "hits" it will encounter, so it has to use a basic pre-made guide and/or help from the ANALYZE data to best-guess which route is better - and you can easily construct an unusual set of data for which it will choose wrong every time, and for which "fixing" it would negatively affect more common sets of data.
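As a made-up illustration of such a construction: with one heavily skewed column, equality on the popular value makes the index nearly useless while equality on any rare value makes it ideal, and a single plan cannot be right for both unless the statistics expose the skew:

  -- 100,000 rows; 99% share sensor = 1, the other 1% are unique values.
  CREATE TABLE skewed(sensor INTEGER, payload BLOB);
  CREATE INDEX skewed_sensor ON skewed(sensor);
  WITH RECURSIVE n(i) AS (
    SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 100000
  )
  INSERT INTO skewed
    SELECT CASE WHEN i % 100 = 0 THEN i ELSE 1 END, zeroblob(64) FROM n;
  -- 'WHERE sensor = 1' matches ~99% of the table, so a full scan beats
  -- the index; 'WHERE sensor = 4200' matches one row, so the index is
  -- perfect. Plain ANALYZE only records the *average* selectivity in
  -- sqlite_stat1; seeing the per-value skew needs SQLITE_ENABLE_STAT4.
  ANALYZE;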


As I already said, my use case *is* quite unusual. Definitely not something
you'd normally use a relational database for.

That does not matter - if the query planner can do better, it should - unless, of course, changing the decision tree would negatively affect another, more widely used query case. (This is the hard part to establish.)

