On Sun, 31 Jan 2016 16:14:45 +0200 R Smith <rsmith at rsweb.co.za> wrote:
I'm replying to Igor too with this message, as both of you had a similar answer. > > First understand what an Index is and how it works. > > Imagine you are asked to find all the people whose surnames are > "Duch?ne" from the telephone directory. You would be able to do this > quite fast, because the phonebook is indexed first by Surname, then > name, then address. Perhaps a Telephone directory schema might look like > this: > CREATE TABLE PhoneBook (Surname TEXT, Name TEXT, Address TEXT, > PhoneNo TEXT, PRIMARY KEY (Surname, Name, Address) ); > > Your query might look like this: > SELECT * FROM PhoneBook WHERE Surname='Duch?ne'; > > Imagine now that you are asked to find all people named "Yannick" in the > phone directory, like so: > SELECT * FROM PhoneBook WHERE Name='Yannick'; > > Immediately that would go very slow because you have to look at each > Surname and see if there are any Yannicks in there, and the same problem > arise if you are asked to look for a specific address. That makes sense, I agree, and I had this in mind for a short time, until it was shadowed by a bad bet I made for three reasons. If it's worth the words, let me tell: I saw a page (can't retrieve the URL) suggesting to order table columns by names. It was strange to me, as I had the idea of a hierarchical access for tables access. But I though ?there must be a good reason for them to say this?. Then in an SQLite page [1], there was a suggestion to avoid index containing the same data as a wider index. So after these two things, I tried to imagine ways of setting up an index so that this makes sense: I though a multi?column key could be accessed by any column, using fragments whose content are ordered. Precisely with the case of your example, I though the "name" column would be partitioned into individually sorted parts. While it was also contradicted by the fact adding a index on a single column of a multi?column primary key, could help grouping (although later again, there was another surprise contradicting this too). [1]: https://www.sqlite.org/queryplanner.html Which says > Note that Idx3 contains all the same information as the original Idx1. And so > if we have Idx3, we do not really need Idx1 any more. While reading it again, I overlooked what was next: > your database schema should never contain two indices where one index is a > prefix of the other. My bad. > You will have a bright idea right the first time you are asked to do > this - you will make a list of names in alphabetical order followed by > surnames and keep it separate, so if ever you are asked again to find > someone by name, you can reference this second list to quickly see the > name and surname, and perhaps use that info to find them in the > PhoneBook and get the rest of the info. This second list is what is > called an Index - but it is not the PRIMARY index. > > If you wish for all those searched to go fast, you need 3 Indices, not > simply a 3-sectiion primary Index. > > Perhaps this SCHEMA would better suit your needs: > > CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c)) > CREATE INDEX t_1 ON t (b); > CREATE INDEX t_1 ON t (c); > > Be careful though, every next Index will require a full datalist plus some > overhead worth of space, and will make INSERTs slower because it has to > insert more times and re-organize the B-Tree of every index a bit. I guessed, and that's why I have no swapped index (although the a,b is a foreign key to another table with only a,b, which helps for some queries, but there is no table for b,a) and prefer an attempt to tweak the query on the table as?is (for now, as I will have a third refactoring). > Best is to decide which searches you will do, make all Indices you think will > be needed, then try the queries (using explain query plan), see which Indices > are used and that the speed is good, then remove those who are not used. That's what I did, and it shown me on a query with a single `GROUP BY`, an index on "b" helped and was indeed used. Then later, it shown a variant I had the idea to test, with two nested `GROUP BY`, is running faster (an efficiency not far from that of the same query on the first column) while not using this column index at all (which I finally removed). That's how I ended with these tests and the question. May be that's the opportunity for another question I have: given a foreign key (a,b) where "a" and "b" are more than a few bytes (not small) or are of variable size (still hopefully limited), are the values for "a" and "b" duplicated or do the foreign key creates a kind of references? (may be with an hash or even a short ID for the bigger value) If it's duplicated, then I will use integer keys instead. A bit long ago, I questioned the habit of mechanically using integers PK (and also FK), feeling using the literal values is more readable and simplifies queries' text. If my assumption is wrong (i.e. this is not by reference and there are copies every where), then I will have views for readable consultation and will bother less about more verbose queries. I enjoyed your answer to all three of you (will look at `ANALYSE` in the documentation, suggested by Dominique). @Igor: I won't forget to have a look at the two references you gave. -- Yannick Duch?ne