On Sun, 31 Jan 2016 16:14:45 +0200
R Smith <rsmith at rsweb.co.za> wrote:

I'm replying to Igor too with this message, as both of you had a similar answer.

> 
> First understand what an Index is and how it works.
> 
> Imagine you are asked to find all the people whose surnames are 
> "Duch?ne" from the telephone directory. You would be able to do this 
> quite fast, because the phonebook is indexed first by Surname, then 
> name, then address. Perhaps a Telephone directory schema might look like 
> this:
>    CREATE TABLE PhoneBook (Surname TEXT, Name TEXT, Address TEXT, 
> PhoneNo TEXT, PRIMARY KEY (Surname, Name, Address) );
> 
> Your query might look like this:
>    SELECT * FROM PhoneBook WHERE Surname='Duch?ne';
> 
> Imagine now that you are asked to find all people named "Yannick" in the 
> phone directory, like so:
>    SELECT * FROM PhoneBook WHERE Name='Yannick';
> 
> Immediately that would go very slow because you have to look at each 
> Surname and see if there are any Yannicks in there, and the same problem 
> arise if you are asked to look for a specific address.


That makes sense, I agree, and I had this in mind for a short time, until it 
was shadowed by a bad bet I made for three reasons. If it's worth the words, 
let me tell:

I saw a page (can't retrieve the URL) suggesting to order table columns by 
names. It was strange to me, as I had the idea of a hierarchical access for 
tables access. But I though ?there must be a good reason for them to say this?. 
Then in an SQLite page [1], there was a suggestion to avoid index containing 
the same data as a wider index. So after these two things, I tried to imagine 
ways of setting up an index so that this makes sense: I though a multi?column 
key could be accessed by any column, using fragments whose content are ordered.

Precisely with the case of your example, I though the "name" column would be 
partitioned into individually sorted parts. While it was also contradicted by 
the fact adding a index on a single column of a multi?column primary key, could 
help grouping (although later again, there was another surprise contradicting 
this too).

[1]: https://www.sqlite.org/queryplanner.html
Which says
> Note that Idx3 contains all the same information as the original Idx1. And so 
> if we have Idx3, we do not really need Idx1 any more.
While reading it again, I overlooked what was next:
> your database schema should never contain two indices where one index is a 
> prefix of the other.
My bad.


> You will have a bright idea right the first time you are asked to do 
> this - you will make a list of names in alphabetical order followed by 
> surnames and keep it separate, so if ever you are asked again to find 
> someone by name, you can reference this second list to quickly see the 
> name and surname, and perhaps use that info to find them in the 
> PhoneBook and get the rest of the info. This second list is what is 
> called an Index - but it is not the PRIMARY index.
> 
> If you wish for all those searched to go fast, you need 3 Indices, not 
> simply a 3-sectiion primary Index.
> 
> Perhaps this SCHEMA would better suit your needs:
> 
>    CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>    CREATE INDEX t_1 ON t (b);
>    CREATE INDEX t_1 ON t (c);
> 
> Be careful though, every next Index will require a full datalist plus some 
> overhead worth of space, and will make INSERTs slower because it has to 
> insert more times and re-organize the B-Tree of every index a bit.

I guessed, and that's why I have no swapped index (although the a,b is a 
foreign key to another table with only a,b, which helps for some queries, but 
there is no table for b,a) and prefer an attempt to tweak the query on the 
table as?is (for now, as I will have a third refactoring).

> Best is to decide which searches you will do, make all Indices you think will 
> be needed, then try the queries (using explain query plan), see which Indices 
> are used and that the speed is good, then remove those who are not used.

That's what I did, and it shown me on a query with a single `GROUP BY`, an 
index on "b" helped and was indeed used. Then later, it shown a variant I had 
the idea to test, with two nested `GROUP BY`, is running faster (an efficiency 
not far from that of the same query on the first column) while not using this 
column index at all (which I finally removed). That's how I ended with these 
tests and the question.



May be that's the opportunity for another question I have: given a foreign key 
(a,b) where "a" and "b" are more than a few bytes (not small) or are of 
variable size (still hopefully limited), are the values for "a" and "b" 
duplicated or do the foreign key creates a kind of references? (may be with an 
hash or even a short ID for the bigger value) If it's duplicated, then I will 
use integer keys instead. A bit long ago, I questioned the habit of 
mechanically using integers PK (and also FK), feeling using the literal values 
is more readable and simplifies queries' text. If my assumption is wrong (i.e. 
this is not by reference and there are copies every where), then I will have 
views for readable consultation and will bother less about more verbose queries.


I enjoyed your answer to all three of you (will look at `ANALYSE` in the 
documentation, suggested by Dominique).

@Igor: I won't forget to have a look at the two references you gave.


-- 
Yannick Duch?ne

Reply via email to