On Sun, 31 Jan 2016 17:39:28 +0100
Yannick Duchêne <yannick_duchene at yahoo.fr> wrote:

> I saw a page (can't retrieve the URL) suggesting to order table
> columns by name. It was strange to me, as I had the idea of
> hierarchical access for tables. But I thought “there must be a
> good reason for them to say this”.  

On average, the quality of advice on the Internet is average, for SQL
doubly so.  Some of it is truly terrible.  

Because the advice on this list is peer-reviewed (in the sense that
people who wrote the code participate), it tends to be very good.  

My advice, ahem, is to choose chemistry over alchemy.  If you don't
understand why the advice is well founded, keep checking until you do.
If sometimes good foundations are a little mysterious, the contrary is
not true: unfounded assumptions flush out early.  

I want to answer your question a little bit abstractly, and then circle
back to SQLite.  

We know how the table is physically ordered.  But there's no WHERE
clause; the whole table will be scanned.  Building an output table for
the aggregates will be required regardless.  The only difference would
be if the cardinality of a, b, and c were different.  That is, if
GROUP BY A produces many fewer rows than GROUP BY B, we would expect it
to run faster. Otherwise it's an artifact of the implementation, not
something inherent in the problem.  
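As a sketch of that cardinality point (the table, column names, and
value distributions here are invented for illustration, not taken from
the original question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
# a has low cardinality (10 distinct values), b high (1000 distinct values)
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i % 10, i % 1000, i) for i in range(10_000)])

# Both queries scan all 10,000 input rows; only the output sizes differ.
n_a = len(conn.execute("SELECT a, count(*) FROM t GROUP BY a").fetchall())
n_b = len(conn.execute("SELECT b, count(*) FROM t GROUP BY b").fetchall())
print(n_a, n_b)  # 10 1000
```

All else being equal, the query producing 10 output rows has less
output-table bookkeeping to do than the one producing 1000.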

Yet your counts indicate the opposite: GROUP BY B is the smallest
output.  How to explain?  Or, um, EXPLAIN?  

Consider that GROUP BY A will cause each A output row to be summed
using the input (from t) sequentially.  As the input is consumed, we
move to the next output row whenever A changes.  There's no seeking
from one output row to another.  

For GROUP BY B, each input row, read sequentially, means a seek
(logically speaking) to the appropriate output row.  If the cost of
that seek is not O(1), it will add to the time used to create the
output.  Because SQLite tables are based on B+ trees, that seek cost
is O(log2 n).  

I'd say that's the source of the difference you're seeing. EXPLAIN
shows there's an optimization for GROUP BY A, probably because the
output can be constructed sequentially.  
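You can watch that optimization fire with EXPLAIN QUERY PLAN. A sketch,
assuming a hypothetical table t with an index on a so that the GROUP BY a
input arrives already sorted (schema and names are my invention):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
    CREATE INDEX t_a ON t(a);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_a = plan("SELECT a, count(*) FROM t GROUP BY a")
plan_b = plan("SELECT b, count(*) FROM t GROUP BY b")
print(plan_a)  # scans the index on a; no sorting step needed
print(plan_b)  # plain scan, plus "USE TEMP B-TREE FOR GROUP BY"
```

When the input order matches the GROUP BY, SQLite aggregates as it scans;
otherwise it builds a temporary B-tree to gather each group's rows, which
is exactly the seeking described above.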

And that's a defensible choice, because aggregation-seeking isn't
usually a high cost (as in fact it's not in your case, either).

Aggregations by definition are reductions. Outputs are usually small,
because otherwise the GROUP BY produces incomprehensible results.
10,000 rows of aggregated output isn't unheard of, but it's unusually
many, and to find one row in 10,000 in a binary tree requires at most 15
comparisons.  More commonly outputs are in the hundreds of rows, and
need half that many.  It could be faster, but only at the expense of
code complexity and memory footprint.  
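The arithmetic behind those comparison counts is just base-2 logarithms,
nothing SQLite-specific:

```python
import math

def comparisons(n):
    # Worst-case comparisons to locate one row among n in a balanced binary tree
    return math.ceil(math.log2(n))

print(comparisons(10_000))  # 14, comfortably within the "at most 15" estimate
print(comparisons(500))     # 9: a few hundred output rows need under ten
```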

HTH, to see the lite.  

--jkl