On 6/14/15, Scott Hess <shess at google.com> wrote:
> SQLite essentially gives you a set of
> b-trees with syntactic sugar over them,

SQL (and I speak in general terms here, not just of SQLite) provides
way more than syntactic sugar over b-trees.  The syntactic sugar over
b-trees part is the *least* of the many advantages of SQL.  Other more
important features include:

(1) Transactions.  Atomic updates with the option to ROLLBACK at any
time.  (See the sketch after this list.)

(2) Data abstraction.  The schema concisely and accurately describes
the data format in a way that is portable across systems and across
time.  Content stored in an SQL database is far, far more likely to be
accessible 25 years in the future.  It is also far more likely to be
accessible today by other applications written in different
programming languages or implemented on obscure and/or experimental
hardware or operating systems.

(3) Declarative Programming.  With SQL, the programmer asks the
machine a question and lets the query planner figure out an
appropriate algorithm.  A few lines of query text replace hundreds or
thousands of lines of procedural code needed to implement that query.
If performance problems are encountered, they can usually be remedied
by CREATE INDEX, without changing a single line of code - the query
plans devised by the query planner adjust automatically.
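
To make points (1) and (3) concrete, here is a minimal sketch in
SQLite syntax.  The customer and invoice tables, their columns, and
the sample rows are invented purely for illustration (the CREATE
TABLE statements also hint at point (2): the schema itself documents
the data format):

    BEGIN;
    CREATE TABLE customer(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice(id INTEGER PRIMARY KEY, custid INTEGER,
                         amount REAL, due TEXT);
    INSERT INTO customer VALUES(1,'Alice'),(2,'Bob');
    INSERT INTO invoice VALUES(100,1,25.0,'2015-07-01'),
                              (101,2,75.0,'2015-07-15');
    COMMIT;  -- or ROLLBACK; to atomically discard everything since BEGIN

    -- Declarative: state the question and let the query planner
    -- choose the algorithm.
    SELECT custid, sum(amount) FROM invoice GROUP BY custid;

    -- If that query ever becomes a bottleneck, no application code
    -- changes - just add an index and the planner adapts on its own.
    CREATE INDEX invoice_custid ON invoice(custid);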

To be sure, many programmers end up using SQL as a glorified key-value
store.  You can actually find programs that do "SELECT * FROM table
WHERE pk=?" on two or three tables and then implement a join as
procedural code in the application.  Maybe this is what they were
taught.  Or maybe they never really understood the concept of a join.
I dunno.  But for whatever reason, you do see a lot of people treating
SQL as little more than an interface or wire protocol for talking to a
btree, and thereby bypassing 95% of the goodness of SQL.  But just
because a lot of programmers do this does not mean it is right.
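
To illustrate, using the same hypothetical customer and invoice
tables sketched above: rather than fetching rows one at a time with
"SELECT * FROM invoice WHERE id=?" and matching them to customers in
application code, a single declarative join expresses the whole
computation:

    SELECT customer.name, invoice.amount, invoice.due
      FROM invoice JOIN customer ON customer.id=invoice.custid
     ORDER BY invoice.due;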

Alexander Lloyd (on the Google Spanner team) says that "NoSQL
databases ... enforce a broadly applied premature optimization on the
entire system."  What Alexander means is that there are some
specialized cases where NoSQL-type solutions are required for
performance, but that those situations are the exception rather than
the rule.   It is better to use the powerful abstractions provided by
SQL to build a reliable system, then go back and optimize any
performance bottlenecks on a case-by-case basis.  Alexander's remarks
implicitly but clearly reference Don Knuth's assertion that premature
optimization is the root of all evil.

Another way of thinking about the difference between SQL and NoSQL is
by analogy to C versus assembly language.  There was a huge debate in
the software industry in the 1980s about whether applications should
be coded in assembly or C.  Hand-coded assembly is theoretically
faster than machine code generated by a C compiler (or at least it was
in the 80s - that point is debatable now).  But in practice, programs
written in C tended to perform better than those written in assembly.
The reason is that C, being higher level, enabled programmers to
spend less time fiddling with bits and more time developing better
algorithms.  C lets a programmer spend more time "heads-up" thinking
about solving the user's problem and less time "heads-down" worrying
about the details of the implementation.  The SQL vs. NoSQL
situation is quite similar.  You can, in theory, always run faster
with NoSQL storage.  But in practice, SQL is so much more powerful
that applications coded using SQL tend to perform better, be more
reliable, and have more useful features than those written using
NoSQL or ad hoc storage solutions.

One clear example of an application suffering from a lack of SQL is
Git.  Git uses a NoSQL ad hoc key/value data store - files in the .git
directory.  This is fast (they say).  But it also causes serious
problems.  For example, finding the children of a commit involves a
full-table scan of the Git log.  This turns out to be so difficult and
slow that nobody implements it.  On GitHub (and on every other Git
interface I've seen) you will find links to the parents of a check-in
but not its children.  And so when looking at the history of a Git
project, it is impossible to find out what happened next.  It also
leads to the vexing problem known as "detached head state".  If the
Git log had been implemented as an SQL database (SQLite would have
been a good choice for this, I think, but there are other options)
then a detached head would be impossible - that scourge of so many
hapless Git users simply would never have existed.  And finding (for
example) the first 30 descendants of a check-in would become a simple
query something like this:

    -- dx walks the descendants of $commithash: it is seeded with the
    -- starting commit, then each step pulls in the children of rows
    -- already found, visiting them in mtime order.  (The gitlog table
    -- is sketched below.)
    WITH RECURSIVE
      dx(h,m) AS (SELECT child, mtime FROM gitlog WHERE child=$commithash
                      UNION
                      SELECT child, mtime FROM gitlog, dx WHERE parent=h
                      ORDER BY 2)
    SELECT h, m FROM dx LIMIT 30;

This query takes microseconds.  But as things stand now, doing the
equivalent computation using the NoSQL Git repository format would
involve thousands of lines of procedural code and even then would be
insufferably slow, which perhaps explains why nobody has ever
implemented it.
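
For reference, the query above presupposes that the Git log is kept
in a table something like the following.  The gitlog name, columns,
and indexes are my own invention for the sake of the example, not
anything Git actually provides:

    -- Hypothetical edge table: one row per parent/child link
    -- between commits.
    CREATE TABLE gitlog(
      parent TEXT,     -- hash of the parent commit
      child  TEXT,     -- hash of the child commit
      mtime  INTEGER   -- commit time of the child
    );
    CREATE INDEX gitlog_parent ON gitlog(parent);
    CREATE INDEX gitlog_child ON gitlog(child);

With an index on parent, finding the children of any commit becomes
an ordinary index lookup instead of the full-table scan described
above.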

So, no, SQL is not just syntactic sugar on a btree.  To anyone who
thinks it is, I claim that you do not really understand SQL and I
encourage you to abandon your preconceived notions and look more
closely at the whole SQL concept.  I think you will find that SQL is
an exceedingly powerful concept that, when used to its fullest, will
make your applications faster, more reliable, and easier to write and
maintain.

-- 
D. Richard Hipp
drh at sqlite.org
