On 6/14/15, Scott Hess <shess at google.com> wrote: > SQLite essentially gives you a set of > b-trees with syntactic sugar over them,
SQL (and I speak in general terms here, not just of SQLite) provides way more than syntactic sugar over b-trees. The syntactic sugar over b-trees part is the *least* of the many advantages of SQL. Other more important features include: (1) Transactions. Atomic updates with the option to ROLLBACK at any time. (2) Data abstraction. The schema concisely and accurately describes the data format in a way that is portable across systems and across time. Content stored in an SQL database is far, far more likely to be accessible 25 years in the future. It is also far more likely to be accessible today by other applications written in different programming languages or implemented on obscure and/or experimental hardware or operating systems. (3) Declarative Programming. With SQL, the programmer asks the machine a question and lets the query planner figure out an appropriate algorithm. A few lines of query text replace hundreds or thousands of lines of procedural code needed to implement that query. If performance problems are encountered, they can usually be remedied by CREATE INDEX, and without changing a single line of code - the query procedures devised by the query planner shift automatically. To be sure, many programmers end up using SQL as a glorified key-value store. You can actually find programs that do "SELECT * FROM table WHERE pk=?" on two or three tables and then implement a join as procedural code in the application. Maybe this is what they were taught. Or maybe they never really understood the concept of a join. I dunno. But for whatever reason, you do see a lot of people treating SQL as little more than an interface or wire protocol for talking to a btree, and thereby bypassing 95% of the goodness of SQL. But just because a lot of programmers do this, does not mean it is right. Alexander Lloyd (on the Google Spanner team) says that "NoSQL databases ... enforce a broadly applied premature optimization on the entire system." What Alexander means is that there are some specialized cases where NoSQL-type solutions are required for performance, but that those situations are the exception rather than the rule. It is better to use the powerful abstractions provided by SQL to build a reliable system, then go back and optimize any performance bottlenecks on a case-by-case basis. Alexander's remarks implicitly but clearly reference Don Knuth's assertion that premature optimization is the root of all evil. Another way of thinking about the difference between SQL and NoSQL is by analogy to C versus assembly language. There was a huge debate in the software industry in the 1980s about whether applications should be coded in assembly or C. Hand-coded assembly is theoretically faster than machine-code generated by a C compiler (or at least it was in the 80s - that point is debatable now). But in practice, programs written in C tended to perform better than those written in assembly. The reason is that C being higher level enabled programmers to spend less time fiddling with bits and more time developing better algorithms. C enables a programmer to spend more time "heads-up" thinking about solving the users problem and less them "heads-down" worrying about the details of the implementation. The SQL vs. NoSQL situation is quite similar. You can, in theory, always run faster with NoSQL storage. But in practice, SQL is so much more powerful that applications coded using SQL tend to perform better, be more reliable, and have more and useful features than those written using NoSQL or ad hoc storage solutions. One clear example of an application suffering from a lack of SQL is Git. Git uses a NoSQL ad hoc key/value data store - files in the .git directory. This is fast (they say). But it also causes serious problems. For example, finding the children of a commit involves a full-table scan of the Git log. This turns out to be so difficult and slow that nobody implements it. On GitHub (and on every other Git interface I've seen) you will find links to the parents of a check-in but not its children. And so when looking at the history of a Git project, it is impossible to find out what happened next. It also leads to the vexing problem known as "detached head state". If the Git log had been implemented as an SQL database (SQLite would have been a good choice for this, I think, but there are other options) then a detached head becomes impossible - that scourge of so many hapless Git users simple would have never existed. And finding (for example) the first 30 descendants of a check-in would become a simple query something like this: WITH RECURSIVE dx(h,m) AS (SELECT child, mtime FROM gitlog WHERE child=$commithash UNION SELECT child, mtime FROM gitlog, dx WHERE parent=h ORDER BY 2) SELECT h, m FROM dx LIMIT 30; This query takes microseconds. But as things stand now, doing the equivalent computation using the NoSQL Git repository format would involve thousands of lines of procedural code and even then would be insufferably slow, which perhaps explains why nobody has ever implemented it. So, no, SQL is not just syntactic sugar on a btree. To anyone who thinks it is, I claim that you do not really understand SQL and I encourage you to abandon your preconceived notions and look more closely at the whole SQL concept. I think you will find that SQL is an exceedingly powerful concept that, when used to its fullest, will make your applications faster, more reliable, and easier to write and maintain. -- D. Richard Hipp drh at sqlite.org