On Tue, Jul 02, 2013 at 11:40:34AM +0100, Alex Bowden scratched on the wall: > > The SQL standard has always been a moving feast, chasing the field > implementations, perfectly capable of going back on it's earlier > mistakes, the main purpose of which, on a good day, is to promote > standardisation of SQL implementations and try and keep to the > Relational Theory model where practical considerations allow. > > So, if the SQL standard has drifted toward requiring "in the order > in which they are defined in the table definition" to be meaningful, > then this is an oversight that would likely be corrected when somebody > has an in the field SQL database which, correctly, enforces no such > concept.
I realize that historically the SQL standards have been somewhat liquid, and that the standards often follow the implementations (although not so much in the last 20 years). This is not some small misstep, however. The concept that columns have order, and are defined and referenced by that order, is baked deep, deep into the SQL language, environment, and every SQL operation. It isn't just some after-thought in a newer version of the standard. As I said before, SQL has very poor and highly ambiguous naming conventions (nevermind standards) for result set columns. Trying to define columns by their names in an SQL environment would be extremely difficult and require a massive overhaul of SQL and every database out there. While it may be a bit ugly and non-Relational to have a defined column order, it rarely matters. Unlike row order, there aren't performance concerns in a row-based database-- the query optimizer has little to gain by allowing arbitrary column ordering. This is a reason nearly every database API provides access to results by column index. In the SQLite API, column index is the *ONLY* way to retrieve a value (or, for that matter, a column name). Even if it wasn't part of the standard, reordering the column results of a "*" would break almost every application written in the last 35 years that uses that type of query. > People should not be encouraged to become more dependent on the use of > such temporary misfeatures. I agree that good database engineers should have a strong understanding of Relational Model and-- given the choice-- should tend to default to doing things "the Relational way", but back in reality, that's not the world we live in. SQL is not the Relational Model, and the Relational Model is not SQL. Every practical RDBMS out there uses SQL, so every good database engineer should be equally versed in SQL, and the differences between SQL and the Relational Model. Sometimes you can approach a problem in a very Relational way, but still express it in SQL. Many times you cannot. Fighting SQL and going against its design concepts just to satisfy some desire to do things the theoretically pretty way is likely to result in making thing overly complex and poor performance. If you're working in SQL, doing it the "SQL way" isn't inherently evil, especially if you know enough to understand why you're diverging from the Model, and why it makes sense in SQL. If you try to make your C++ look and act exactly like Smalltalk, you'll end up writing something that is even uglier than standard C++. Using ideas and concepts from Smalltalk to better your C++ programs and OOP designs is good engineering. Forcing something that isn't meant to be is not. The same applies to SQL and the Relational Model. > In context, the particular focus of your objection to the > relational approach, seems irrelevant. Yes and no. I have great respect for the Relational Model, and think every database programmer worth anything should have a deep understanding of it... Not the Relational Model through SQL, but the Model itself, unencumbered by everything SQL brings to the table. They should also know SQL, and where the environments differ. If anyone cares, I can provide a list of good books (most by C.J. Date). But in the bigger context of this discussion, I think SQLite might consider a row_number() function, or (my personal preference) some type of virtual column, such as "_row_number_". I suggest that term, since that's what Oracle, MS SQL Server, and PostgreSQL use. I prefer the idea of a virtual column because, unlike a function, it is difficult to misuse a virtual column in some other context. If it can only be expressed in a result set selector (the SELECT clause), it will only work there. It is easy to say such a feature it isn't Relational. But guess what... result sets aren't Relational, even in a pure Relational Model environment. In fact, that's why they're called "result sets" rather than "result relations" or (in SQL) "result tables." Yes, the name "set" is not the best choice, but the difference in terms is no casual accident. Results specifically have a different name, because the object that is returned by a query result is not a relation or table... it has different properties, including a defined column order and possibly defined row order. If you want to claim a result set should be strongly Relational, that's fine, but then you need to get rid of both ORDER BY and LIMIT. While getting rid of ORDER BY and LIMIT is all nice and pure and full of theory, I think the people that actually use SQL to build real applications and get stuff done might have a slightly different set of thoughts on why those features should stay. Sorry, but "it's not Relational" doesn't cut it for me, even as someone that really likes the Model and finds SQL pretty darn ugly. Results aren't Relational in many, many ways; this is not the first. Other points: -- There are other ways of doing this in SQL. Umm, exactly? This isn't adding anything new. There are ways you can bend over backwards to do this in SQL already, and people will use them. Might as well provide users with a very straight-forward way of dealing with things, so that the query optimizer can understand what the heck you're *really* trying to do and deal with it appropriately (which is to say, mostly ignore it). If you want to take the high road on pure theory over performance, that's fine, have fun with your pure theory and math. Meanwhile, back in the Real World... -- You can just do it in code. Well sure. I can "just do" ORDER BY in code as well, but that's not really the point, is it? SQL is an API-independent query expression language for a reason. Most of the database logic and information management is supposed to happen in the SQL, which is a pretty rich language for such things, allowing complex sub-queries, views, and all kinds of crazy things. If an application needs to manipulate database data as part of the data management pipeline, it should likely be able to do it in SQL-- not use some quirk of the API to glue the desired information into the result set afterwords. Otherwise ORDER BY and LIMIT are out again, as is selecting specific columns, or even WHERE limits... we can go on and on, saying "just do it in code" until you've devolved into a very simple and basic key-value store. Then you have what's known as a "NoSQL" database. Finally, I'm going to make the argument that having a ranking function is actually more Relational, not less. Here's why: Here is a query. I would say it is a very reasonable query for a database to perform-- and I specifically mean "database", as in SQL, not a full application pipeline. Interactive, at the prompt, type stuff. "Give me a list of the top 10 sales regions for last quarter, ranked by total number of orders." Seems pretty reasonable. JOIN together all the data we need, filter by dates, group by sales regions, total them all up, and sort by the number of orders. Very traditional type of query, very easy to express in SQL. Of course the result is very non-Relational, since the inherent information I asked for is dependent on row order, but I'll get back to that. So anyways, I don't actually care about the actual number of orders, which is mostly likely what my SQL query returns, I just want the ranking-- who is first, second, and third. I can get that from an ORDER BY query, but the data that makes up the result set doesn't actually contain the information I want. The information I want is encoded into the row order of the result set, not the data values of the result. In other words, the row order is extremely relevant, and part of the desired result itself. That's about as non-Relational as you can get. Now consider a ranking function that lets me label the result of the ORDER BY explicitly. Now we have an actual column value that represents the data I want. This is important, because once that column exists, I can once again treat the rows as a proper set. That is, once the ranking is explicitly encoded as a column-- a data value within that unique row, rather than inferred from the structure of the rows-- then the row order of the result is not important. The data set represents the same answer to the same query, even if the row order is scrambled, or undefined. That is a very, very Relational like thing. "But, but, but..." you scream. Yes, I know. I had to sort the rows into an explicit order before I could extract the ranking data, so after the first pass, the ranking data will always be in order and that doesn't really add much. Yes and no. There are plenty of cases when an SQL result set gets turned back into a relation, dropping the defined row order. Views and sub-queries are the most obvious case. Views are a touchy subject, since some people will say that views can be ordered. Personally, I don't buy that. If a view is supposed to look like a table, smell like a table, and act like a table, then the view cannot have a pre-defined row order. It breaks way too many Relational ideas. Ordered views are ugly from a Relational standpoint, not just an SQL one. So if you want a view that provides any type of ordering, you need to be able to provide a rank column as part of the view result. Subqueries are usually a bit easier to understand, since it is obvious to see how a very complex query with many subqueries is not going to preserve ORDER BY output through a tree of subqueries. And, in the case of a ranking, it makes sense to order the result of a subquery, attach the query information, and then devolve the result set back into an arbitrary relation/table and let the rows scramble themselves for the next stage of the root query. More to the point, if I have a ranking functionality available, I'm happy to let the query engine do this, since I've explicitly defined the information I want to carry through the larger query. Additionally, it should be pointed out that most other databases provide a ranking function as part of their windowing function set, meaning a ranking could be applied explicitly over some data set; i.e. "RANK( total( orders ) )", rather than having ORDER BY applied over some data set and then pulling the rank information back out, i.e. "SELECT total( orders ), _row_number_ [...] ORDER BY 1". That's clearly perfectly Relational, since row order never comes into it. It is also likely to be higher performance, since the query engine understands what you want and may not have to do a full sort in order to provide it. In such a case the rows can always be treated as a proper set (i.e. Relationally). Of course, such a thing would require a large and complex update to the SQL syntax, so having a simple rank function that is dependent on ORDER BY to generate meaningful results makes some sense in the world of theory vs practicality. So, yes... overall I think some type of ranking functionality-- be it an explicit function or a virtual column or something else-- is a good idea. I think it actually improves the Relational aspects of these queries, since it allows the SQL programmer to explicitly define orders and ranking as data values, rather than having them implied in row order. This makes the result set much more Relational, since the defined row order can be dropped without losing query information. And, yes, it will be abused by people that don't really understand how to use it, or how it fits into the overall SQL language. You know, like every other aspect of the SQL language-- or any programming language-- is misused by clueless people. I don't care. It lets me sell more books. > Sort order isn't necessarily deterministic even if we know the column > order. So the possibility that we may not know it, makes life no worse. True, you can sort by random(). The sort *process* must be deterministic, however, or there isn't much point in having a sort. -j -- Jay A. Kreibich < J A Y @ K R E I B I.C H > "Intelligence is like underwear: it is important that you have it, but showing it to the wrong people has the tendency to make them feel uncomfortable." -- Angela Johnson _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users