On Tue, Jul 02, 2013 at 11:40:34AM +0100, Alex Bowden scratched on the wall:
> 
> The SQL standard has always been a moving feast, chasing the field
> implementations, perfectly capable of going back on it's earlier
> mistakes,  the main purpose of which, on a good day, is to promote
> standardisation of SQL implementations and try and keep to the
> Relational Theory model where practical considerations allow.
> 
> So, if the SQL standard has drifted toward requiring "in the order
> in which they are defined in the table definition"  to be meaningful,
> then this is an oversight that would likely be corrected when somebody
> has an in the field SQL database which, correctly, enforces no such
> concept.

  I realize that historically the SQL standards have been somewhat liquid,
  and that the standards often follow the implementations (although not
  so much in the last 20 years).  This is not some small misstep, however.
  The concept that columns have order, and are defined and referenced by
  that order, is baked deep, deep into the SQL language, environment,
  and every SQL operation.  It isn't just some after-thought in a newer
  version of the standard.

  As I said before, SQL has very poor and highly ambiguous naming conventions 
  (nevermind standards) for result set columns.  Trying to define
  columns by their names in an SQL environment would be extremely
  difficult and require a massive overhaul of SQL and every database
  out there.  While it may be a bit ugly and non-Relational to have a
  defined column order, it rarely matters.  Unlike row order, there
  aren't performance concerns in a row-based database-- the query
  optimizer has little to gain by allowing arbitrary column ordering.
  
  This is a reason nearly every database API provides access to
  results by column index.  In the SQLite API, column index is the
  *ONLY* way to retrieve a value (or, for that matter, a column name).
  Even if it wasn't part of the standard, reordering the column results
  of a "*" would break almost every application written in the last 35
  years that uses that type of query.

> People should not be encouraged to become more dependent on the use of
> such temporary misfeatures.

  I agree that good database engineers should have a strong
  understanding of Relational Model and-- given the choice-- should
  tend to default to doing things "the Relational way", but back in
  reality, that's not the world we live in.  SQL is not the Relational 
  Model, and the Relational Model is not SQL.  Every practical RDBMS out
  there uses SQL, so every good database engineer should be equally
  versed in SQL, and the differences between SQL and the Relational Model.

  Sometimes you can approach a problem in a very Relational way, but still
  express it in SQL.  Many times you cannot.  Fighting SQL and going
  against its design concepts just to satisfy some desire to do things
  the theoretically pretty way is likely to result in making thing
  overly complex and poor performance.  If you're working in SQL, doing
  it the "SQL way" isn't inherently evil, especially if you know enough
  to understand why you're diverging from the Model, and why it makes
  sense in SQL.

  If you try to make your C++ look and act exactly like Smalltalk, you'll
  end up writing something that is even uglier than standard C++.  Using
  ideas and concepts from Smalltalk to better your C++ programs and OOP
  designs is good engineering.  Forcing something that isn't meant to
  be is not.

  The same applies to SQL and the Relational Model.

> In context, the particular focus of your objection to the
> relational approach, seems irrelevant.

  Yes and no.  I have great respect for the Relational Model, and think
  every database programmer worth anything should have a deep
  understanding of it... Not the Relational Model through SQL, but the
  Model itself, unencumbered by everything SQL brings to the table.
  They should also know SQL, and where the environments differ.  If anyone
  cares, I can provide a list of good books (most by C.J. Date).

  But in the bigger context of this discussion, I think SQLite might
  consider a row_number() function, or (my personal preference) some
  type of virtual column, such as "_row_number_".  I suggest that term,
  since that's what Oracle, MS SQL Server, and PostgreSQL use.  I
  prefer the idea of a virtual column because, unlike a function, it is
  difficult to misuse a virtual column in some other context.  If it
  can only be expressed in a result set selector (the SELECT clause),
  it will only work there.



  It is easy to say such a feature it isn't Relational.  But guess what...
  result sets aren't Relational, even in a pure Relational Model environment.
  In fact, that's why they're called "result sets" rather than
  "result relations" or (in SQL) "result tables."  Yes, the name "set"
  is not the best choice, but the difference in terms is no casual accident.
  Results specifically have a different name, because the object that is
  returned by a query result is not a relation or table... it has
  different properties, including a defined column order and
  possibly defined row order.

  If you want to claim a result set should be strongly Relational,
  that's fine, but then you need to get rid of both ORDER BY and LIMIT.
  While getting rid of ORDER BY and LIMIT is all nice and pure and full
  of theory, I think the people that actually use SQL to build real
  applications and get stuff done might have a slightly different set
  of thoughts on why those features should stay.

  Sorry, but "it's not Relational" doesn't cut it for me, even as
  someone that really likes the Model and finds SQL pretty darn ugly.
  Results aren't Relational in many, many ways; this is not the first.

  Other points:

  -- There are other ways of doing this in SQL.

  Umm, exactly?  This isn't adding anything new.  There are ways you
  can bend over backwards to do this in SQL already, and people will
  use them.  Might as well provide users with a very straight-forward
  way of dealing with things, so that the query optimizer can understand
  what the heck you're *really* trying to do and deal with it appropriately
  (which is to say, mostly ignore it).
  
  If you want to take the high road on pure theory over performance,
  that's fine, have fun with your pure theory and math.  Meanwhile,
  back in the Real World...

  -- You can just do it in code.

  Well sure.  I can "just do" ORDER BY in code as well, but that's not
  really the point, is it?  SQL is an API-independent query expression
  language for a reason.  Most of the database logic and information
  management is supposed to happen in the SQL, which is a pretty rich
  language for such things, allowing complex sub-queries, views, and
  all kinds of crazy things.  If an application needs to manipulate
  database data as part of the data management pipeline, it should
  likely be able to do it in SQL-- not use some quirk of the API to
  glue the desired information into the result set afterwords.
  Otherwise ORDER BY and LIMIT are out again, as is selecting specific
  columns, or even WHERE limits... we can go on and on, saying "just do
  it in code" until you've devolved into a very simple and basic
  key-value store.  Then you have what's known as a "NoSQL" database.



  Finally, I'm going to make the argument that having a ranking
  function is actually more Relational, not less.  Here's why:

  Here is a query.  I would say it is a very reasonable query for a
  database to perform-- and I specifically mean "database", as in SQL,
  not a full application pipeline.  Interactive, at the prompt, type
  stuff.

        "Give me a list of the top 10 sales regions for
        last quarter, ranked by total number of orders."

  Seems pretty reasonable.  JOIN together all the data we need, filter
  by dates, group by sales regions, total them all up, and sort by the
  number of orders.  Very traditional type of query, very easy to
  express in SQL.

  Of course the result is very non-Relational, since the inherent
  information I asked for is dependent on row order, but I'll get back
  to that.

  So anyways, I don't actually care about the actual number of orders,
  which is mostly likely what my SQL query returns, I just want the
  ranking-- who is first, second, and third.  I can get that from an
  ORDER BY query, but the data that makes up the result set doesn't
  actually contain the information I want.  The information I want is
  encoded into the row order of the result set, not the data values of
  the result.  In other words, the row order is extremely relevant, and
  part of the desired result itself.   That's about as non-Relational
  as you can get.
 
  Now consider a ranking function that lets me label the result of the
  ORDER BY explicitly. 

  Now we have an actual column value that represents the data I want.
  This is important, because once that column exists, I can once again
  treat the rows as a proper set.  That is, once the ranking is
  explicitly encoded as a column-- a data value within that unique
  row, rather than inferred from the structure of the rows-- then the row
  order of the result is not important.  The data set represents the
  same answer to the same query, even if the row order is scrambled, or
  undefined.  That is a very, very Relational like thing.

  "But, but, but..." you scream.  Yes, I know.  I had to sort the rows
  into an explicit order before I could extract the ranking data, so
  after the first pass, the ranking data will always be in order and
  that doesn't really add much.  Yes and no.
  
  There are plenty of cases when an SQL result set gets turned back into
  a relation, dropping the defined row order.  Views and sub-queries are
  the most obvious case.

  Views are a touchy subject, since some people will say that views can
  be ordered.  Personally, I don't buy that.  If a view is supposed to
  look like a table, smell like a table, and act like a table, then the
  view cannot have a pre-defined row order.  It breaks way too many
  Relational ideas.  Ordered views are ugly from a Relational
  standpoint, not just an SQL one.  So if you want a view that provides
  any type of ordering, you need to be able to provide a rank column as
  part of the view result.

  Subqueries are usually a bit easier to understand, since it is
  obvious to see how a very complex query with many subqueries is not
  going to preserve ORDER BY output through a tree of subqueries.  And,
  in the case of a ranking, it makes sense to order the result of a
  subquery, attach the query information, and then devolve the result
  set back into an arbitrary relation/table and let the rows scramble
  themselves for the next stage of the root query.  More to the point,
  if I have a ranking functionality available, I'm happy to let the
  query engine do this, since I've explicitly defined the information I
  want to carry through the larger query.

  Additionally, it should be pointed out that most other databases
  provide a ranking function as part of their windowing function set,
  meaning a ranking could be applied explicitly over some data set;
  i.e. "RANK( total( orders ) )", rather than having ORDER BY applied
  over some data set and then pulling the rank information back out,
  i.e. "SELECT total( orders ), _row_number_ [...] ORDER BY 1".  That's
  clearly perfectly Relational, since row order never comes into it.
  It is also likely to be higher performance, since the query engine
  understands what you want and may not have to do a full sort in order
  to provide it.  In such a case the rows can always be treated as a
  proper set (i.e. Relationally).  Of course, such a thing would
  require a large and complex update to the SQL syntax, so having a
  simple rank function that is dependent on ORDER BY to generate
  meaningful results makes some sense in the world of theory vs
  practicality.

 
 

  So, yes... overall I think some type of ranking functionality-- be it
  an explicit function or a virtual column or something else-- is a
  good idea.  I think it actually improves the Relational aspects of
  these queries, since it allows the SQL programmer to explicitly
  define orders and ranking as data values, rather than having them
  implied in row order.  This makes the result set much more
  Relational, since the defined row order can be dropped without losing
  query information.

  And, yes, it will be abused by people that don't really understand
  how to use it, or how it fits into the overall SQL language.  You
  know, like every other aspect of the SQL language-- or any
  programming language-- is misused by clueless people.  I don't care.
  It lets me sell more books.


> Sort order isn't necessarily deterministic even if we know the column
> order.  So the possibility that we may not know it, makes life no worse.

  True, you can sort by random().  The sort *process* must be
  deterministic, however, or there isn't much point in having a sort.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to