Excellent post Paul!
On 28 Dec 2011, at 07:38, Paul Davis <[email protected]> wrote: > I'll prefix this with the fact that Sean does an excellent point of > responding your core concerns. I've had a bit of this brewing as of > late for a various number of reasons. But your email provokes a bit of > a response so I think this is as good of a place as any to write down > some thoughts. > > Also, to preface, my nascent background was to be taught SQL and the > relational model by a couple guys that are deep into it. Before I > moved to non-relational datastores I was deeply entrenched in a purist > relational model. Purist beyond even most DBA's because I was taught > from practical application of theory. Things like "null != null" and > "null values are a hack around Codd's rules" are relatively benign but > points that I find few people have completely groked in the real > world. I was lucky to have these two guys teach me from the "this is > proper and this is why we ignore that" point of view. > > On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo <[email protected]> wrote: >> Hi all, >> >> (I originally posted this on the mobile-couchbase list, but was recommended >> to post it here instead, since this is more of a general couchdb question) >> >> I am struggling with something that I know would be quite trivial to do in >> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years >> of experience with SQL is clouding my judgement and so I am trying rather to >> break away from the SQL/relational mindset and understand how this would be >> better designed in couch. >> > > This is good. The first thing to realize in learning a new tool is > that its not the old tool. New tools behave differently and > differences should be embraced rather than rejected outright. > >> I am developing an iOS art gallery app, and wish to display information >> about artworks, artists and galleries. >> >> The menu screen requires a list of all artworks together with the gallery >> they are currently housed at and the artist that created the artwork: >> >> * artwork title >> * artwork medium >> * artwork dimensions >> * artist full_name >> * gallery name, >> * gallery city >> >> for each artwork, respectively. >> >> Imagine the following objects in my database (I am not including all the >> fields for sake of brevity. i.e. only the ones that are important). >> >> | Artist >> ------ >> _id >> full_name >> birth_year >> birth_place >> >> Gallery >> ------- >> _id >> name >> city >> >> Artwork >> ------- >> _id >> title >> medium >> dimensions >> artist_id_ (foreign key) >> gallery_id_ (foreign key) >> | >> >> In SQL it would be as easy as one query with two joins: >> >> |SELECT >> artwork.title, >> artwork.medium, >> artwork.dimensions, >> artist.full_name, >> gallery.name, >> gallery.city >> FROM >> artwork INNER JOIN artist >> ON artwork.artist_id_ = artist._id INNER JOIN gallery >> ON artwork.gallery_id_ = gallery._id >> | >> > > The fact that you took the time to specify INNER JOIN is a good tip > you have a thorough understanding of SQL. > >> From this I would be able to get the data I need in one query (i.e. one call >> to the database). >> >> | Guernica | Pablo Picasso | Museo Reina Sofia, Madrid, Spain >> Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France >> David | Michelangelo | Uffizi Gallery, Florence, Italy. >> | >> >> Using couchdb I am storing the data in a very similar way to the way I would >> in a relational database, represented as something along the lines of: >> > > Roughly speaking, this is your first mistake. > >> |{ >> "type" : "Artwork", >> "_id" : "d665afaedde12b44c5159cf1782c5c1e", >> "_rev" : "1-446201af8a1ddfa3759164a01be899de", >> "artist_id" : "d665afaedde12b44c5159cf1782c2d0b", >> "gallery_id" : "d665afaedde12b44c5159cf1782d44af", >> "title" : "Guernica", >> "medium" : "Oil on canvas", >> "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)" >> } >> >> { >> "type" : "Artist", >> "_id" : "d665afaedde12b44c5159cf1782c2d0b", >> "_rev" : "1-8fa8a6dd50d9d9072b08db1a840128b1", >> "full_name" : "Pablo Picasso", >> "birth_year" : "1881" >> } >> >> { >> "type" : "Gallery", >> "_id" : "d665afaedde12b44c5159cf1782d44af", >> "_rev" : "1-27a139e40a7f88d6465812eec3c73a0f", >> "name" : "Museo Reina Sofia" >> } >> | >> >> To get the same result in couch, I would have to do one query to get all the >> artworks, e.g. >> >> |function(doc) { >> if(doc.type&& doc.type == 'Artist') { >> emit(doc._id,doc); >> } >> } >> | >> >> then I would need to do a |GET| for the _id of each |Gallery| and |Artist| >> returned per |Artwork| respectively to fill in the missing information. This >> is usually known as the |n+1| queries problem (in this case |2n+1|). >> >> I have always worked on the assumption that the database should be leveraged >> to do as much of the heavy lifting as possible with as few network hits as >> possible (i.e. SQL is always faster than client manipulation). With a list >> of 120 artworks, this means 241 HTTP GETs to the database. >> > > This is your second mistake. And more to the point, this is the first > mistake I had to overcome myself when switching from relational to > non-relational data stores. Even more specifically, my stumbling block > was exactly the same as yours. The assumption that the number of > requests/queries is to be minimized. > > I was taught quite specifically to minimize the number of queries per > page load. Current ORM layers are still abhorrent to me because of the > ease in which they translate directly to your worst case, each row in > the list ends up being three queries per row. Specifically, people > must understand both the relational model and their ORM layer to avoid > it from doing the Wrong Thing ™. > > The conceptual realization that we need to make here is that SQL isn't > magical. When we issue a query, it goes through the planner and ends > up doing (seriously advanced) optimizations over the indexes and data > retrieval methods it has available. The "minimize number of queries" > mantra is a bit overly broad here. In practical terms it has the > effect of making sure that we're making the best use of indexes and > look up patterns. As an analogy, its akin to writing better C because > we might have some intuition on how the assembler translation might be > generated. > >> I am aware of "linked documents" in views >> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents), >> however this seems to only work with one related row. >> >> I can't really see how "view collation" >> (http://wiki.apache.org/couchdb/View_collation) would work here either. With >> view collation the correlated information ends up being on different rows, >> and I think that would be better suited for where say a single Artists or >> Gallery is displayed with associated Artworks. My case is the opposite, >> whereby Artworks plus corresponding information from Artists and Galleries >> is required. >> > > These are definitely tools in the arsenal, but don't really speak to > the core issue. I'll come back to this in a bit. > >> I am also aware that one could embed each gallery or artist in each artwork >> or vice versa, but that seems to be a lot of duplication, and would make it >> difficult to change artist or gallery information if required. Maybe this is >> how it should be done? >> > > As Sean pointed out, the second thing to overcome is data duplication. > Storage is cheaper than CPU and network costs. If you don't expect > data to change, you should try and exploit that. > >> For this small app, this is not going to be a performance issue. I am trying >> rather to break away from a SQL/relational mindset and understand how this >> would be better designed in a database like couch. >> > > Your honest questions are quite refreshing because they remind me of > my own initial exploration. There's lots of people talking but I've > never found a really concise description of the difference between the > approaches. There's lots of anecdotal blog posts about "scaling" and > the such, but nothing I've seen really addresses the core issues. > >> Thanks, >> >> Steve >> > > So, the bottom line is that the difference between relational and > non-relational is two things. The first issue that people encounter > when moving from relational to non is that the query planner is gone. > This seems benign at fist, but affects a huge number of moving pieces > in anything but non-trivial applications. If the query planner is > gone, then all queries are less than optimal. Think as a DBA if > *every* query was a full table scan. Most DBA's I know would scream at > the thought. And also be highly confused why anyone would give that > up. > > There are two answers here. One is about "horizontal scalability". > Basically, SQL query planner optimizations don't generally work well > when there's a network connection involved. Note that I'm not saying > that SQL won't work in a clustered environment. There are companies > that have proven otherwise. The issue is that the difference isn't > clear to most people. Sure you can make SQL work on multiple machines, > but the tradeoffs between scaling SQL horizontally vs vertically are > complex and generally not well understood. > > The second issue is that non-relational stores have less information > to do query optimizations. If you're in the crowd that says "I don't > need scaling, I need to not be constrained by a schema" you're in this > group. When you don't define a schema you're not allowing for lots of > important information to be fed into some really awesome algorithms > for data retrieval. > > There's a rough analogy here between statically and dynamically typed > programming languages. Anyone that says one is better than the other > is someone you shouldn't listen to. There are extremes where both are > obviously better, but in the middle it gets into a complicated mess > where its more important on the observer's point of view than any sort > of theoretical points. If you have a well defined data usage, then > static might be best. A rule of thumb is, if you can't print your data > model on paper and have it be relevant for more than a couple months, > then you're probably forcing things a bit. > > Granted, there are some obvious optimizations that non-relational > stores could make by introspecting data, but its quite important to > note the difference here. Relational stores regardless of scaling have > the enormous benefit of having the data model defined up front. > > The best way I've learned to express this is that relational stores > require you to invest in defining how you get data into the store. > Non-relational means you have to invest in how to get data out. > > So, the end of the day decision is if the traditional relational > stores provide you the ability to meet your customer's demands. For a > lot of people over the last thirty or more years, this is true. It > might be less than the optimal, but it generally works. Some people > have recently realized that some of the assumptions in the relational > model are artificially constraining. NOSQL is basically a response to > this. A sudden realization of "If I ignore this one thing, my problem > is a lot easier." This is the issue that drives people. "What > constraint am I willing to give up?" The single piece of advice I > would give any relational-to-non convert is, "Analyze you're > requirements and ask any new data store what they give up in return > for their promise." > > CouchDB gives up isolation and consistency (in the traditional sense > of transactions), for distribution through replication. We prioritize > being consistent over being fast. And we use a funky programming > language to make it happen. > > Other data stores have different reasons for existing. Redis, which > I'm quite fond of, is roughly our opposite in that its geared towards > RAM and being fast. There are problems I would solve with either, but > no single problem for which I would use both interchangeably.
