Re: Database design question

Steven Ringo Wed, 28 Dec 2011 03:42:59 -0800

Thanks for this comprehensive explanation. Have an even deeperunderstanding now. Good to see I am not the only one that had to startover with the mental models of how a system might work.

You make some very interesting points about the planner (or lack thereofin CouchDB).

And I see you have also experienced what happens when an ORM abstractionis misunderstood in terms of its underlying execution. I see this a lotwith Rails. As an aside I have also noticed must CouchDB to Rails ORMstrying to make CouchDB perform like a SQL database. This is preciselywhat I was trying to avoid.

You also make some great points around compromises. I now am on the roadto having a much better idea as to what types of data and theirexchanges between one another are best suited to SQL and NoSQL respectively.

The app I am referring to would have worked very well with aconventional SQL store if it were server-only. SQL's design andnormalisation would be its strength, and would require much less code.However as soon as you throw mobile replication and offline use into themix, the SQL store becomes its weakness. Its here where NoSQL shines,and the minor tradeoffs of denormalisation pale in comparison to thebenefits.

You make an analogy between static and dynamic languages. I thinkthere's another great analogy: centralised vs decentralised versioncontrol systems (or Git vs Subversion if you like). I have oftendescribed CouchDB as the Git of the database world. Replication andpeer-to-peer is ingrained into Git as it is to Couch. SQL and Subversionare both strong single server solutions. Git has tradeoffs including asteeper learning curve and some conceptual issues to get one's headaround at the beginning. These are very nicely traded for cheap and easybranching, offline use and peer-replication. Git is terrible at managinglarge binary attachments, whereas Subversion handles this with aplomb.However try use Subversion at 30 000 feet in an aluminium tube over theAtlantic and you're stuck...

I am glad this discussion has created some food for thought. Thanksagain for your contribution.


Paul Davis wrote:

I'll prefix this with the fact that Sean does an excellent point of
responding your core concerns. I've had a bit of this brewing as of
late for a various number of reasons. But your email provokes a bit of
a response so I think this is as good of a place as any to write down
some thoughts.

Also, to preface, my nascent background was to be taught SQL and the
relational model by a couple guys that are deep into it. Before I
moved to non-relational datastores I was deeply entrenched in a purist
relational model. Purist beyond even most DBA's because I was taught
from practical application of theory. Things like "null != null" and
"null values are a hack around Codd's rules" are relatively benign but
points that I find few people have completely groked in the real
world. I was lucky to have these two guys teach me from the "this is
proper and this is why we ignore that" point of view.

On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo<[email protected]>  wrote:

Hi all,

(I originally posted this on the mobile-couchbase list, but was recommended
to post it here instead, since this is more of a general couchdb question)

I am struggling with something that I know would be quite trivial to do in
SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
of experience with SQL is clouding my judgement and so I am trying rather to
break away from the SQL/relational mindset and understand how this would be
better designed in couch.


This is good. The first thing to realize in learning a new tool is
that its not the old tool. New tools behave differently and
differences should be embraced rather than rejected outright.

I am developing an iOS art gallery app, and wish to display information
about artworks, artists and galleries.

The menu screen requires a list of all artworks together with the gallery
they are currently housed at and the artist that created the artwork:

  * artwork title
  * artwork medium
  * artwork dimensions
  * artist full_name
  * gallery name,
  * gallery city

for each artwork, respectively.

Imagine the following objects in my database (I am not including all the
fields for sake of brevity. i.e. only the ones that are important).

|     Artist
    ------
    _id
    full_name
    birth_year
    birth_place

    Gallery
    -------
    _id
    name
    city

    Artwork
    -------
    _id
    title
    medium
    dimensions
    artist_id_ (foreign key)
    gallery_id_ (foreign key)
|

In SQL it would be as easy as one query with two joins:

|SELECT
    artwork.title,
    artwork.medium,
    artwork.dimensions,
    artist.full_name,
    gallery.name,
    gallery.city
FROM
    artwork INNER JOIN artist
    ON artwork.artist_id_ = artist._id INNER JOIN gallery
    ON artwork.gallery_id_ = gallery._id
|


The fact that you took the time to specify INNER JOIN is a good tip
you have a thorough understanding of SQL.

 From this I would be able to get the data I need in one query (i.e. one call
to the database).

|     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
    Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
    David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
|

Using couchdb I am storing the data in a very similar way to the way I would
in a relational database, represented as something along the lines of:


Roughly speaking, this is your first mistake.

|{
   "type"       : "Artwork",
   "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
   "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
   "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
   "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
   "title"      : "Guernica",
   "medium"     : "Oil on canvas",
   "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
}

{
   "type"       : "Artist",
   "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
   "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
   "full_name"  : "Pablo Picasso",
   "birth_year" : "1881"
}

{
   "type"       : "Gallery",
   "_id"        : "d665afaedde12b44c5159cf1782d44af",
   "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
   "name"       : "Museo Reina Sofia"
}
|

To get the same result in couch, I would have to do one query to get all the
artworks, e.g.

|function(doc) {
  if(doc.type&&    doc.type == 'Artist') {
    emit(doc._id,doc);
  }
}
|

then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
returned per |Artwork| respectively to fill in the missing information. This
is usually known as the |n+1| queries problem (in this case |2n+1|).

I have always worked on the assumption that the database should be leveraged
to do as much of the heavy lifting as possible with as few network hits as
possible (i.e. SQL is always faster than client manipulation). With a list
of 120 artworks, this  means 241 HTTP GETs to the database.


This is your second mistake. And more to the point, this is the first
mistake I had to overcome myself when switching from relational to
non-relational data stores. Even more specifically, my stumbling block
was exactly the same as yours. The assumption that the number of
requests/queries is to be minimized.

I was taught quite specifically to minimize the number of queries per
page load. Current ORM layers are still abhorrent to me because of the
ease in which they translate directly to your worst case, each row in
the list ends up being three queries per row. Specifically, people
must understand both the relational model and their ORM layer to avoid
it from doing the Wrong Thing&trade;.

The conceptual realization that we need to make here is that SQL isn't
magical. When we issue a query, it goes through the planner and ends
up doing (seriously advanced) optimizations over the indexes and data
retrieval methods it has available. The "minimize number of queries"
mantra is a bit overly broad here. In practical terms it has the
effect of making sure that we're making the best use of indexes and
look up patterns. As an analogy, its akin to writing better C because
we might have some intuition on how the assembler translation might be
generated.

I am aware of "linked documents" in views
(http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
however this seems to only work with one related row.

I can't really see how "view collation"
(http://wiki.apache.org/couchdb/View_collation) would work here either. With
view collation the correlated information ends up being on different rows,
and I think that would be better suited for where say a single Artists or
Gallery is displayed with associated Artworks. My case is the opposite,
whereby Artworks plus corresponding information from Artists and Galleries
is required.


These are definitely tools in the arsenal, but don't really speak to
the core issue. I'll come back to this in a bit.

I am also aware that one could embed each gallery or artist in each artwork
or vice versa, but that seems to be a lot of duplication, and would make it
difficult to change artist or gallery information if required. Maybe this is
how it should be done?


As Sean pointed out, the second thing to overcome is data duplication.
Storage is cheaper than CPU and network costs. If you don't expect
data to change, you should try and exploit that.

For this small app, this is not going to be a performance issue. I am trying
rather to break away from a SQL/relational mindset and understand how this
would be better designed in a database like couch.


Your honest questions are quite refreshing because they remind me of
my own initial exploration. There's lots of people talking but I've
never found a really concise description of the difference between the
approaches. There's lots of anecdotal blog posts about "scaling" and
the such, but nothing I've seen really addresses the core issues.

Thanks,

Steve


So, the bottom line is that the difference between relational and
non-relational is two things. The first issue that people encounter
when moving from relational to non is that the query planner is gone.
This seems benign at fist, but affects a huge number of moving pieces
in anything but non-trivial applications. If the query planner is
gone, then all queries are less than optimal. Think as a DBA if
*every* query was a full table scan. Most DBA's I know would scream at
the thought. And also be highly confused why anyone would give that
up.

There are two answers here. One is about "horizontal scalability".
Basically, SQL query planner optimizations don't generally work well
when there's a network connection involved. Note that I'm not saying
that SQL won't work in a clustered environment. There are companies
that have proven otherwise. The issue is that the difference isn't
clear to most people. Sure you can make SQL work on multiple machines,
but the tradeoffs between scaling SQL horizontally vs vertically are
complex and generally not well understood.

The second issue is that non-relational stores have less information
to do query optimizations. If you're in the crowd that says "I don't
need scaling, I need to not be constrained by a schema" you're in this
group. When you don't define a schema you're not allowing for lots of
important information to be fed into some really awesome algorithms
for data retrieval.

There's a rough analogy here between statically and dynamically typed
programming languages. Anyone that says one is better than the other
is someone you shouldn't listen to. There are extremes where both are
obviously better, but in the middle it gets into a complicated mess
where its more important on the observer's point of view than any sort
of theoretical points. If you have a well defined data usage, then
static might be best. A rule of thumb is, if you can't print your data
model on paper and have it be relevant for more than a couple months,
then you're probably forcing things a bit.

Granted, there are some obvious optimizations that non-relational
stores could make by introspecting data, but its quite important to
note the difference here. Relational stores regardless of scaling have
the enormous benefit of having the data model defined up front.

The best way I've learned to express this is that relational stores
require you to invest in defining how you get data into the store.
Non-relational means you have to invest in how to get data out.

So, the end of the day decision is if the traditional relational
stores provide you the ability to meet your customer's demands. For a
lot of people over the last thirty or more years, this is true. It
might be less than the optimal, but it generally works. Some people
have recently realized that some of the assumptions in the relational
model are artificially constraining. NOSQL is basically a response to
this. A sudden realization of "If I ignore this one thing, my problem
is a  lot easier." This is the issue that drives people. "What
constraint am I willing to give up?" The single piece of advice I
would give any relational-to-non convert is, "Analyze you're
requirements and ask any new data store what they give up in return
for their promise."

CouchDB gives up isolation and consistency (in the traditional sense
of transactions), for distribution through replication. We prioritize
being consistent over being fast. And we use a funky programming
language to make it happen.

Other data stores have different reasons for existing. Redis, which
I'm quite fond of, is roughly our opposite in that its geared towards
RAM and being fast. There are problems I would solve with either, but
no single problem for which I would use both interchangeably.

Re: Database design question

Reply via email to