On Mon, Jul 27, 2009 at 02:44:08PM -0700, CityDev scratched on the wall:
>
> More recently the term 'denormalise' has been used
> instead. This is where you repeat foreign data in a table to avoid the
> overhead of joins at runtime.
> Over the intervening years I can't ever remember denormalising data (even
> when dealing with eg 13 million insurance customers in a table). Is it OK
> nowadays to say always aim to be fully normalised - modern RDBMSs are
> usually powerful enough to cope with most anything?
Normalization is a fundamental part of the Relational Model. The
first three Normal Forms were defined as part of the original
research into the Relational Model.
As you might remember, the basic idea behind normalization is to
eliminate all duplicate data. This is critical to data integrity,
which is critical to the assumptions made by the Relational Model.
A big part of normalization is making sure that each table (relation)
is used to represent one and only one "thing." A side effect of this
is that when you normalize your design there is only one copy of any
given atomic data unit. If there is only one copy, it is impossible
for that data to get out of sync, which further supports data
integrity and consistency.
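As a made-up illustration (the table and column names are purely
hypothetical): rather than repeating a customer's name and city on
every order row, you keep a single copy in its own table and
reference it:

    CREATE TABLE customers (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        city         TEXT NOT NULL
    );

    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customers(customer_id),
        order_date   TEXT NOT NULL,
        total        REAL NOT NULL
    );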
It also tends to INCREASE performance for everything but SELECT
statements. In general, normalized databases tend to be smaller,
thanks to all the duplicate elimination. That means less I/O,
which is a big performance hit for many database systems. If a
smaller database can represent the same data, it also means
a higher percentage of the unique data can fit into a given size
cache or memory pool, further reducing the need for I/O. It also
means that if you need to update something (insert/update/delete)
those changes tend to be extremely localized, making the change
smaller and faster.
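For instance, with the made-up schema above, a change of address is
one UPDATE against one row, no matter how many orders that customer
has on file:

    UPDATE customers SET city = 'Boston' WHERE customer_id = 42;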
The one trick is that if you split your data up into a bunch of
tables, you need to spend time gathering it back up for most queries.
That means JOINs.
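With the made-up schema above, pulling a useful report back out
means joining the pieces:

    SELECT o.order_id, o.order_date, c.name, c.city
      FROM orders AS o
      JOIN customers AS c ON c.customer_id = o.customer_id
     WHERE o.order_date >= '2009-01-01';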
There are two problems with JOINs. First, unless you're careful
about your index placement and how the query is set up, they can become
very expensive. Normally they're fairly efficient, but it only takes
a misplaced index to really screw things up. So you have to
understand what you're doing and pay attention. Not a real big deal,
but you do need to know what is going on.
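In SQLite that mostly comes down to indexing the join column on the
"many" side and then checking the plan rather than guessing. With the
hypothetical schema from above:

    CREATE INDEX idx_orders_customer ON orders(customer_id);

    EXPLAIN QUERY PLAN
    SELECT o.order_id, c.name
      FROM orders AS o
      JOIN customers AS c ON c.customer_id = o.customer_id;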
The other issue (that plays strongly into the first) is that a lot of
people fundamentally cannot wrap their heads around JOINs. If you're
uncomfortable with JOINs, you tend to design databases that don't
require them. And that's really too bad, as the JOIN is, in many
ways, *The* fundamental Relational operator.
So, in general, I would say yes... the goal should be 3NF (in nearly
all cases, if you get to 3NF you get to 5NF... the last two are to
cover a few somewhat unusual edge cases). The normal forms really
buy you a lot of practical benefits, and are a lot more than just a
bunch of stuffy academic design rules.
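A quick, made-up example of the kind of thing 3NF catches: if each
order row also carries the customer's city, the city depends on the
customer, not the order (a transitive dependency), and 3NF says to
move it out:

    -- Violates 3NF: customer_city depends on customer_id, not order_id.
    CREATE TABLE orders_bad (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER NOT NULL,
        customer_city TEXT NOT NULL
    );
    -- The fix is the customers/orders split shown earlier.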
I think it is fair to view denormalization as a last-resort performance
optimization, but it should be seen as an optimization and it should
be nearly a last resort. It shouldn't be done until you can prove the
performance isn't there, that there are no other practical ways of
getting it back (e.g. better index placement), and that you've
measured and profiled (and NOT guessed) that the JOINs (or whatever
the denormalization is going to get rid of) are the actual cause of
the performance issues.
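With SQLite that can be as simple as timing the real query in the
shell and reading the plan (again with the hypothetical tables from
above):

    sqlite> .timer on
    sqlite> SELECT o.order_id, c.name
       ...>   FROM orders AS o
       ...>   JOIN customers AS c ON c.customer_id = o.customer_id;

If the time is acceptable, or EXPLAIN QUERY PLAN shows the join
already using an index, denormalizing isn't going to buy you much.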
The big exception to this, and where the most blatant denormalization
takes place, is in data warehouses. That's a pretty special case,
however. First, the data is basically write-once, read-many. Once the
data is imported, it is usually treated as read-only. The lack of
updates means there is less chance of inconsistencies being
introduced-- assuming the original data is good. Second, the data is
usually dumped from another database that has higher normalization.
That is, the data is "known good" due to the fact that it came from
what was, in theory, a properly normalized and well formed database.
This essentially turns the data warehouse into a series of "instanced
views": useful sets of data are "pre-queried" into tables, capturing
the common sub-parts that would otherwise have to be joined together
from several tables for nearly every query you might want to run.
It is exactly the kind of thing you might use a view for, only you
select it into a table. Since the database lacks updates except,
typically, for nightly batch imports, this works out just fine as
long as you remember to re-form the "views" each time you import new
data.
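In SQL terms the "instanced view" pattern is just selecting a
view-shaped query into a real table after each import (hypothetical
names again):

    -- Rebuilt after each nightly import; a view materialized
    -- into a table.
    DROP TABLE IF EXISTS order_summary;
    CREATE TABLE order_summary AS
        SELECT c.customer_id, c.name, c.city,
               COUNT(*)     AS order_count,
               SUM(o.total) AS total_spent
          FROM customers AS c
          JOIN orders    AS o ON o.customer_id = c.customer_id
         GROUP BY c.customer_id, c.name, c.city;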
On the other hand, with the advent of the data-driven website, many
databases see a far greater number of reads than writes, so it makes
some sense to push read performance over other issues. This is a
tricky balance. It exposes a lot of data management problems and
pushes more application logic and validation up into the database
access layer. On the other hand, if the database access layer is a
website (and not a desktop application) that has one code base (fewer
deployment problems), and is controlled by the same folks who run
the database, there is at least some hope of keeping all the Is
dotted and Ts crossed.
On the gripping hand, memory and processing power are really cheap
(fast storage, not so much) compared to programmer time. Unless the
database has hundreds of thousands of rows, if you're running on a
modern system chances are none of it matters. You'll generally do
fine and it is doubtful the question will ever come up. Those
running on a phone, perhaps not, but a server with a pile of RAM
should let you do almost anything (just remember to adjust things so
that said resources can be effectively used).
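In SQLite terms, "adjusting things" is mostly a matter of letting the
page cache actually use that pile of RAM; for example (values are
made up, tune to taste):

    PRAGMA cache_size = 100000;   -- page cache size, in pages
    PRAGMA temp_store = MEMORY;   -- keep temp tables/indices in RAM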
Overall, I'd say that normalization is like many "rules." There are
good reasons for normalization and the "always normalize" rule should
be strongly respected. *Respecting* a rule doesn't always mean
*following* a rule, however. If you understand the rule, why it is
there, the thoughts and ideas behind it, and the full consequences of
breaking it, then you're allowed to make a conscious decision to cross
that line and accept the consequences. Breaking a rule by default or
ignorance is usually bad news, however.
"Normalize until it hurts, denormalize until it works."
-j
--
Jay A. Kreibich < J A Y @ K R E I B I.C H >
"Our opponent is an alien starship packed with atomic bombs. We have
a protractor." "I'll go home and see if I can scrounge up a ruler
and a piece of string." --from Anathem by Neal Stephenson
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users