Hi Rich,

> I wanted to solicit some opinions about how best to layout a database table 
> for a project I am working on.  I have a distributed application that reports 
> run time information back to a central machine.  Included in the report are 
> the parameters used during runtime... approx 25 or so.  Parameters are site 
> specific and may be changed locally by whomever runs the application.  I have 
> a table that records an application instance (columns such as location, 
> version etc), and a table that records report instances (time, date, size).  
> I wanted to be able to query at a glance the most recent parameter set in 
> use, so I came up with the following table design...
> 
> Design A)
> primaryKey | appInstanceId | reportId | param1 | param2 | ... | param25

so that would be:

create table "Application Instance"
(       appInstanceId integer primary key not null
,       name text not null collate nocase
,       location text not null collate nocase
,       version text not null collate nocase
)
;
create table "Report"
(       reportId integer primary key not null
,       datetime real not null
,       size integer
)
;
create table "Runtime"
(       primaryKey integer primary key not null
,       appInstanceId integer not null
,       reportId integer not null
,       param1
,       param2
,       param3
,       param4
,       param5
,       param6
,       param7
,       param8
,       param9
,       param10
,       param11
,       param12
,       param13
,       param14
,       param15
,       param16
,       param17
,       param18
,       param19
,       param20
,       param21
,       param22
,       param23
,       param24
,       param25
)
;

> I spoke to a few people at work, and they all say the same thing... "this 
> isn't normalized'.  I get the following suggestion:
> 
> Design B)
> primaryKey | appInstanceId | reportId |paramNum |paramValue
> ---with a second table---
> paramNum | paramDescription

Which would be:

create table "Application Instance"
(       appInstanceId integer primary key not null
,       name text not null collate nocase
,       location text not null collate nocase
,       version text not null collate nocase
)
;
create table "Report"
(       reportId integer primary key not null
,       datetime real not null
,       size integer
)
;
create table "Param"
(       paramNum integer primary key not null
,       paramDescription text
)
;
create table "Runtime"
(       primaryKey integer primary key not null
,       appInstanceId integer not null
                references "Application Instance" (appInstanceId) on delete 
cascade on update cascade
,       reportId integer not null
                references "Report" (reportId) on delete cascade on update 
cascade
,       paramNum integer not null
                references "Param" (paramNum) on delete restrict on update 
cascade
,       paramValue not null
)
;

> The reason for Design B:
>   New parameters are easily added (no schema change)
>   It is "Better Design" (throwing around the word normalization a lot)

Agreed.

> My reasons for Design A:
>   1:1 record to report (not 1:25 records)
>   meta data overhead per row is 1/9 (opposed to 4/1)

Those aren't problems. You can use views and other mechanisms to present the 
data in whatever layout you like (so you don't have to cross-reference 
manually). But the underlying data should be stored efficiently and, yes, 
normalized.
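
For example, a view along these lines (only a sketch, showing just the first 
three parameters of Design B's "Runtime" table; the view name is made up) 
presents the normalized rows in the flat layout of Design A:

create view "Runtime Flat" as
select  appInstanceId
,       reportId
,       max(case when paramNum = 1 then paramValue end) as param1
,       max(case when paramNum = 2 then paramValue end) as param2
,       max(case when paramNum = 3 then paramValue end) as param3
        -- ...and so on for the remaining parameters
from "Runtime"
group by appInstanceId, reportId
;

Queries and reports can then read from "Runtime Flat" as if it were the flat 
table, while the stored data stays normalized.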

> I'm not a database guru, so I can't fight back... but I feel I am hearing a 
> line straight from a textbook.  Critics of my design simply tell me it is 
> wrong and keep repeating the mantra "It is not normalized".

Yes, that is frustrating. I've been there: years ago, even on this list, I had 
"normalized" thrown at me without explanation. The best thing to do is research 
what normalization actually means. I am a convert now ;-) The "Definitive Guide 
to SQLite" book is a good place to start, in its chapter on "Normalization" 
(though read the preceding pages first). In part it states:

>> Normalization concerns itself with the organization of attributes within 
>> relations so as to minimize duplication of data. Data duplication, as you 
>> will see, has more deleterious effects than just taking up unnecessary 
>> space. It increases the opportunity for database inconsistencies to arise. 
>> Normalization is about designing your database to be more resistant to the 
>> ill effects of thoughtless users and buggy programs. 

>> As stated, the chief aim of normalization is to eradicate duplication. 
>> Relations that have duplication removed are said to be normalized. However, 
>> there are degrees of normalization. These degrees are called normal forms.

>> First normal form simply states that all attributes in a relation use 
>> domains that are made up of atomic values..., meaning simply “that which 
>> cannot be broken down further.”

>> To understand second and third normal form, you have to first understand 
>> functional dependencies. The simple definition of a functional dependency is 
>> a correlation between columns in a table. If the values of one column (or 
>> set of columns) correlate to the values of another, then they have a 
>> functional dependency... Many times, functional dependencies are a warning 
>> sign that duplication lurks within a table.

>> Second normal form is defined in terms of functional dependencies. It 
>> requires that a relation be in first normal form and that all non-key 
>> attributes be functionally dependent upon all attributes of the primary key 
>> (not just part of them)... So what do you do? You decompose the table into 
>> two tables.

>> This is like factoring out a common variable in an algebraic expression... 
>> Furthermore, no information is lost... But look what else you get: the new 
>> design allows referential integrity to back you up: you have a foreign key 
>> constraint guarding this relationship..., which you couldn’t get in the 
>> previous form. What was previously only implied is now both explicit and 
>> enforceable.

That's a brief extract. It makes more sense in context, with included examples.
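
To see that "explicit and enforceable" part in action against the Design B 
schema above (a sketch only; note that SQLite enforces foreign keys only when 
the pragma is switched on, and only in version 3.6.19 or later):

pragma foreign_keys = on;

-- Assuming application instance 1 and report 1 already exist, this insert is
-- still rejected with a foreign key constraint error, because the made-up
-- paramNum 99 has no matching row in "Param":
insert into "Runtime" (appInstanceId, reportId, paramNum, paramValue)
values (1, 1, 99, 'oops');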

> The way I see it the column itself describes the data, which means less data 
> is stored... efficient and simple.

There are several problems with that approach, such as:

1. The parameter names _are_ actually data, so should be stored as such, not in 
the schema itself.

2. You end up with a pile of null values, which is inefficient and redundant.

3. The human mind copes well with a single huge flat table, as you suggest, but 
that doesn't make it efficient or simple for a computer system. You can store 
it one way, but show it in many ways. The storage is the part to get right, or 
it will bite you again and again later on. You can always adjust the views, but 
the underlying structure rapidly becomes set in stone, so it's important to get 
it right, even if you don't see the advantages up front.

4. It's difficult or impossible to add or remove parameters later without 
altering the whole schema. That has a huge overhead, such as potentially 
breaking dependent systems (which possibly don't even exist yet), or having to 
back up rows, drop the old table, create a new one, re-insert the old data and 
so on (see the sketch after this list).

5. You have to tie external mechanisms to the specific names of the columns and 
change all those mechanisms whenever you change those columns. You can't, for 
instance, simply sort the output by parameter description.
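
To make point 4 concrete, here's a rough sketch (the 26th parameter, "cache 
size", is just a made-up example):

-- Design B: a new parameter is simply data; no schema change is needed:
insert into "Param" (paramNum, paramDescription)
values (26, 'cache size');

-- Design A: the same change alters Design A's "Runtime" table itself, and
-- every query, view and program that relies on its column list:
alter table "Runtime" add column param26;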

> I do see the appeal of Design B in effectively 'future-proofing' the database 
> table. My application performing the inserts should not have to change; I 
> just always insert the parameters I find in each report.

Yes. And the benefits become more apparent with time.

> I should add that I do expect future parameters to be added (maybe 5,10 more 
> at most, but rather infrequently).  Old and new versions
> will co-exist, but eventually all application instances should be updated.  
> With Design A I would add new columns and set existing records to NULL.

There's reason enough in that paragraph to urge you to run towards a normalized 
database design.
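
For instance (a sketch, with made-up id values), an old-version instance that 
only knows about two parameters just stores two rows for its report, with no 
NULL padding:

insert into "Runtime" (appInstanceId, reportId, paramNum, paramValue)
values (1, 42, 1, 'on');
insert into "Runtime" (appInstanceId, reportId, paramNum, paramValue)
values (1, 42, 2, '15');

A newer instance reporting 30 parameters simply stores 30 rows against the same 
schema.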

> I would expect 100,000 reports a month or so.  When I select from parameters 
> I generally will show all parameters, so my queries become "select * from tbl 
> where appid=x and reportId=y" with 1 result set necessary (not iterating over 
> 25 results).

For the normalized design, to get, say, the paramDescription and paramValue 
pairs, sorted alphabetically, for a particular "Application Instance" (name = 
'Safari' and location = 'Reception Desk') and the latest Report, your query 
becomes:

select paramDescription, paramValue
from "Runtime"
        join "Param" using (paramNum)
        join "Report" using (reportId)
        join "Application Instance" using (appInstanceId)
where "Application Instance".name = 'Safari'
        and "Application Instance".location = 'Reception Desk'
        and "Report".datetime = (select max(datetime) from "Report")
order by paramDescription
;

Note that this only gives you results that actually exist: no extra nulls that 
aren't needed. The results come back as rows, not columns, which is far more 
manageable and predictable.
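
If you really do want the whole parameter set on one line per report, you can 
still collapse the rows in the query itself; a rough sketch using SQLite's 
group_concat:

select reportId
,       group_concat(paramDescription || '=' || paramValue, ', ') as params
from "Runtime"
        join "Param" using (paramNum)
group by reportId
;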

> I realize design A may not be "best", but I would prefer a friendly answer 
> and not a canned response like I get from co-workers.

Hopefully this was friendly and leads you towards "enlightenment" ;-)

Tom

Tom Brodhurst-Hill
BareFeetWare

--
iPhone/iPad/iPod and Mac software development, specialising in databases
develo...@barefeetware.com
--
Twitter: http://twitter.com/barefeetware/
Facebook: http://www.facebook.com/BareFeetWare
