[PERFORM] The many nulls problem

2008-03-14 Thread Kynn Jones
It often happens that a particular piece of information is non-null for a
small minority of cases.  A superficially different manifestation of this is
when two pieces of information are identical in all but a small minority of
cases.  This can be easily mapped to the previous description by defining a
null in one column to mean that its contents should be obtained from those
of another column.  A further variant of this is when one piece of
information is a simple function of another one in all but a small minority
of cases.

(BTW, I vaguely recall that RDb theorists have a technical term for this
particular design issue, but I don't remember it.)

In all these cases, the design choice, at least according to RDb's 101, is
between including a column in the table that will be NULL most of the time,
or defining a second auxiliary table that references the first one and
holds the non-redundant information for the minority of cases for which this
is necessary (and maybe define a VIEW that includes all the columns).
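As a concrete sketch of the two designs (all table and column names here are hypothetical, using the shipping/billing example below):

```sql
-- Design 1: one table, with a mostly-NULL column
CREATE TABLE customer (
    id            serial PRIMARY KEY,
    shipping_addr text NOT NULL,
    billing_addr  text              -- NULL means "same as shipping_addr"
);

-- Design 2: auxiliary table holding only the exceptional cases
CREATE TABLE customer2 (
    id            serial PRIMARY KEY,
    shipping_addr text NOT NULL
);
CREATE TABLE customer_billing (
    customer_id   int PRIMARY KEY REFERENCES customer2(id),
    billing_addr  text NOT NULL
);

-- Optional view reuniting the columns
CREATE VIEW customer_v AS
SELECT c.id, c.shipping_addr,
       COALESCE(b.billing_addr, c.shipping_addr) AS billing_addr
FROM customer2 c
LEFT JOIN customer_billing b ON b.customer_id = c.id;
```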

But for me it is a frequent occurrence that my quaint and simple RDb's 101
reasoning doesn't really apply for PostgreSQL.  Basically, Pg is too smart
for it!  For example, does a large proportion of NULLs really imply a lot of
wasted space?  Maybe this is true for fixed-length data types, but what
about for type TEXT or VARCHAR?

Just to be concrete, consider the case of a customers database for some home
shopping website.  Suppose that, as it happens, for the majority of this
site's customers, the shipping and billing addresses are identical.  Or
consider the scenario of a company in which, for most employees, the email
address can be readily computed from the first and last name using the rule
First M. Last = [EMAIL PROTECTED], but the company allows some
flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's
known to everyone by his nickname, Yaz, the email is
[EMAIL PROTECTED]; hardly anyone remembers or even knows his
full name.)
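The second scenario (a value derived from other columns except for a few overrides) can be sketched the same way with COALESCE; the domain and the exact derivation rule here are hypothetical, since the real addresses were elided above:

```sql
CREATE TABLE employee (
    id             serial PRIMARY KEY,
    first_name     text NOT NULL,
    last_name      text NOT NULL,
    email_override text            -- NULL means "use the computed address"
);

CREATE VIEW employee_v AS
SELECT id, first_name, last_name,
       COALESCE(email_override,
                lower(first_name) || '.' || lower(last_name) || '@example.com')
           AS email
FROM employee;
```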

What's your schema design approach for such situations?  How would you go
about deciding whether the number of exceptional cases is small enough to
warrant a second table?  Of course, one could do a systematic profiling of
various possible scenarios, but as a first approximation what's your
rule-of-thumb?

TIA!

Kynn


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Oleg Bartunov

Kynn,

have you seen contrib/hstore?  You can have one table with the common
attributes and hide the others in an hstore column
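A minimal sketch of that idea (names hypothetical; note that on 8.3 the hstore type is installed from the contrib SQL script, since CREATE EXTENSION only arrived in 9.1):

```sql
CREATE TABLE customer (
    id    serial PRIMARY KEY,
    name  text NOT NULL,   -- common attribute, a real column
    extra hstore           -- rare attributes hidden here
);

INSERT INTO customer (name, extra)
VALUES ('Yaz', 'nickname=>Yaz, billing_addr=>"1 Main St"');

-- pull a rare attribute back out; yields NULL when the key is absent
SELECT name, extra -> 'billing_addr' AS billing_addr FROM customer;
```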

Oleg
On Fri, 14 Mar 2008, Kynn Jones wrote:


[Kynn's original message quoted in full; snipped]



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Heikki Linnakangas

Kynn Jones wrote:

In all these cases, the design choice, at least according to RDb's 101, is
between including a column in the table that will be NULL most of the time,
or defining a second auxiliary column that references the first one and
holds the non-redundant information for the minority of cases for which this
is necessary (and maybe define a VIEW that includes all the columns).

But for me it is a frequent occurrence that my quaint and simple RDb's 101
reasoning doesn't really apply for PostgreSQL.  Basically, Pg is too smart
for it!  For example, does a large proportion of NULLs really imply a lot of
wasted space?  


It depends. If there are *any* NULLs in a row, a bitmap of the NULLs is 
stored in the tuple header. Without the NULL bitmap, the tuple header is 23 
bytes, and due to memory alignment it's always rounded up to 24 bytes. 
That one padding byte is free for use as the NULL bitmap, so it happens 
that if your table has eight columns or fewer, NULLs will take no space 
at all. If you have more columns than that, and there are *any* NULLs in a 
row, you'll waste a whole 4 or 8 bytes (or more, if you have a very wide 
table and go beyond the next 4/8-byte boundary), depending on whether 
you're on a 32-bit or 64-bit platform, regardless of how many NULLs 
there are.


That's on 8.3. 8.2 and earlier versions are similar, but the tuple 
header used to be 27 bytes instead of 23, so you have either one or five 
free bytes, depending on architecture.
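That arithmetic can be sketched in a few lines (a simplified model, assuming the 8.3-era layout described above: 23-byte header, one NULL-bitmap bit per column, and a MAXALIGN of 4 or 8 bytes depending on platform):

```python
def null_bitmap_overhead(ncols: int, maxalign: int = 8) -> int:
    """Extra bytes a row pays for its NULL bitmap (simplified 8.3 model).

    The 23-byte tuple header is padded up to MAXALIGN anyway, so the
    first bitmap byte (covering up to 8 columns) rides along for free.
    """
    HEADER = 23

    def align(n: int, a: int) -> int:
        return -(-n // a) * a  # round n up to the next multiple of a

    bitmap_bytes = -(-ncols // 8)  # one bit per column, rounded up to bytes
    return align(HEADER + bitmap_bytes, maxalign) - align(HEADER, maxalign)
```

Up to eight columns the overhead is zero; a ninth column costs a full 4 or 8 bytes, and a very wide table can cross further alignment boundaries.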


In any case, that's pretty good compared to many other RDBMSs.

 Maybe this is true for fixed-length data types, but what
 about for type TEXT or VARCHAR?

Datatype doesn't make any difference. Neither does fixed vs variable length.


What's your schema design approach for such situations?  How would you go
about deciding whether the number of exceptional cases is small enough to
warrant a second table?  Of course, one could do a systematic profiling of
various possible scenarios, but as a first approximation what's your
rule-of-thumb?


From a performance point of view, I would go with a single table with 
NULL fields on PostgreSQL.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] The many nulls problem

2008-03-14 Thread Kynn Jones
On Fri, Mar 14, 2008 at 3:46 PM, Heikki Linnakangas [EMAIL PROTECTED]
wrote:

 [tons of useful info snipped]

 From a performance point of view, I would go with a single table with
 NULL fields on PostgreSQL.


Wow.  I'm so glad I asked!  Thank you very much!

Kynn


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Kynn Jones
On Fri, Mar 14, 2008 at 2:59 PM, Oleg Bartunov [EMAIL PROTECTED] wrote:

 have you seen contrib/hstore ? You can have one table with common
 attributes
 and hide others in hstore


That's interesting.  I'll check it out.  Thanks!

Kynn


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Oleg Bartunov

On Fri, 14 Mar 2008, Kynn Jones wrote:


On Fri, Mar 14, 2008 at 2:59 PM, Oleg Bartunov [EMAIL PROTECTED] wrote:


have you seen contrib/hstore ? You can have one table with common
attributes
and hide others in hstore



That's interesting.  I'll check it out.  Thanks!


actually, hstore was designed specifically for this kind of problem.




Kynn



Regards,
Oleg
