Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-28 Thread Leandro Guimarães Faria Corcete Dutra
Em Qui, 2006-01-19 às 22:29 +0100, Martijn van Oosterhout escreveu:
 Possibly nowhere. But when you send invoices to customers, any details
 on there *are* immutable. Sure, in your database you don't care if
 things change, but then they don't match reality anymore do they?

Then what you need is a temporal database -- at least some form of
historical records.  Nothing to do with keys in themselves.


 I never said there were duplicate tuples, just that the data has no
 natural keys. The tuples are unique because there's a surrogate key.

This does not guarantee uniqueness, as the key is artificially and
internally generated.


 It
 is entirely possible to have two people with the same first name, last
 name and date of birth. Rather uncommon, but the database must be able
 to support it.

And the way to support it is to take into account additional data --
place of birth, parents' data etc -- as part of the candidate keys.  Not
to allow duplicates.


 I don't understand your example though. If you have a personnel
 directory with two rows with the same first and last name, what does
 that tell you. Nothing. You have to go find out whether there really
 are two of those people or not.

And how will you do that if you don't store additional data?


 You can simplify the process by taking
 into account the fact that it's very unlikely, but a unique constraint
 is not the answer.

Oh yes, it is.  They only one.


 Besides, it's far more likely the same person will
 appear twice with two different spellings of their name. :)

So what?

-- 
+55 (11) 5685 2219   xmpp:[EMAIL PROTECTED]
+55 (11) 9406 7191   Yahoo!: lgcdutra
+55 (11) 5686 9607 MSN: [EMAIL PROTECTED]
+55 (11) 4390 5383  ICQ/AIM: 61287803


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-24 Thread Bruno Wolff III
On Thu, Jan 19, 2006 at 00:06:41 -0500,
  Tom Lane [EMAIL PROTECTED] wrote:
 
 The problem with SSN is that somebody other than you controls it.
 If you are the college registrar, then you control the student's
 registration number, and you don't have to change it.  In fact, guess
 what: you probably generated it in the same way as a surrogate key.

I work for a University and even the numbers assigned by us get changed on a
regular basis as it is very easy for people to get entered into the
system multiple times. (And for a while campus ids were SSNs by default and we
are still in the process of making them different for everyone.) There are
several effectively surrogate keys (campus id and emplid), but they don't map
1 to 1 to real people. I believe we keep a history of campus ids, and delete
emplids for duplicates.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-24 Thread Bruno Wolff III
On Thu, Jan 19, 2006 at 09:53:11 -0500,
  [EMAIL PROTECTED] wrote:
 
 Yes. Representation of the DNA is probably best. But - that's a lot of
 data to use as a key in multiple tables. :-)

On a simple level, this would be a problem for twins.
There are other complications as well. People are going to have slightly
different DNA in different cells due to mutations. Though you could probably
do some averaging over a number of cells to get a single value.
For people that have had transplants, you could probably define something
for doing the sample for original material.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-20 Thread Mike Rylander
On 1/19/06, Pollard, Mike [EMAIL PROTECTED] wrote:
 Martijn van Oosterhout wrote:

  Please provides natural keys for any of the following:
 
  - A Person
  - A phone call: (from,to,date,time,duration) is not enough
  - A physical address
  - A phone line: (phone numbers arn't unique over time)
  - An internet account: (usernames not unique over time either)

 Ahh, a challenge.  Hmm, not sure about all of them, but here goes:

 A Person - well, you could use a bit map of their fingerprints, or maybe
 their retinal scan.  Of course, that could change due to serious injury.
 Maybe some kind of representation of their DNA?

Unless the person in question happens to be a chimera (yes, they do exist).

;-)

--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread David Fetter
On Thu, Jan 19, 2006 at 12:06:41AM -0500, Tom Lane wrote:
 Michael Glaesemann [EMAIL PROTECTED] writes:
  As far as I can tell, the only difference between your position,
  Dann, and Date and Darwen's, is that you think no natural key is
  immutable.
 
 DD's examples of natural keys are worth a second look though:
 
  If a primary key exists for a collection that is known never to
  change, for example social security number, student registration
  number, or employee number, then no additional system-assigned
  UID is required.
 
 The problem with SSN is that somebody other than you controls it.

No, that's not the big problem.  The big problem is that it's very
likely illegal for you to use it for anything unless you happen to be
the Social Security Administration.

 If you are the college registrar, then you control the student's
 registration number, and you don't have to change it.  In fact,
 guess what: you probably generated it in the same way as a surrogate
 key.

True.

 I'd argue that all of these are in reality the exact same thing as a
 surrogate key --- from the point of view of the issuing authority.
 But from anyone else's point of view, they are external data and you
 can't hang your own database design on the assumption that they
 won't change.

Right :)

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Martijn van Oosterhout
On Thu, Jan 19, 2006 at 12:50:52AM -0800, David Fetter wrote:
 On Thu, Jan 19, 2006 at 12:06:41AM -0500, Tom Lane wrote:
 No, that's not the big problem.  The big problem is that it's very
 likely illegal for you to use it for anything unless you happen to be
 the Social Security Administration.

Actually no. From reading here[1] it appears anyone is allowed to ask
you your SSN and they can do what they like with it.

What you're describing is more like the TFN in Australia. Apart from
the fact you're not required to have one or provide it if asked, if
you're not a bank, or share registrary or some other such institution,
you're not allowed to ask for it, let alone store it. Medicare number
the same, if you're not a health service provider, you can't ask for
it.

Anyway, this doesn't mean an SSN is a good key, for all sorts of other
reasons people have already stated.

[1] http://www.cpsr.org/prevsite/cpsr/privacy/ssn/ssn.faq.html#IsItIllegalToAsk

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Pollard, Mike
Martijn van Oosterhout wrote:

 Please provides natural keys for any of the following:
 
 - A Person
 - A phone call: (from,to,date,time,duration) is not enough
 - A physical address
 - A phone line: (phone numbers arn't unique over time)
 - An internet account: (usernames not unique over time either)

Ahh, a challenge.  Hmm, not sure about all of them, but here goes:

A Person - well, you could use a bit map of their fingerprints, or maybe
their retinal scan.  Of course, that could change due to serious injury.
Maybe some kind of representation of their DNA?

A physical address - how about longitude/latitude/height from sea level?

The point here is two-fold.  First, what we call 'natural' is frequently
itself a surrogate key (yes, even your name is really just a surrogate
key.  As with all surrogate keys, it is a sequence of symbols that you
use to represent yourself).  The second point is even when you find a
truly 'natural' key (something not arbitrarily made up by anyone, and
uniquely identifying the data in question), it may be completely and
utterly inappropriate to use in a database.

What is 'natural' anyway?  If someone phones in an order, we usually
assign an order number to that request.  This order number is not the
actual order, and the customer couldn't care a less what it is, but I've
never heard a DBA argue we should get rid of it (well, to be fair, I've
never discussed order numbers with a DBA at all).  After all, would it
make sense for the key for that order to be the customer's name, the
date/time of the order, all the items ordered, and the address to ship
the order?  That isn't a key, but it's the only 'natural' thing that
identifies that order that immediately comes to my mind.

On the other hand, would anyone argue that an order_item table should
have a surrogate key?  Well, I wouldn't.  The key for the order_item
table should be something like the order number and the inventory item
number together (IMHO).

The point?  Surrogate keys and natural keys are two tools in the
database arsenal.  Just as it is unwise to use a hammer to drive a screw
just because you don't believe in screwdrivers, it is unwise to just off
hand discard either method of specifying a key.  Rather, use
intelligence and education (one of which is discussions such as this) in
deciding how best to represent your data to aide in performance, ease of
use, and adaptability.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread mark
On Thu, Jan 19, 2006 at 09:37:12AM -0500, Pollard, Mike wrote:
 Martijn van Oosterhout wrote:
  Please provides natural keys for any of the following:
  - A Person
  - A phone call: (from,to,date,time,duration) is not enough
  - A physical address
  - A phone line: (phone numbers arn't unique over time)
  - An internet account: (usernames not unique over time either)
 Ahh, a challenge.  Hmm, not sure about all of them, but here goes:
 A Person - well, you could use a bit map of their fingerprints, or maybe
 their retinal scan.  Of course, that could change due to serious injury.
 Maybe some kind of representation of their DNA?

Yes. Representation of the DNA is probably best. But - that's a lot of
data to use as a key in multiple tables. :-)

 A physical address - how about longitude/latitude/height from sea level?

Planet? Solar system? Galaxy? Universe? :-)

I agreed with what you had to say (the stuff I deleted). Just felt
like being funny. Not sure if I'm successful. Hehe...

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Jim C. Nasby
On Thu, Jan 19, 2006 at 09:37:12AM -0500, Pollard, Mike wrote:
 The point?  Surrogate keys and natural keys are two tools in the
 database arsenal.  Just as it is unwise to use a hammer to drive a screw
 just because you don't believe in screwdrivers, it is unwise to just off
 hand discard either method of specifying a key.  Rather, use
 intelligence and education (one of which is discussions such as this) in
 deciding how best to represent your data to aide in performance, ease of
 use, and adaptability.

There is one thing to consider: consistency. If you mix and match
'natural' keys and surrogate keys as PK, then how do you know which one
you're supposed to be joining on? How does everyone else on the team
know?

Sure, there's many examples where you don't really need a surrogate key.
But there's just as many (if not more) where you want a surrogate key so
that you don't have to deal with the pain of a multiple-field key. (Note
that I don't consider simply defining a multiple-field key to be unique
as painful). So ISTM it's much easier to just use surrogate keys and be
done with it. Only deviate when you have a good reason to do so.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Josh Berkus

Martjin,


In any of these either misspellings, changes of names, ownership or
even structure over time render the obvious useless as keys. There are
techniques for detecting and reducing duplication but the point is that
for any of these duplicates *can* be valid data.


Please point me out where, in the writings of E.F. Codd or in the SQL 
Standard, it says that keys have to be immutable for the life of the row.


Duplicate *values* can be valid data.  Duplicate *tuples* show some 
serious flaws in your database design.  If you have a personnel 
directory on which you've not bothered to define any unique constraints 
other than the ID column, then you can't match your data to reality.  If 
you have two rows with the same first and last name, you don't know if 
they are two different people or the same person, duplicated.  Which 
will be a big problem come paycheck time.


Per E.F. Codd, each tuple is a *unique* predicate (or key) comprising a 
set of values definining a *unique* data entity.  i.e. The employeee 
named John Little at extension 4531.  There is nothing anywhere 
said about keys never changing.


This is Databases 101 material.  Really!

--Josh


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Josh Berkus

Jim,

 So ISTM it's much easier to just use surrogate keys and be

done with it. Only deviate when you have a good reason to do so.


The lazy man's guide to SQL database design, but Jim Nasby.

;-)

--Josh


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread mark
On Thu, Jan 19, 2006 at 10:11:51AM -0800, Josh Berkus wrote:
  So ISTM it's much easier to just use surrogate keys and be
 done with it. Only deviate when you have a good reason to do so.
 The lazy man's guide to SQL database design, but Jim Nasby.
 ;-)

Hehe... I was thinking the same thing. I've definately seen cases
where the use of surrogate keys verges on ridiculous. It hasn't
harmed the application, except it terms of complexity. It still works.
It still performs fine. The SQL queries are awful looking. :-)

That's where I would tend to draw the line. For me, I find
implementation and maintenance to be the most expensive part of my
applications. My data hasn't yet become large enough to make disk
space, compute resources, or I/O bandwidth a serious concern.

If I think the use of surrogate keys may make my life harder, I'll try
not to use them. If I think they may make my life easier, I'll use
them without blinking an eye. Harder vs. easier = cost to implement.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Dann Corbit


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Josh Berkus
 Sent: Thursday, January 19, 2006 10:09 AM
 To: Martijn van Oosterhout
 Cc: pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Surrogate keys (Was: enums)
 
 Martjin,
 
  In any of these either misspellings, changes of names, ownership or
  even structure over time render the obvious useless as keys. There
are
  techniques for detecting and reducing duplication but the point is
that
  for any of these duplicates *can* be valid data.
 
 Please point me out where, in the writings of E.F. Codd or in the SQL
 Standard, it says that keys have to be immutable for the life of the
row.

Only do that for data that you care about.  If you think that the data
has no value, there is no need to have a way to identify a row.

 Duplicate *values* can be valid data.  Duplicate *tuples* show some
 serious flaws in your database design.  If you have a personnel
 directory on which you've not bothered to define any unique
constraints
 other than the ID column, then you can't match your data to reality.
If
 you have two rows with the same first and last name, you don't know if
 they are two different people or the same person, duplicated.  Which
 will be a big problem come paycheck time.
 
 Per E.F. Codd, each tuple is a *unique* predicate (or key) comprising
a
 set of values definining a *unique* data entity.  i.e. The employeee
 named John Little at extension 4531.  There is nothing anywhere
 said about keys never changing.
 
 This is Databases 101 material.  Really!

I give it an 'F.'

When the data changes, the problems generated are not just due to
repercussions related to the child and parent tables related through the
primary key.

Someone has an invoice, and they call in with a question. A combination
of their name and address was used as a primary key.  They moved, and
sent in a forwarding address.  The DBA was smart enough to design the
database to cascade results, so that there are no orphan records and we
have not compromised the structure of the database.
The customer calls in with a question about an old invoice.
We have no record of that transaction.

I was a DBA for a database for a company with many millions of customers
worldwide (e.g. the product registration table was 24 GB).

Their design had natural keys in it.  It caused dozens of problems,
every single day.

I content that most people are not smart enough to decide when a natural
key is a good idea.  The engineers that designed the database were
probably pretty smart, since it sort of worked and had thousands of
tables and hundreds of millions of rows in it.  But one bad decision on
a natural key will cause literally millions of dollars of damage.

The primary defense I have heard so far is that the Oids are hard to
understand.  They are nothing in comparison with understanding what to
do when you have 25 changes to primary keys on various tables every
single day.

Once you get used to Oids, I find it hard to believe that any
intelligent person finds them confusing.  Confusion resulting from
having primary keys that are a moving target?  Now that's confusion for
you.

IMO-YMMV.

I think it is time for me to give it a rest, though.  My experience may
be very atypical and I feel strangely passionate about it.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Jim C. Nasby
On Thu, Jan 19, 2006 at 02:01:14PM -0500, [EMAIL PROTECTED] wrote:
 On Thu, Jan 19, 2006 at 10:11:51AM -0800, Josh Berkus wrote:
   So ISTM it's much easier to just use surrogate keys and be
  done with it. Only deviate when you have a good reason to do so.
  The lazy man's guide to SQL database design, but Jim Nasby.
  ;-)
 
 Hehe... I was thinking the same thing. I've definately seen cases
 where the use of surrogate keys verges on ridiculous. It hasn't
 harmed the application, except it terms of complexity. It still works.
 It still performs fine. The SQL queries are awful looking. :-)

Got an example?

 That's where I would tend to draw the line. For me, I find
 implementation and maintenance to be the most expensive part of my
 applications. My data hasn't yet become large enough to make disk
 space, compute resources, or I/O bandwidth a serious concern.

Which is exactly what my thought process is. If you mix surrogate and
non-surrogate keys, how do you know which table has which? Sure, while
you're actively writing the code it's not an issue, but what about 6
months later? What about if someone else picks up the code?

I know Josh was poking fun with his comment about me being lazy, but
lazy can make for better code. I can go back to code I wrote 3 years ago
and I know that 99% of tables will have something_id (where something is
almost certain to be the table name) as a surrogate key to join on;
there's no need for me to go and figure out what does and what doesn't
have a surrogate key. The 1% that don't fall into that generally aren't
an issue because they're normally very large tables that nothing joins
to.

There's actually an article floating around somewhere about how lazy
coders are good coders... :)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Jim C. Nasby
On Thu, Jan 19, 2006 at 11:22:24AM -0800, Dann Corbit wrote:
   In any of these either misspellings, changes of names, ownership or
   even structure over time render the obvious useless as keys. There
 are
   techniques for detecting and reducing duplication but the point is
 that
   for any of these duplicates *can* be valid data.
  
  Please point me out where, in the writings of E.F. Codd or in the SQL
  Standard, it says that keys have to be immutable for the life of the
 row.
 
 Only do that for data that you care about.  If you think that the data
 has no value, there is no need to have a way to identify a row.

Erm... if you don't care, why are you storing it? :)

 I was a DBA for a database for a company with many millions of customers
 worldwide (e.g. the product registration table was 24 GB).
 
 Their design had natural keys in it.  It caused dozens of problems,
 every single day.
 
 I content that most people are not smart enough to decide when a natural
 key is a good idea.  The engineers that designed the database were
 probably pretty smart, since it sort of worked and had thousands of
 tables and hundreds of millions of rows in it.  But one bad decision on
 a natural key will cause literally millions of dollars of damage.
 
 The primary defense I have heard so far is that the Oids are hard to
 understand.  They are nothing in comparison with understanding what to
 do when you have 25 changes to primary keys on various tables every
 single day.
 
 Once you get used to Oids, I find it hard to believe that any
 intelligent person finds them confusing.  Confusion resulting from
 having primary keys that are a moving target?  Now that's confusion for
 you.

Well, I wouldn't use OIDs as in the PostgreSQL OID, but I agree. If
nothing else an ID gives you a fallback... if you absolutely can't find
a customer (or whatever else) through natural keys, you ask them for
their customer ID/number, which has no reason to ever change.

BTW, if you want to see a mess*, take a look at the distributed.net stats
code, which unfortunately uses email as the means to identify
participants. It made perfect sense originally, anyone running the
client was bound to have an email address, and they all had to be
unique, right? Worked great until the first person contacted us
wondering how to change his email address in stats because he'd changed
ISPs. If you look at todays statscode (at least the database portion of
it) approximately 50% of it is there to deal with people retiring one
email address into another, and I'd say that 90%+ of the bugs are in
this code. Had we just required new users to register to get a nice
shiny unique numeric ID (or a unique username...), none of that code
would exist.

* note that I'm not trying to rag on any of the numerous people who've
been involved in the stats code over the years, but it is insightful to
look at some of the 'dumb mistakes' that have been made and the large
amount of pain that it's caused.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread mark
On Thu, Jan 19, 2006 at 01:48:18PM -0600, Jim C. Nasby wrote:
 On Thu, Jan 19, 2006 at 02:01:14PM -0500, [EMAIL PROTECTED] wrote:
  On Thu, Jan 19, 2006 at 10:11:51AM -0800, Josh Berkus wrote:
So ISTM it's much easier to just use surrogate keys and be
   done with it. Only deviate when you have a good reason to do so.
   The lazy man's guide to SQL database design, but Jim Nasby.
   ;-)
  Hehe... I was thinking the same thing. I've definately seen cases
  where the use of surrogate keys verges on ridiculous. It hasn't
  harmed the application, except it terms of complexity. It still works.
  It still performs fine. The SQL queries are awful looking. :-)
 Got an example?

Sure, but I have to be vague, because it's a company DB. :-)

The DB has a primary table, that maps the primary key to a surrogate
key. The surrogate key is used in several 1:1 and 1:N relationships.
Pretty straight forward. (I tend to put the primary key in the most
primary table that would have a 1:1 relationship, of which there is
one in this database - but whatever) The primary key is an identifier
used for all inputs and outputs to the application. It is used by
manual and automatic processes internal and external to the company.
Definately a primary key / surrogate key scenario.

The problem here, is that the primary key *is* a natural key. It
is generated to be unique, and it is immutable. There are no interfaces
provided to allow the rename of the key. It is a short character
string of 5 to 20 characters.

All queries to the table are joined with this primary key/surrogate
key table, to allow lookup by the primary key, for records only
identified by the surrogate key.

The database is only likely to have a few thousands records, with
the 1:N relationships not exceeding 5 or 10, and not recursive.
For performance, or disk space, it doesn't really matter which way
they went.

The confusion, though, of joining using a surrogate, that is
intended to be opaque (the value is never queried), ensures that
the program has no simple queries. All queries involve at least
one join.

I said almost ridiculous. It's not enough for me to complain, and
request a re-design. I don't really care what it does, as long as
it accepts my data, and allows me to query my data. But, it does
seem silly to me.

  That's where I would tend to draw the line. For me, I find
  implementation and maintenance to be the most expensive part of my
  applications. My data hasn't yet become large enough to make disk
  space, compute resources, or I/O bandwidth a serious concern.
 Which is exactly what my thought process is. If you mix surrogate and
 non-surrogate keys, how do you know which table has which? Sure, while
 you're actively writing the code it's not an issue, but what about 6
 months later? What about if someone else picks up the code?

It's usually pretty obvious, looking at a database diagram. You look
up the primary key, and see that it only shows up in one table. :-)

 I know Josh was poking fun with his comment about me being lazy, but
 lazy can make for better code. I can go back to code I wrote 3 years ago
 and I know that 99% of tables will have something_id (where something is
 almost certain to be the table name) as a surrogate key to join on;
 there's no need for me to go and figure out what does and what doesn't
 have a surrogate key. The 1% that don't fall into that generally aren't
 an issue because they're normally very large tables that nothing joins
 to.

I don't disagree with you. I just don't mind deciding to use a surrogate
key if I'm unsure, and not using a surrogate if it seems more effort than
gain.

 There's actually an article floating around somewhere about how lazy
 coders are good coders... :)

Dunno where it started, but that's one of the tenets of the developers
of Perl. Of course, with Perl 6, they admitted to having made quite a
few deisgn errors with Perl 5 and earlier... :-)

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Martijn van Oosterhout
On Thu, Jan 19, 2006 at 10:09:26AM -0800, Josh Berkus wrote:
 Martjin,
 
 In any of these either misspellings, changes of names, ownership or
 even structure over time render the obvious useless as keys. There are
 techniques for detecting and reducing duplication but the point is that
 for any of these duplicates *can* be valid data.
 
 Please point me out where, in the writings of E.F. Codd or in the SQL 
 Standard, it says that keys have to be immutable for the life of the row.

Possibly nowhere. But when you send invoices to customers, any details
on there *are* immutable. Sure, in your database you don't care if
things change, but then they don't match reality anymore do they?

 Duplicate *values* can be valid data.  Duplicate *tuples* show some 
 serious flaws in your database design.  If you have a personnel 
 directory on which you've not bothered to define any unique constraints 
 other than the ID column, then you can't match your data to reality.  If 
 you have two rows with the same first and last name, you don't know if 
 they are two different people or the same person, duplicated.  Which 
 will be a big problem come paycheck time.

I never said there were duplicate tuples, just that the data has no
natural keys. The tuples are unique because there's a surrogate key. It
is entirely possible to have two people with the same first name, last
name and date of birth. Rather uncommon, but the database must be able
to support it.

I don't understand your example though. If you have a personnel
directory with two rows with the same first and last name, what does
that tell you. Nothing. You have to go find out whether there really
are two of those people or not. You can simplify the process by taking
into account the fact that it's very unlikely, but a unique constraint
is not the answer. Besides, it's far more likely the same person will
appear twice with two different spellings of their name. :)

Anyway, the discussion was about surrogate vs natural keys. Nothing
here has convinced me that there are any useful natural keys to be
found in the examples I gave. Most of the examples I gave come from a
system I had to maintain where some designer had assumed there was some
kind of natural key and in *each* and *every* case it caused
problems...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Christopher Kings-Lynne

Yes. Representation of the DNA is probably best. But - that's a lot of
data to use as a key in multiple tables. :-)


No then you have problems with identical twins :)

Chris


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-19 Thread Michael Glaesemann


On Jan 20, 2006, at 10:50 , Christopher Kings-Lynne wrote:

Yes. Representation of the DNA is probably best. But - that's a  
lot of

data to use as a key in multiple tables. :-)


No then you have problems with identical twins :)


And, looking forward, clones.

Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Leandro Guimarães Faria Corcete DUTRA
Jim C. Nasby jnasby at pervasive.com writes:

 a) the optimizer does a really poor job on multi-column index statistics

So it should be fixed?

And there are a *lot* of singular, natural keys.


 b) If each parent record will have many children, the space savings from
 using a surrogate key can be quite large

Not such a common case.


 c) depending on how you view things, putting actual keys all over the
 place is denormalized

How come?  Never!


 Generally, I just use surrogate keys for everything unless performance
 dictates something else.

What I am proposing is the reverse: use natural keys for everything unless 
performance dictates something else.

In support of my PoV: 
http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Leandro Guimarães Faria Corcete DUTRA
Greg Stark gsstark at mit.edu writes:

 I hate knee-jerk reactions too, but just think of all the pain of people
 dealing with databases where they used Social Security numbers for primary
 keys. I would never use an attribute that represents some real-world datum as
 a primary key any more.

I am not familiar with the situation.


 In my experience there are very few occasions where I want a real non-sequence
 generated primary key. I've never regretted having a sequence generated
 primary key, and I've certainly had occasions to regret not having one.

http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Martijn van Oosterhout
On Wed, Jan 18, 2006 at 01:08:53PM +, Leandro Guimarães Faria Corcete DUTRA 
wrote:
 Jim C. Nasby jnasby at pervasive.com writes:
  Generally, I just use surrogate keys for everything unless performance
  dictates something else.
 
 What I am proposing is the reverse: use natural keys for everything unless 
 performance dictates something else.
 
 In support of my PoV: 
 http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1

Interesting. However, in my experience very few things have natural
keys. There are no combination of attributes for people, phone calls
or even real events that make useful natural keys.

You don't say what the primary key on your events table was but I can
see one possibility:

(place,datetime)   

A unique on this won't prevent overlapping events. Sure, it'll get rid
of the obvious duplicates but won't solve the problem. It also fails
the criteria that keys stable, since you can move events. You do need a
constraint on that table, but a unique constraint isn't it.

While I agree with your statement that it's the abuse of these keys
thats the problem, I find people are far too likely to see natural keys
where none exist.

BTW, the way I deal with people mixing up surrogate keys is by (usually
by chance) having the sequences for different tables start at wildly
different points. By starting one counter at a million and the other at
one, the chances that you'll be able to mix them up is reduced. On some
systems I can even identify the table a key comes from by looking at the
number, just because I know only one table has keys in the 30,000
range.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Michael Glaesemann


On Jan 18, 2006, at 22:08 , Leandro Guimarães Faria Corcete DUTRA wrote:


Jim C. Nasby jnasby at pervasive.com writes:

a) the optimizer does a really poor job on multi-column index  
statistics


So it should be fixed?


Of course! Patches welcome!

Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Greg Stark
Leandro Guimarães Faria Corcete DUTRA [EMAIL PROTECTED] writes:

 Greg Stark gsstark at mit.edu writes:
 
  I hate knee-jerk reactions too, but just think of all the pain of people
  dealing with databases where they used Social Security numbers for primary
  keys. I would never use an attribute that represents some real-world datum 
  as
  a primary key any more.
 
 I am not familiar with the situation.

The US gov't handed out unique numbers to every worker for their old age
pension program. Many early database designers thought that made a wonderful
natural primary key.

It turns out that:

a) not everyone has a social insurance number: when their business expanded to
include foreign nationals these databases had to make up fake social insurance
numbers.

b) Occasionally people's social insurance numbers change, either because they
got it wrong in the first place or because of identity theft later on. Even
dealing with it changing isn't good enough because the old records don't
disappear; the person essentially has *two* social insurance numbers.

c) For security reasons it turns out to be a bad idea to be passing around
social insurance numbers in the first place. So these database designers had a
major problem adapting when people started refusing to give them social
insurance numbers or complaining when their application leaked their social
insurance number.

In short, what seemed like the clearest possible example of a natural primary
key became a great example of how hard it is to deal with changing business
requirements when you've tied your database design to the old rules. Using
natural primary keys makes an iron-clad design assumption that the business
rules surrounding that datum will never change. And the one thing constant in
business is that business rules change.

In the past I've used username as a primary key for a users table, what
could be safer? 

Later we had to create a sequence generated userid column because some data
partners couldn't handle an text column without corrupting it. And of course
one day the question arose whether we could handle someone wanting to change
their username. Then another day we were asked whether we could have two
different people with the same username if they belonged to separate branded
subsites.



-- 
greg


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Jim C. Nasby
On Wed, Jan 18, 2006 at 01:08:53PM +, Leandro Guimar??es Faria Corcete 
DUTRA wrote:
  b) If each parent record will have many children, the space savings from
  using a surrogate key can be quite large
 
 Not such a common case.
 
Hmmm...

Many blog entries per user... Many blog comments per entry

Many PO's per customer... many line items per PO...
 
Etc., etc. I would argue that one-many relationships are far more common
than one-one, and it's very common for an integer ID to be a more
compact representation than a real key.

  c) depending on how you view things, putting actual keys all over the
  place is denormalized
 
 How come?  Never!
 
Huh?

One of the tenants of normalization is that you don't repeat data. You
don't use customer name in your PO table, because it's asking for
problems; what if a customer changes names (as just one example).

  Generally, I just use surrogate keys for everything unless performance
  dictates something else.
 
 What I am proposing is the reverse: use natural keys for everything unless 
 performance dictates something else.
 
 In support of my PoV: 
 http://blogs.ittoolbox.com/database/soup/archives/007327.asp?rss=1

Read the bottom of it:

I am not saying that you should avoid autonumber surrogate keys like an
SCO executive. The danger is not in their use but in their abuse. The
events_id column in the events table didn't give us any trouble
until we began to rely on it as the sole key for the table. The
accounting application gave us problems because we were using the ID as
the entire handle for the records. That crossed the line from use to
misuse, and we suffered for it.

To paraphrase, the issue isn't that surrogate keys were used for RI; the
issue is that proper keys were not setup to begin with. Does it make
sense to have a customer table where customer_name isn't unique? Almost
certainly not. But that's just one possible constraint you might put on
that table. To put words in Josh's mouth, the issue isn't with using a
surrogate key, it's with not thinking about what constraints you should
be placing on your data.

Take a look at cbk's comment; he does a great job of summing the issue
up.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


FW: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Jim Nasby
Ooops, fat-finger'd -hackers...

-Original Message-
Adding -hackers back to the list.

 From: Leandro Guimarães Faria Corcete Dutra
 Em Seg, 2006-01-16 às 12:52 -0600, Jim C. Nasby escreveu:
  On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote:
   
   For UPDATEs and INSERTs, the proper primary key also 
 needs to be  
   checked, but keys are used for more than just checking 
 uniqueness:  
   they're also often used in JOINs. Joining against a 
 single integer  
   I'd think it quite a different proposition (I'd think 
 faster in terms  
   of performance) than joining against, say, a text column or a  
   composite key.
 
   How different is that?

Comparing two ints is much, much faster than comparing two text fields. For a 
small number of comparisons, it doesn't matter. When you're joining tables 
together, it's a different story.

  a) the optimizer does a really poor job on multi-column 
 index statistics
 
   Then it should eventually be fixed?

It's on the to-do, but it's not an easy nut to crack.

  b) If each parent record will have many children, the space 
 savings from
  using a surrogate key can be quite large
 
   Only where the surrogate is significantly smaller than 
 the natural?

#define significant

Here's a real-life example: the primary table for stats.distributed.net has 
about 120M rows. One field in that table (participant_id) links back to the 
participant table; it's an int. If instead we used participant_name and that 
averaged 8 characters in length, that would grow the main table by 1GB (8 chars 
takes 8 bytes instead of 4, plus there's the varlena header of 4 bytes). The 
machine that stats runs on has 4G of memory, so cutting 1G of wasted space out 
of that table helps quite a bit.

(In actuality, there isn't participant_name... participants are identified by 
email address (not a great idea, but I wasn't around when that was chosen). As 
you can imagine, email addresses are substantially longer than 4 bytes. When we 
normalized email out of that main table things got substantially faster. That 
was a number of years ago, so the table was probably 15-25% of it's current 
size, but it still made a huge difference.)

  c) depending on how you view things, putting actual keys 
 all over the
  place is denormalized
 
   How come?

See my other reply... :)

  Generally, I just use surrogate keys for everything unless 
 performance
  dictates something else.
 
   Shouldn't it be the other way round, for the user's sake?

Why should it? It's trivial to create views that abstract surrogate keys out, 
and if you really want to you can even make the views updatable. But here's two 
other things to consider:

In many cases you can't define a single field as a unique key. So you end up 
with having to add many extra keys to all your join clauses. Not very friendly, 
and prone to error.

Not every language has equal support for text comparisons (and in my 
experience, almost all real keys are mostly text).
--
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: FW: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Jim C. Nasby
 Comparing two ints is much, much faster than comparing two text
 fields. For a small number of comparisons, it doesn't matter. When
 you're joining tables together, it's a different story.

   That is where data independence would come handy... like a better enum,
with possreps and hidden implementation.

   Forgive me my ignorance, but are ints inherently faster to compare than
strings, or is it just an implementation detail?  Ideally, if this is so
a fully data-independent system would create a hash behind the back of
user in order to get performance.

The CPU can do an integer comparison with one instruction; it can't do
that with a text string. (Well, theoretically if the string was 3/4
bytes exactly (dependin on null termination) you could compare just as
fast, but I'm pretty certain that no compiler is that fancy.)

 Here's a real-life example: the primary table for
 stats.distributed.net has about 120M rows. One field in that table
 (participant_id) links back to the participant table; it's an int. If
 instead we used participant_name and that averaged 8 characters in
 length, that would grow the main table by 1GB (8 chars takes 8 bytes
 instead of 4, plus there's the varlena header of 4 bytes). The machine
 that stats runs on has 4G of memory, so cutting 1G of wasted space out
 of that table helps quite a bit.

   OK, hardly a typical example.  As I think I left clear, my problem is
not using surrogate keys, but using them by default, or even
exclusively.

No? It's certainly not uncommon to have tables with 100M+ rows. And keep
in mind that this applies to every row of every table that has foreign
keys. I'd bet it's actually common to save 1G or more with surrogate
keys in moderately sized databases.

Of course, you do have to be intelligent here, too. The only key defined
on the table in my example is participant_id, project_id, date; there is
no surrogate key because there's no real reason to have one.

 (In actuality, there isn't participant_name... participants are
 identified by email address (not a great idea, but I wasn't around
 when that was chosen). As you can imagine, email addresses are
 substantially longer than 4 bytes. When we normalized email out of
 that main table things got substantially faster. That was a number of
 years ago, so the table was probably 15-25% of it's current size, but
 it still made a huge difference.)

   This isn't normalisation at all, as far as I understand it.  It is just

I don't have the rules of normalization memorized enough to know what
form this breaks, but I'm 99% certain it breaks at least one of them.

Look at it this way: if someone wants to change their email address,
best case scenario is that you have cascading RI setup and it updates
thousands of rows in that table. Worst case scenario, you just de-linked
a whole bunch of data. But with a surrogate key, all you have to do is
update one row in one table and you're done.

that we don't have data independence... so you had to expose an
implementation detail?

Expose to what? The application? First, this is a pretty minor thing to
expose; second, if it's that big a concern you can completely hide it by
using a view.

But the reality is, dealing with a numeric ID can be a heck of a lot
easier than an email address. Look at URLs that embbed one versus the
other for a good example.

 Why should it? It's trivial to create views that abstract surrogate
 keys out, and if you really want to you can even make the views
 updatable. But here's two other things to consider:

   These views, in heavy querying environments, can be prohibitive.

Normalize 'til it hurts; denormalize 'til it works.

Yes, the added overhead of rules for updates/inserts/deletes could start
to add up in performance-critical code. But if performance is that
critical you're far more likely to run into other bottlenecks first. And
worst-case, you abstract behind a stored procedure that just has the
right queries hard-coded.

As for select-only views you'll have a hard time showing any meaningful
performance penalty.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Josh Berkus
Martjin,

 Interesting. However, in my experience very few things have natural
 keys. There are no combination of attributes for people, phone calls
 or even real events that make useful natural keys.

I certainly hope that I never have to pick up one of your projects.   A 
table without a natural key is a data management disaster.   Without a 
key, it's not data, it's garbage.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Dann Corbit
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Josh Berkus
 Sent: Wednesday, January 18, 2006 3:59 PM
 To: pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Surrogate keys (Was: enums)
 
 Martjin,
 
  Interesting. However, in my experience very few things have natural
  keys. There are no combination of attributes for people, phone
calls
  or even real events that make useful natural keys.
 
 I certainly hope that I never have to pick up one of your projects.
A
 table without a natural key is a data management disaster.   Without a
 key, it's not data, it's garbage.

I have a different opinion.

The data should absolutely never use a natural key as a primary key.

The data should use something like a sequence for the primary key.

Examples:
SSN -- believe it or not, SSN's sometimes change.

First, Middle, Last names -- Not really unique

Street Address -- More than one person can live there.  They can move.

Basically, every physical attribute or logical attribute is a terrible
choice for a primary key.  They won't cause problems very often, it's
true.  But when they do cause problems, it is a terrible doozie of a
problem.

Now, on the other hand, if we are talking about INDEXES here, that's a
horse of a different color.   Lots of natural attributes and
combinations of natural attributes make excellent candidates for keys.
Such things as SSN, names, addresses, phone numbers, etc.

Therefore, I am guessing the two posters upstream in this thread that I
am responding to were therefore talking about different subjects
altogether.

One was talking about using natural attributes for indexes, which is a
superior idea that I agree with.

The other was talking about never using natural attributes for keys,
which I also agree with.

Therefore, I am guessing that everyone is in complete agreement, but it
is a nomenclature thing.

Just a guess.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Dann Corbit
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Dann Corbit
 Sent: Wednesday, January 18, 2006 4:04 PM
 To: josh@agliodbs.com; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Surrogate keys (Was: enums)
 
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
  [EMAIL PROTECTED] On Behalf Of Josh Berkus
  Sent: Wednesday, January 18, 2006 3:59 PM
  To: pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] Surrogate keys (Was: enums)
 
  Martjin,
 
   Interesting. However, in my experience very few things have
natural
   keys. There are no combination of attributes for people, phone
 calls
   or even real events that make useful natural keys.
 
  I certainly hope that I never have to pick up one of your projects.
 A
  table without a natural key is a data management disaster.   Without
a
  key, it's not data, it's garbage.
 
 I have a different opinion.
 
 The data should absolutely never use a natural key as a primary key.
 
 The data should use something like a sequence for the primary key.
 
 Examples:
 SSN -- believe it or not, SSN's sometimes change.
 
 First, Middle, Last names -- Not really unique
 
 Street Address -- More than one person can live there.  They can
move.
 
 Basically, every physical attribute or logical attribute is a terrible
 choice for a primary key.  They won't cause problems very often, it's
 true.  But when they do cause problems, it is a terrible doozie of a
 problem.
 
 Now, on the other hand, if we are talking about INDEXES here, that's a
 horse of a different color.   Lots of natural attributes and
 combinations of natural attributes make excellent candidates for keys.

Make that: combinations of natural attributes make excellent candidates
for indexes.
See. I even messed it up, when I was trying to highlight the
distinction.
Of course, we can probably just chalk that up to dumb as a box of
hammers.

 Such things as SSN, names, addresses, phone numbers, etc.
 
 Therefore, I am guessing the two posters upstream in this thread that
I
 am responding to were therefore talking about different subjects
 altogether.
 
 One was talking about using natural attributes for indexes, which is a
 superior idea that I agree with.
 
 The other was talking about never using natural attributes for keys,
 which I also agree with.
 
 Therefore, I am guessing that everyone is in complete agreement, but
it
 is a nomenclature thing.
 
 Just a guess.
 
 ---(end of
broadcast)---
 TIP 6: explain analyze is your friend

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: FW: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Leandro Guimarães Faria Corcete Dutra
Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu:
 
  Forgive me my ignorance, but are ints inherently faster to compare than
 strings, or is it just an implementation detail?  Ideally, if this is so
 a fully data-independent system would create a hash behind the back of
 user in order to get performance.
 
 The CPU can do an integer comparison with one instruction; it can't do
 that with a text string.

OK.  Again, data independence should be the goal here.


  OK, hardly a typical example.  As I think I left clear, my problem is
 not using surrogate keys, but using them by default, or even
 exclusively.
 
 No? It's certainly not uncommon to have tables with 100M+ rows.

No, but neither are they *that* common.

Certainly, lots of database have a few of them.  But then, they have
dozens, hundreds, thousands of much smaller tables.


 And keep
 in mind that this applies to every row of every table that has foreign
 keys. I'd bet it's actually common to save 1G or more with surrogate
 keys in moderately sized databases.

Only if you have quite some children, because otherwise, in the main
tables, the surrogate keys add a field, an index and a sequence to an
otherwise smaller table and index.


 Of course, you do have to be intelligent here, too. The only key defined
 on the table in my example is participant_id, project_id, date; there is
 no surrogate key because there's no real reason to have one.

Quite.


  (In actuality, there isn't participant_name... participants are
  identified by email address (not a great idea, but I wasn't around
  when that was chosen). As you can imagine, email addresses are
  substantially longer than 4 bytes. When we normalized email out of
  that main table things got substantially faster. That was a number of
  years ago, so the table was probably 15-25% of it's current size, but
  it still made a huge difference.)
 
  This isn't normalisation at all, as far as I understand it.  It is just
 
 I don't have the rules of normalization memorized enough to know what
 form this breaks, but I'm 99% certain it breaks at least one of them.

No, never.

Normalisation is about eliminating redundancy and, therefore, update
anomalies.  Making all the table dependent on only the keys and the
whole keys, by projecting relations to eliminate entity mixups.

What you mention is actually exposing an implementation detail, namely
an integer that serves as a hash of the key.


 Look at it this way: if someone wants to change their email address,
 best case scenario is that you have cascading RI setup and it updates
 thousands of rows in that table. Worst case scenario, you just de-linked
 a whole bunch of data. But with a surrogate key, all you have to do is
 update one row in one table and you're done.

OK, if you have lots of linked data.  But most tables are really dead
ends.


 that we don't have data independence... so you had to expose an
 implementation detail?
 
 Expose to what? The application? First, this is a pretty minor thing to
 expose; second, if it's that big a concern you can completely hide it by
 using a view.

As someone said, you end up with ids everywhere, and no
user-understandable data at all...


 But the reality is, dealing with a numeric ID can be a heck of a lot
 easier than an email address. Look at URLs that embbed one versus the
 other for a good example.

Again, implementation details... levels mixup.


  Why should it? It's trivial to create views that abstract surrogate
  keys out, and if you really want to you can even make the views
  updatable. But here's two other things to consider:
 
  These views, in heavy querying environments, can be prohibitive.
 
 Normalize 'til it hurts; denormalize 'til it works.

Lack of data implementation biting us again.


 Yes, the added overhead of rules for updates/inserts/deletes could start
 to add up in performance-critical code. But if performance is that
 critical you're far more likely to run into other bottlenecks first. And
 worst-case, you abstract behind a stored procedure that just has the
 right queries hard-coded.
 
 As for select-only views you'll have a hard time showing any meaningful
 performance penalty.

Yet real user-defined data types could make it all much simpler.

-- 
+55 (11) 5685 2219   xmpp:[EMAIL PROTECTED]
+55 (11) 9406 7191   Yahoo!: lgcdutra
+55 (11) 5686 9607 MSN: [EMAIL PROTECTED]
+55 (11) 4390 5383  ICQ/AIM: 61287803


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Michael Glaesemann


On Jan 19, 2006, at 9:31 , Leandro Guimarães Faria Corcete Dutra wrote:


OK.  Again, data independence should be the goal here.


snip /


Again, implementation details... levels mixup.


snip /


Lack of data implementation biting us again.


snip /


Yet real user-defined data types could make it all much simpler.


Again, again, and again, patches welcome! PostgreSQL is an open- 
source project, and people contribute in a variety of ways, two of  
which include submitting code and sponsoring others to develop code.  
If you look at the todo list, there are *lots* of things people would  
like to see improved in PostgreSQL, but the pace at which PostgreSQL  
is improved and what is improved is driven in large part by what  
people are willing to do themselves or sponsor. If these are things  
you're interested in (and it certainly appears you are), why not  
contribute?


Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: FW: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Dann Corbit
Maybe it goes better into Advocacy or something, but I have found a quote by 
database big-wigs that I strongly disagree with:

From:
http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf

We have this.
PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by 
the DBMS only if a user-defined primary key is not available.
Second generation systems support the notion of a primary key, which is a 
user-assigned unique identifier.  If a primary key exists for a collection that 
is known never to change, for example social security number, student 
registration number, or employee number, then no additional system-assigned UID 
is required. An immutable primary key has an extra advantage over a 
system-assigned unique identifier because it has a natural, human readable 
meaning. Consequently, in data interchange or debugging this may be an 
advantage.  If no primary key is available for a collection, then it is 
imperative that a system-assigned UID be provided. Because SQL supports update 
through a cursor, second generation systems must be able to update the last 
record retrieved, and this is only possible if it can be uniquely identified. 
If no primary key serves this purpose, the system must include an extra UID. 
Therefore, several second generation systems already obey this proposition. 
Moreover, as will be noted in Proposition 2.3, some collections, e.g. views, do 
not necessarily have system assigned UIDs, so building a system that requires 
them is likely to be proven undesirable. We close our discussion on Tenet 1 
with a final proposition that deals with the notion of rules.

This is a bad idea.

Let's take the example of a Social Security Number.

Not everyone has one:
http://www.ssa.gov/pubs/10002.html#how2

If people do have one, they can definitely change it.  If someone has stolen a 
SSN, then the wronged party is able to get their SSN changed:
http://101-identitytheft.com/ssn.htm
The odds of this happening are low, but if you cannot handle it, then the 
damage caused is considerable.
Now what happens if you want to have customers outside of the USA? {Don't 
worry, we'll never go global...}  I hope that my objections are very plain and 
obvious.

The primary key should be immutable, meaning that its value should not be 
changed during the course of normal operations of the database.  What natural 
key is immutable?  The answer is that such an attribute does not exist.  To use 
them for such a purpose is begging for trouble.

I saw the argument that there is a great volume of space wasted by adding a 
column that does not naturally occur in the data.  That argument is simply 
absurd.  Consider a database with 10 billion rows of data in it.  Each of those 
tables gets an 8 byte primary key added for every row, resulting in 80 GB 
consumed.  The cost of 80 GB is perhaps $200.  With a database that large 
(where the extra space consumed by an artificial key column has a cost that can 
easily be measured) the odds of a problem arising due to a natural column 
changing its value are huge.  The cost of such a tragedy is certainly more than 
the $200 pittance!

If there is an argument that we also have the parent key values propagated into 
the child tables as foreign keys, that argument has no merit.  The other 
attribute that would have been chosen would also be propagated.  And so (for 
instance) there is no savings to propagating a SSN field into child tables 
verses propagating an 8 byte integer.

I also saw an argument that the propagated ID values are confusing to 
end-users.  That is the fault of the database designer who game them a stupid 
name.  If they were things like InvoiceID and LineItemID then there will not be 
the same sort of confusion.  The meaning and purpose of the column is 
immediately apparent.  As an alternative, the ubiquitous OID name for a column 
on a table is also very transparent.  Of course, when it is used in a foreign 
key, it must be given a role name to avoid confusion in that case.

At any rate, the use of natural keys is a mistake made by people who have never 
had to deal with very large database systems.

IMO-YMMV.


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Leandro Guimarães Faria Corcete Dutra
 Sent: Wednesday, January 18, 2006 4:31 PM
 To: Jim C. Nasby
 Cc: pgsql-hackers@postgresql.org
 Subject: Re: FW: [HACKERS] Surrogate keys (Was: enums)
 
 Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu:
  
 Forgive me my ignorance, but are ints inherently faster to compare
 than
  strings, or is it just an implementation detail?  Ideally, if this is
 so
  a fully data-independent system would create a hash behind the back of
  user in order to get performance.
 
  The CPU can do an integer comparison with one instruction; it can't do
  that with a text string.
 
   OK.  Again, data independence should be the goal here.
 
 
 OK, hardly a typical example.  As I think

Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Michael Glaesemann


On Jan 19, 2006, at 10:34 , Dann Corbit wrote:


http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf


PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be  
assigned by the DBMS only if a user-defined primary key is not  
available.


snip /

An immutable primary key has an extra advantage over a system- 
assigned unique identifier because it has a natural, human readable  
meaning. Consequently, in data interchange or debugging this may be  
an advantage.  If no primary key is available for a collection,  
then it is imperative that a system-assigned UID be provided.


snip /

Dann Corbit:

The primary key should be immutable, meaning that its value should  
not be changed during the course of normal operations of the  
database.  What natural key is immutable?  The answer is that such  
an attribute does not exist.  To use them for such a purpose is  
begging for trouble.


As far as I can tell, the only difference between your position,  
Dann, and Date and Darwen's, is that you think no natural key is  
immutable. If you *could* find an immutable natural key, would it be  
an acceptable key for you? Date and Darwen say explicitly that if no  
immutable (natural) (primary) key is available a system-assigned UID  
is required. If you think there is no immutable natural key  
available, Darwen and Date would agree that you should use a system- 
generated key. Or do you think I'm misreading you or The Third  
Manifesto?


Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Dann Corbit
 -Original Message-
 From: Michael Glaesemann [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, January 18, 2006 5:48 PM
 To: Dann Corbit
 Cc: Leandro Guimarães Faria Corcete Dutra; Jim C. Nasby; pgsql-
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] Surrogate keys (Was: enums)
 
 
 On Jan 19, 2006, at 10:34 , Dann Corbit wrote:
 
  http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf
 
  PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be
  assigned by the DBMS only if a user-defined primary key is not
  available.
 
 snip /
 
  An immutable primary key has an extra advantage over a system-
  assigned unique identifier because it has a natural, human readable
  meaning. Consequently, in data interchange or debugging this may be
  an advantage.  If no primary key is available for a collection,
  then it is imperative that a system-assigned UID be provided.
 
 snip /
 
 Dann Corbit:
 
  The primary key should be immutable, meaning that its value should
  not be changed during the course of normal operations of the
  database.  What natural key is immutable?  The answer is that such
  an attribute does not exist.  To use them for such a purpose is
  begging for trouble.
 
 As far as I can tell, the only difference between your position,
 Dann, and Date and Darwen's, is that you think no natural key is
 immutable. If you *could* find an immutable natural key, would it be
 an acceptable key for you? Date and Darwen say explicitly that if no
 immutable (natural) (primary) key is available a system-assigned UID
 is required. If you think there is no immutable natural key
 available, Darwen and Date would agree that you should use a system-
 generated key. Or do you think I'm misreading you or The Third
 Manifesto?

If you could find an immutable natural key, it would be the *BEST* thing to 
use.  Unfortunately, I believe that immutable natural keys are rarer than horse 
feathers and pickle smoke.  Furthermore, because of statements like the one 
that I collected and pasted from the above document, I believe that people will 
choose totally inappropriate things (I have seen it many times and had to deal 
with the repercussions) to use as natural keys (e.g. SSN) and cause enormous 
damage through those choices.

But I suppose on a sort of mathematical level the statement is fully true.


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: FW: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Josh Berkus
Dann,

 The primary key should be immutable, meaning that its value should not be
 changed during the course of normal operations of the database.

Why?   I don't find this statement to be self-evident.   Why would we have ON 
UPDATE CASCADE if keys didn't change sometimes?

 At any rate, the use of natural keys is a mistake made by people who have
 never had to deal with very large database systems.

Oh, I guess I'm dumb then.  The biggest database system I ever had to deal 
with was merely 5 TB ...

Anyway, my opinion on this, in detail, will be on the ITToolBox blog.  You can 
argue with me there.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Tom Lane
Michael Glaesemann [EMAIL PROTECTED] writes:
 As far as I can tell, the only difference between your position,  
 Dann, and Date and Darwen's, is that you think no natural key is  
 immutable.

DD's examples of natural keys are worth a second look though:

 If a primary key exists for a collection that is known never to change,
 for example social security number, student registration number, or
 employee number, then no additional system-assigned UID is required.

The problem with SSN is that somebody other than you controls it.
If you are the college registrar, then you control the student's
registration number, and you don't have to change it.  In fact, guess
what: you probably generated it in the same way as a surrogate key.

I'd argue that all of these are in reality the exact same thing as
a surrogate key --- from the point of view of the issuing authority.
But from anyone else's point of view, they are external data and you
can't hang your own database design on the assumption that they won't
change.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Greg Stark

  If a primary key exists for a collection that is known never to change,
  for example social security number, student registration number, or
  employee number, then no additional system-assigned UID is required.

In point of fact Social security numbers *can* change.

-- 
greg


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-18 Thread Martijn van Oosterhout
On Wed, Jan 18, 2006 at 03:58:50PM -0800, Josh Berkus wrote:
 Martjin,
 
  Interesting. However, in my experience very few things have natural
  keys. There are no combination of attributes for people, phone calls
  or even real events that make useful natural keys.
 
 I certainly hope that I never have to pick up one of your projects.   A 
 table without a natural key is a data management disaster.   Without a 
 key, it's not data, it's garbage.

???

Please provides natural keys for any of the following:

- A Person
- A phone call: (from,to,date,time,duration) is not enough
- A physical address
- A phone line: (phone numbers arn't unique over time)
- An internet account: (usernames not unique over time either)

In any of these either misspellings, changes of names, ownership or
even structure over time render the obvious useless as keys. There are
techniques for detecting and reducing duplication but the point is that
for any of these duplicates *can* be valid data.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-16 Thread Jim C. Nasby
On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote:
 
 On Jan 13, 2006, at 21:42 , Leandro Guimar?es Faria Corcete DUTRA wrote:
 
 If you still declare the natural key(s) as UNIQUEs, you have just made
 performance worse.  Now there are two keys to be checked on UPDATEs  
 and
 INSERTs, two indexes to be updated, and probably a SEQUENCE too.
 
 For UPDATEs and INSERTs, the proper primary key also needs to be  
 checked, but keys are used for more than just checking uniqueness:  
 they're also often used in JOINs. Joining against a single integer  
 I'd think it quite a different proposition (I'd think faster in terms  
 of performance) than joining against, say, a text column or a  
 composite key.

a) the optimizer does a really poor job on multi-column index statistics
b) If each parent record will have many children, the space savings from
using a surrogate key can be quite large
c) depending on how you view things, putting actual keys all over the
place is denormalized

Generally, I just use surrogate keys for everything unless performance
dictates something else.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread Leandro Guimarães Faria Corcete DUTRA
Rod Taylor pg at rbt.ca writes:

 The basic idea is that most of us break out schemas by creating fake
 primary keys for the purpose of obtaining performance because using the
 proper primary key (single or multiple columns) is often very slow.

This is one thing I simply can't understand.

If you still declare the natural key(s) as UNIQUEs, you have just made 
performance worse.  Now there are two keys to be checked on UPDATEs and 
INSERTs, two indexes to be updated, and probably a SEQUENCE too.

If you don't, you have just thrown away centralised, optimised integrity 
checking, and will probably have to remember to do a slower SELECT before 
updating.

Certainly decoupling presentation from storage would be nice, but even before 
that generalised use of surrogate keys seems to me a knee-jerk reaction.


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread Michael Glaesemann


On Jan 13, 2006, at 21:42 , Leandro Guimarães Faria Corcete DUTRA wrote:


If you still declare the natural key(s) as UNIQUEs, you have just made
performance worse.  Now there are two keys to be checked on UPDATEs  
and

INSERTs, two indexes to be updated, and probably a SEQUENCE too.


For UPDATEs and INSERTs, the proper primary key also needs to be  
checked, but keys are used for more than just checking uniqueness:  
they're also often used in JOINs. Joining against a single integer  
I'd think it quite a different proposition (I'd think faster in terms  
of performance) than joining against, say, a text column or a  
composite key.


Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread Lukas Smith

Michael Glaesemann wrote:


On Jan 13, 2006, at 21:42 , Leandro Guimarães Faria Corcete DUTRA wrote:


If you still declare the natural key(s) as UNIQUEs, you have just made
performance worse.  Now there are two keys to be checked on UPDATEs and
INSERTs, two indexes to be updated, and probably a SEQUENCE too.


For UPDATEs and INSERTs, the proper primary key also needs to be 
checked, but keys are used for more than just checking uniqueness: 
they're also often used in JOINs. Joining against a single integer I'd 
think it quite a different proposition (I'd think faster in terms of 
performance) than joining against, say, a text column or a composite key.


Well this is a balancing decision. You certainly slow down inserts. You 
might also increase the stress on the table because you have to 
translate between the different keys. It also depends on the join type 
you end up doing. It also obviously depends on how large your original 
primary key is. However whatever your situation is: make sure you do not 
end up doing premature optimization.


regards,
Lukas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread Rod Taylor
On Fri, 2006-01-13 at 12:42 +, Leandro Guimarães Faria Corcete DUTRA
wrote:
 Rod Taylor pg at rbt.ca writes:
 
  The basic idea is that most of us break out schemas by creating fake
  primary keys for the purpose of obtaining performance because using the
  proper primary key (single or multiple columns) is often very slow.
 
 This is one thing I simply can't understand.
 
 If you still declare the natural key(s) as UNIQUEs, you have just made 
 performance worse.  Now there are two keys to be checked on UPDATEs and 
 INSERTs, two indexes to be updated, and probably a SEQUENCE too.

Indeed. Using a surrogate key is not free and that is why it would be
something the DBA would specify during table creation.

The main goal would be to give the option of using a surrogate key
without being forced to expose it to the applications using the
database. It is a feature akin to table spaces in that it can help
performance but without the application or standard users knowing why.

-- 


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread Lukas Smith

Rod Taylor wrote:

If you still declare the natural key(s) as UNIQUEs, you have just made 
performance worse.  Now there are two keys to be checked on UPDATEs and 
INSERTs, two indexes to be updated, and probably a SEQUENCE too.


Indeed. Using a surrogate key is not free and that is why it would be
something the DBA would specify during table creation.

The main goal would be to give the option of using a surrogate key
without being forced to expose it to the applications using the
database. It is a feature akin to table spaces in that it can help
performance but without the application or standard users knowing why.


Just this morning my father came to me (he is getting into SQL now that 
he is retired) with an issue where a surrogate key probably makes sense. 
He is storing a tree of plant families that can get fairly deep. The 
primary key is the scientific name. In order to improve performance and 
get rid of the recursive lookups he currently does he now wants to use 
materialized paths (parent name/sub name/sub sub name). He decided 
not to go with nested paths since that makes it very hard to hand fix 
things in the tree structure. Obviously using the scientific name in the 
materialized paths can quickly give you a really wide column if you have 
a fairly deep tree. In that case it could be beneficial to introduce a 
surrogate key.


The only annoying bit is that he frequently needs to sync with an 
external database where they use no surrogate key so the import slows 
down because he needs to check if a surrogate key has been introduced 
for every given scientific name before writing to the database.


regards,
Lukas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread mark
On Fri, Jan 13, 2006 at 12:42:55PM +, Leandro Guimarães Faria Corcete DUTRA 
wrote:
 Rod Taylor pg at rbt.ca writes:
  The basic idea is that most of us break out schemas by creating fake
  primary keys for the purpose of obtaining performance because using the
  proper primary key (single or multiple columns) is often very slow.
 This is one thing I simply can't understand.

 If you still declare the natural key(s) as UNIQUEs, you have just made 
 performance worse.  Now there are two keys to be checked on UPDATEs and 
 INSERTs, two indexes to be updated, and probably a SEQUENCE too.

Not to completely defend the practice - but in some applications,
INSERT is much less frequent than UPDATE, and that UPDATE requires a
unique check on the primary key and the surrogate key, as well as an
update, should be considered (and I believe is considered) a
PostgreSQL performance bug. It's undesirable and unnecessary behaviour
for the majority of uses (where they key does not change as a part of
the update).

 Certainly decoupling presentation from storage would be nice, but even before
 that generalised use of surrogate keys seems to me a knee-jerk reaction.

Yes, I agree. As per a previous thread, I'm one of those using it to
generalize my query / update implementation into common base code. I
have other reasons - but I confess to this being the real reason.

In my case, the cost of maintaining the code that queries / updates is
more expensive than the cost of having an extra unique index, and the
storage and performance impacts this has on my data. :-)

Is my primary reason good on its own, without the other more legitimate
justifications? It's good enough for me. I expect others to strongly
disagree.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread Martijn van Oosterhout
On Sat, Jan 14, 2006 at 11:06:07AM -0500, [EMAIL PROTECTED] wrote:
 Not to completely defend the practice - but in some applications,
 INSERT is much less frequent than UPDATE, and that UPDATE requires a
 unique check on the primary key and the surrogate key, as well as an
 update, should be considered (and I believe is considered) a
 PostgreSQL performance bug. It's undesirable and unnecessary behaviour
 for the majority of uses (where they key does not change as a part of
 the update).

Unique check? An index is an index and when you do an UPDATE the new
tuple has to be added to the index. At this point it doesn't matter if
the index is unique or not, all indexes cost something.

Since after the UPDATE the tuple with that primary key appears two (or
more) times in the table, a check needs to be made that they don't
overlap timewise. Are you claiming you could avoid this check and still
guarentee correctness in the face of concurrent transactions?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread mark
On Sat, Jan 14, 2006 at 09:33:39PM +0100, Martijn van Oosterhout wrote:
 On Sat, Jan 14, 2006 at 11:06:07AM -0500, [EMAIL PROTECTED] wrote:
  Not to completely defend the practice - but in some applications,
  INSERT is much less frequent than UPDATE, and that UPDATE requires a
  unique check on the primary key and the surrogate key, as well as an
  update, should be considered (and I believe is considered) a
  PostgreSQL performance bug. It's undesirable and unnecessary behaviour
  for the majority of uses (where they key does not change as a part of
  the update).
 Unique check? An index is an index and when you do an UPDATE the new
 tuple has to be added to the index. At this point it doesn't matter if
 the index is unique or not, all indexes cost something.

 Since after the UPDATE the tuple with that primary key appears two (or
 more) times in the table, a check needs to be made that they don't
 overlap timewise. Are you claiming you could avoid this check and still
 guarentee correctness in the face of concurrent transactions?

I'm claiming that I agree with this TODO item:

- Prevent index uniqueness checks when UPDATE does not modify the column

  Uniqueness (index) checks are done when updating a column
  even if the column is not modified by the UPDATE.

Definately, the check is unnecessary. If it was unique before we made
the change, we know it will be unique after we've made the change.
The check shouldn't be performed for the primary key, or for the
surrogate key, if neither of these keys are modified in any way.

Perhaps you are challenging my addition of the phrase as well as an
update, with a hint on my part, that I feel the update is unnecessary
as well. I may have been wrong to add these 5 words. The MVCC
implementation has numerous costs, and perhaps this is one of them
that cannot be avoided. :-(

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Surrogate keys (Was: enums)

2006-01-14 Thread Greg Stark

Leandro Guimarães Faria Corcete DUTRA [EMAIL PROTECTED] writes:

 Certainly decoupling presentation from storage would be nice, but even before 
 that generalised use of surrogate keys seems to me a knee-jerk reaction.

I hate knee-jerk reactions too, but just think of all the pain of people
dealing with databases where they used Social Security numbers for primary
keys. I would never use an attribute that represents some real-world datum as
a primary key any more.

In my experience there are very few occasions where I want a real non-sequence
generated primary key. I've never regretted having a sequence generated
primary key, and I've certainly had occasions to regret not having one.

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org