Re: [PERFORM] Data type to use for primary key

2004-11-24 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  This could hurt if they ever reuse an old previously retired postal code,
  which isn't an entirely impossible case. As far as I know it hasn't happened
  yet though.
 
 One would suppose that the guys who are in charge of this point at
 Canada Post consider the postal code to be their primary key, and
 are no more eager to reuse one than you are to see it reused.

Well, they may eventually be forced to. For the same sort of hierarchic issue
that causes the shortage of IPv4 address space even though there's far less
than 4 billion hosts online.

But as far as I can see today the only postal codes that are being retired are
rural areas that are being developed and have blocks of codes assigned instead
of having a single broad code.

 Basically this comes down to I'm going to use some externally supplied
 primary key as my primary key.  Do I trust the upstream DBA to know what
 a primary key is?

Well there's another issue here I think. Often people see something that looks
unique and is clearly intended to be a primary key and think aha, nice
primary key but they miss a subtle distinction between what the external
primary key represents and what their data model is tracking.

The typical example is social security numbers. SSNs are a perfectly
reasonable primary key -- as long as you're tracking Social Security accounts,
not people. Most people in the US have exactly one SS account, so people often
think it looks like a primary key for people. In fact not everyone has a
Social Security account (aliens who have never worked in the US, or for that
matter people who have never been in the US) and others have had multiple
Social Security accounts (victims of identity theft).

Another example that comes to mind is the local telephone company. When I
changed my phone number they created a new account without telling me, because
their billing system's primary key for accounts is... the phone number. So all
my automated bill payments started disappearing into the black hole of the old
account and my new account went negative. I wonder what they do for customers
who buy services from them but don't have a telephone line.

-- 
greg


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Data type to use for primary key

2004-11-24 Thread Alexandre Leclerc
On 24 Nov 2004 01:52:52 -0500, Greg Stark [EMAIL PROTECTED] wrote:
 Alexandre Leclerc [EMAIL PROTECTED] writes:
 
  Thanks for those tips. I'll print and keep them. So in my case, the
  product_code being varchar(24) is:
  4 bytes + string size (so possibly up to 24) = possible 28 bytes. I
  did the good thing using a serial. For my shorter keys (4 bytes + up
  to 6 char) I will use the natural key.
 
 Realize that space usage is really only part of the issue.

Thank you for this additionnal information. This will help out in the
futur. In my situation this is a good thing to have integer key where
I decided to have them. Event if I was obliged to add UNIQUE
constraints to some other columns. I think they call this candidate
key and it's still 3NF (whatever; but only if my db is correctly
organised)... I try to be logical and efficient for good performance.
But in the end, the time (the db will get bigger) and good EXPLAIN
ANALYSE commands will help fine tuning later! This will give me more
experience at that point.

 Actually I see one interesting exception to my policy in my current database
 schema. And I don't think I would do this one differently given the choice
 either. The primary key of the postal code table is the postal code. (postal
 codes are up here in the great white north like zip codes down there.)

(I do understand this one, living in the province of Quebec. ;) And
the great white north is still not arrived; end november! - Still, not
very exceptionnal.)

Regards.

-- 
Alexandre Leclerc

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Data type to use for primary key

2004-11-23 Thread Alexandre Leclerc
On Mon, 22 Nov 2004 16:54:56 -0800, Josh Berkus [EMAIL PROTECTED] wrote:
 Alexandre,
 
  What is the common approach? Should I use directly the product_code as
  my ID, or use a sequantial number for speed? (I did the same for the
  company_id, this is a 'serial' and not the shor name of the customer.
  I just don't know what is usually done.
 
 Don't use SERIAL just because it's there.Ideally, you *want* to use the
 product_code if you can.   It's your natural key and a natural key is always
 superior to a surrogate key all other things being equal.
 
 Unfortunately, all other things are NOT equal.Here's the reasons why you'd
 use a surrogate key (i.e. SERIAL):
 
 1) because the product code is a large text string  (i.e.  10bytes) and you
 will have many millions of records, so having it as an FK in other tables
 will add significantly to the footprint of the database;

Thanks for those tips. I'll print and keep them. So in my case, the
product_code being varchar(24) is:
4 bytes + string size (so possibly up to 24) = possible 28 bytes. I
did the good thing using a serial. For my shorter keys (4 bytes + up
to 6 char) I will use the natural key.

This is interesting, because this is what I did right now.

The transparent surrogate keying proposal that is discussed bellow
in the thread is a very good idea. It would be nice to see that. It
would be easier for the DB admin and the coder; the moment this is not
slowing the system. : )

Best regards.

-- 
Alexandre Leclerc

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Data type to use for primary key

2004-11-23 Thread Pierre-Frdric Caillaud

All,
	Well, you should still escape any strings you're getting from a web  
page so
you can ensure you're not subject to a SQL insert attack, even if you're
expecting integers.
Thanks,
Peter Darley
Well, your framework should do this for you :
	integer specified in your database object class description
	%d appears in in your generated queries (or you put it in your hand  
written queries)
	= if the parameter is not an integer, an exception is thrown, then  
catched, then an error page is displayed...

Or, just casting to int should throw an exception...
	Forms should be validated, but hidden parameters in links are OK imho to  
display an error page if they are incorrect, after all, if the user edits  
the get or post parameters, well...

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Data type to use for primary key

2004-11-23 Thread Greg Stark
Alexandre Leclerc [EMAIL PROTECTED] writes:

 Thanks for those tips. I'll print and keep them. So in my case, the
 product_code being varchar(24) is:
 4 bytes + string size (so possibly up to 24) = possible 28 bytes. I
 did the good thing using a serial. For my shorter keys (4 bytes + up
 to 6 char) I will use the natural key.

Realize that space usage is really only part of the issue.

If you ever have two records with the same natural key or a record whose
natural key has changed you'll be in for a world of hurt if you use the
natural key as the primary key in your database.

Basically I never use natural keys except when they're arbitrarily chosen
values defined by the application itself.

Situations where I've used varchars instead of integer keys are things like:

. Individual privileges grantable in a security system.
  (things like VIEWUSER EDITUSER privileges)

. Reference tables for one letter codes used to indicate the type of object
  represented by the record.

Actually I see one interesting exception to my policy in my current database
schema. And I don't think I would do this one differently given the choice
either. The primary key of the postal code table is the postal code. (postal
codes are up here in the great white north like zip codes down there.)

This could hurt if they ever reuse an old previously retired postal code,
which isn't an entirely impossible case. As far as I know it hasn't happened
yet though. And it's just so much more convenient having the postal code handy
instead of having to join against another table to look it up.

-- 
greg


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] Data type to use for primary key

2004-11-22 Thread Alexandre Leclerc
Good day,

I'm asking myself if there is a performance issue in using an integer
of varchar(24) PRIMARY KEY in a product table.

I've read that there is no speed issue in the query, but that the only
performance issue is the database size of copying the key in other
tables that require it.

My product_id is copied in orders, jobs, and other specific tables.

What is the common approach? Should I use directly the product_code as
my ID, or use a sequantial number for speed? (I did the same for the
company_id, this is a 'serial' and not the shor name of the customer.
I just don't know what is usually done.

Right now I did the following:
CREATE TABLE design.products (
product_id  serial  PRIMARY KEY,
company_id  integer NOT NULL REFERENCES sales.companies ON
UPDATE CASCADE,
product_codevarchar(24) NOT NULL,
...
CONSTRAINT product_code_already_used_for_this_company UNIQUE
(company_id, product_code)
);

CREATE TABLE sales.companies (
company_idinteger  PRIMARY KEY,
company_name  varchar(48)  NOT NULL UNIQUE,
...
);

The company_id is also copied in many tables like product, contacts, etc.

Thank you very much for any good pointers on this 'already seen' issue.

-- 
Alexandre Leclerc

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Data type to use for primary key

2004-11-22 Thread Pierre-Frdric Caillaud

What is the common approach? Should I use directly the product_code as
my ID, or use a sequantial number for speed? (I did the same for the
company_id, this is a 'serial' and not the shor name of the customer.
I just don't know what is usually done.
	Use a serial :
	- you can change product_code for a product easily
	- you can pass around integers easier around, in web forms for instance,  
you don't have to ask 'should I escape this string ?'
	- it's faster
	- it uses less space
	- if one day you must manage products from another source whose  
product_code overlap yours, you won't have problems
	- you can generate them with a serial uniquely and easily

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] Data type to use for primary key

2004-11-22 Thread Alexandre Leclerc
Mr Caillaud,

Merci! Many points you bring were also my toughts. I was asking myself
really this was the way to go. I'm happy to see that my view of the
problem was good.

Encore merci! (Thanks again!)

On Tue, 23 Nov 2004 00:06:13 +0100, Pierre-Frédéric Caillaud
[EMAIL PROTECTED] wrote:
 
  What is the common approach? Should I use directly the product_code as
  my ID, or use a sequantial number for speed? (I did the same for the
  company_id, this is a 'serial' and not the shor name of the customer.
  I just don't know what is usually done.
 
 Use a serial :
 - you can change product_code for a product easily
 - you can pass around integers easier around, in web forms for 
 instance,
 you don't have to ask 'should I escape this string ?'
 - it's faster
 - it uses less space
 - if one day you must manage products from another source whose
 product_code overlap yours, you won't have problems
 - you can generate them with a serial uniquely and easily
 
 ---(end of broadcast)---
 TIP 9: the planner will ignore your desire to choose an index scan if your
   joining column's datatypes do not match
 


-- 
Alexandre Leclerc

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Data type to use for primary key

2004-11-22 Thread Josh Berkus
Alexandre,

 What is the common approach? Should I use directly the product_code as
 my ID, or use a sequantial number for speed? (I did the same for the
 company_id, this is a 'serial' and not the shor name of the customer.
 I just don't know what is usually done.

Don't use SERIAL just because it's there.Ideally, you *want* to use the 
product_code if you can.   It's your natural key and a natural key is always 
superior to a surrogate key all other things being equal.   

Unfortunately, all other things are NOT equal.Here's the reasons why you'd 
use a surrogate key (i.e. SERIAL):

1) because the product code is a large text string  (i.e.  10bytes) and you 
will have many millions of records, so having it as an FK in other tables 
will add significantly to the footprint of the database;

2) because product codes get blanket changes frequently, where thousands of 
them pet re-mapped to new codes, and the ON CASCADE UPDATE slow performance 
will kill your database;

3) Because every other table in the database has a SERIAL key and consistency 
reduces errors;

4) or because your interface programmers get annoyed with using different 
types of keys for different tables and multicolumn keys.

If none of the above is true (and I've had it not be, in some tables and some 
databases) then you want to stick with your natural key, the product_code.


-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Data type to use for primary key

2004-11-22 Thread Rod Taylor
On Mon, 2004-11-22 at 16:54 -0800, Josh Berkus wrote:
 Alexandre,
 
  What is the common approach? Should I use directly the product_code as
  my ID, or use a sequantial number for speed? (I did the same for the
  company_id, this is a 'serial' and not the shor name of the customer.
  I just don't know what is usually done.
 
 Don't use SERIAL just because it's there.Ideally, you *want* to use the 
 product_code if you can.   It's your natural key and a natural key is always 
 superior to a surrogate key all other things being equal.   

It would be nice if PostgreSQL had some form of transparent surrogate
keying in the background which would automatically run around and
replace your real data with SERIAL integers. It could use a lookup table
for conversions between the surrogate and real values so the user never
knows that it's done, a bit like ENUM. Then we could all use the real
values with no performance issues for 1) because it's an integer in the
background, and 2) because a cascade only touches a single tuple in the
lookup table.


-- 


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Data type to use for primary key

2004-11-22 Thread Josh Berkus
Rod,

 It would be nice if PostgreSQL had some form of transparent surrogate
 keying in the background which would automatically run around and
 replace your real data with SERIAL integers. It could use a lookup table
 for conversions between the surrogate and real values so the user never
 knows that it's done, a bit like ENUM. Then we could all use the real
 values with no performance issues for 1) because it's an integer in the
 background, and 2) because a cascade only touches a single tuple in the
 lookup table.

Sybase does this, and it's a feature I would dearly love to emulate.  You can 
just refer to another table, without specifying the column, as an FK and it 
will create an invisible hashed key.   This is the type of functionality Codd 
was advocating -- invisible, implementation-automated surrogate keys -- in 
the mid 90's (don't have a paper citation at the moment).

So you'd just do:

create table client_contacts (
fname text not null,
lname text not null,
client foriegn key clients,
position text,
notes text
);

and the client column would create an invisible hashed key that would drag 
in the relevant row from the clients table; thus a:

select * from client_contacts

would actually show the whole record from clients as well.


-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match