[SQL] Warehouse Schema

2006-05-24 Thread Worky Workerson
I'm developing a schema for a large data warehouse (10 billion records) and had a couple of questions about how to optimize it.  My biggest question is on the design of the fact table and how much normalization I should actually do of the data.  
The data is going to be keyed by IP address, stored as a non-unique IP4 type.  Each record is also tagged with various attributes, such as a category and a type.  Assuming that a category and type are VARCHAR, would it make sense to normalize these out of the fact table into their respective tables and key them by an INTEGER? 
I.e.CREATE TABLE big_fact_table_A (
    identifier IP4,
    data1   BYTEA,
    data2   BYTEA,
    ...
    dataN  BYTEA,
    category    VARCHAR(16),
    type    VARCHAR(16)
);... vs ...CREATE TABLE big_fact_table_B (    identifier IP4,    data1   BYTEA,    data2   BYTEA,    ...    dataN  BYTEA,    category    INTEGER REFERENCES categories (category_id),
    type    INTEGER REFERENCES types (type_id));I figure that the normalized fact table should be quicker, as the integer is much smaller than the varchar.  On query, however, the table will need to be joined against the two other tables (categories, types), but I still figure that this is a win because the other tables are fairly small and should stay resident in memory.  Is this reasoning valid?
The downside to this (from my perspective) is that the data comes in the form of big_fact_table_A and could be easily COPYed straight into the table.  with big_fact_table_B it looks like I will have to do the "unJOIN" in a script.  Also, I have separate installations of the warehouse (with different data sources) and it will be difficult to share data between them unless their categories/types tables are keyed with exactly the same integer IDs which, as I don't directly control the other installations, is not guaranteed.
Any suggestions to the above "problems"?


[SQL] Displaying first, last, count columns

2006-06-21 Thread Worky Workerson

I'm having a bit of a brain freeze and can't seem to come up with
decent SQL for the following problem:

I have a table "t" of the form "time_occurred TIMESTAMP, prog_data
VARCHAR" and would like to create a query that outputs something of
the form "first_seen, last_seen, count, prog_data".

I have the current query which gets the first_seen and last_seen via
subqueries, ala

SELECT t1.time_occurred AS first_seen, t2.time_occurred AS last_seen,
t3.count, t1.prog_data
FROM t AS t1, t AS t2
WHERE t1.prog_data = t2.prog_data
   AND t1.time_occurred IN (SELECT min(time_occurred) FROM t WHERE
prog_data = t1.prog_data)
   AND t2.time_occurred IN (SELECT max(time_occurred) FROM t WHERE
prog_data = t1.prog_data)

but I can't seem to work out how to get the count of all the records
that have.  I figure that this is probably a relatively common idiom
... can anyone suggest ways to go about doing this.  Also, the
performance of this is pretty horrible, but I figure that creating a
column on t.prog_data should speed things up noticably, right?

Thanks!

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


[SQL] Updating from table with multiple possibilities

2006-10-03 Thread Worky Workerson

I have two tables, ips and ranges, where ips contains a list of ips
(INET) and ranges contains a list of ip blocks (CIDR), ala:

CREATE TABLE ips (ip INET, info varchar);
CREATE TABLE ranges (range CIDR, info varchar);

I would like to update the info column in ips to be the info column of
the most specific block in ranges.  For example, if ranges had the
following data:

0.0.0.0/0, 'top level class'
18.0.0.0/8, 'MIT'
18.228.0.0/20, 'MIT Group B'

and ips only had the row "'18.228.0.1', NULL" then doing an update
would change the column in ips to "'18.228.0.1', 'MIT Group B'"

I can't quite figure out the update syntax to get this to work.  This
obviously does not work (as mentioned in the manual), as there are
multiple results returned from the 'join' and I only want the 'best'
match:

UPDATE ips SET info = ranges.info FROM ranges WHERE ip <<= range

I figure that I have to use subselects, but I can't seem to figure out
how to get the data out of the subselect (i.e. the info column).  Do I
need to join ips and ranges explicitly?

Thanks!

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster