Re: Re: Re: [PERFORM] Data warehousing requirements

2004-10-08 Thread Tom Lane
[EMAIL PROTECTED] writes:
> Unfortunately, yes that's true - that is for correctness, not an
> optimization decision. Outer joins constrain you on both join order AND
> on join type. Nested loops and hash joins avoid touching all rows in
> the right-hand table, which is exactly what you don't want when you
> have a right outer join to perform, since you wish to include rows from
> that table even when there is no match. Thus, we MUST choose a merge
> join even when (if it weren't an outer join) we would have chosen a
> nested loop or hash.

The alternative of course is to flip it around to be a left outer join
so that we can use those plan types.  But depending on the relative
sizes of the two tables this may be a loser.
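
To make that concrete (table and column names below are purely illustrative),
a right outer join can be rewritten as a left outer join simply by swapping
the operands:

  -- these two queries return the same rows
  SELECT f.*, d.name
    FROM dim d RIGHT JOIN facts f ON f.id_dim = d.id;

  SELECT f.*, d.name
    FROM facts f LEFT JOIN dim d ON f.id_dim = d.id;

The second form lets the planner put "dim" on the inner side of a nested loop
or hash; whether that is actually a win depends on the relative sizes of the
two tables, as noted above.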

If you are using a FULL join then it is indeed true that mergejoin is
the only supported plan type.  I don't think that was at issue here
though.

regards, tom lane



Re: [PERFORM] Data warehousing requirements

2004-10-07 Thread Aaron Werman
Consider how the fact table is going to be used, and review hacking it up
based on usage. Fact tables should be fairly narrow, so if there are extra
columns beyond keys and dimension keys consider breaking it into parallel
tables (vertical partitioning).
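
For example (all names hypothetical), a fact table carrying one wide,
rarely-queried column could be split into two tables sharing the surrogate key:

  -- narrow, "hot" part: surrogate key, dimension keys, measures
  CREATE TABLE facts (
      id_fact     integer PRIMARY KEY,
      id_time     integer NOT NULL,
      id_category integer NOT NULL,
      hits        integer NOT NULL
  );

  -- wide, "cold" part, joined in only when it is actually needed
  CREATE TABLE facts_detail (
      id_fact     integer PRIMARY KEY,   -- same key as facts
      user_agent  text
  );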

Horizontal partitioning is your friend, especially if the fact table is
large - consider slicing the data into chunks. If the fact table is date
driven it might be worthwhile to break it into separate tables based on the
date key. This wins in reducing the working set of queries and in buffering.
If there is a real hotspot, such as the current month's activity, you might
want to keep a separate table with just the (most) active data. Static tables
of unchanged data can simplify backups, etc., as well.
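
A minimal sketch of that kind of date slicing (table names hypothetical); on
current PostgreSQL releases it is done by hand, with one table per period and
a UNION ALL view over them:

  -- one physical table per month, identical layout
  CREATE TABLE facts_2004_08 (
      id_time     integer NOT NULL,
      id_category integer NOT NULL,
      hits        integer NOT NULL
  );
  CREATE TABLE facts_2004_09 (LIKE facts_2004_08);

  -- present the slices as one logical fact table
  CREATE VIEW facts_all AS
      SELECT * FROM facts_2004_08
      UNION ALL
      SELECT * FROM facts_2004_09;

Loads then touch only the current month's table, and queries restricted to
one month can hit that table directly.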

Consider summary tables if you know what type of queries you'll hit.
Especially here, MVCC is not your friend because it has extra work to do for
aggregate functions.
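
As an illustration (all names hypothetical), a summary table rebuilt after
each load, so the aggregates are computed once rather than at query time:

  -- precompute per-day, per-category totals
  CREATE TABLE facts_daily_summary AS
      SELECT id_time, id_category, count(*) AS n_hits
      FROM facts
      GROUP BY id_time, id_category;

  -- refresh after each load
  TRUNCATE TABLE facts_daily_summary;
  INSERT INTO facts_daily_summary
      SELECT id_time, id_category, count(*)
      FROM facts
      GROUP BY id_time, id_category;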

Cluster helps if you bulk load.
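
For instance (index name hypothetical), right after a bulk load you can
physically re-sort the table along its date-key index and refresh statistics:

  -- re-sort "facts" in the physical order of its date-key index
  CLUSTER facts_id_time_idx ON facts;
  ANALYZE facts;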

In most warehouses, the data is downstream data from existing operational
systems. Because of that you're not able to use database features to
preserve integrity. In most cases, the data goes through an
extract/transform/load process - and the output is considered acceptable.
So, no RI is correct for a star or snowflake design. Pretty much nothing else
that adds intelligence either - no triggers, no objects, no constraints of any
sort. Many designers try hard to avoid nulls.

On the hardware side - RAID5 might work here because of the low volume if
you can pay the write performance penalty. To size hardware you need to
estimate load in terms of transaction type (I usually make bucket categories
of small, medium, and large effort needs) and transaction rate. Then try to
estimate how much CPU and I/O they'll use.
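
As a rough back-of-envelope against the ~4 million fact rows a month mentioned
in the original post, and assuming (purely a guess) a narrow row of about 100
bytes including per-row overhead:

  4,000,000 rows/month x ~100 bytes/row  =~ 400 MB/month of heap,
  i.e. roughly 5 GB of raw fact data per year, before indexes and summaries.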

/Aaron

Let us not speak of them; but look, and pass on.

- Original Message - 
From: Gabriele Bartolini [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 06, 2004 5:36 PM
Subject: [PERFORM] Data warehousing requirements




Re: [PERFORM] Data warehousing requirements

2004-10-07 Thread Gabriele Bartolini
At 13.30 07/10/2004, Aaron Werman wrote:
> Consider how the fact table is going to be used, and review hacking it up
> based on usage. Fact tables should be fairly narrow, so if there are extra
> columns beyond keys and dimension keys consider breaking it into parallel
> tables (vertical partitioning).
Hmm ... I have only one extra column. Sorry to ask you to confirm this, but
vertical partitioning basically lets me divide a table into two tables (as if
I cut it vertically, right?) that share the same key. If I had two extra
columns, that could be the case, couldn't it?

> Horizontal partitioning is your friend, especially if the fact table is
> large - consider slicing the data into chunks. If the fact table is date
> driven it might be worthwhile to break it into separate tables based on the
> date key. This wins in reducing the working set of queries and in buffering.
> If there is a real hotspot, such as the current month's activity, you might
> want to keep a separate table with just the (most) active data. Static tables
> of unchanged data can simplify backups, etc., as well.
In this case, you mean I could chunk the data into, say, facts_04_08 for the
August 2004 facts. Is that the idea?

Otherwise, am I right in thinking that I could get good results with a
different approach, mixing vertical partitioning and the CLUSTER facility of
PostgreSQL? Can I also vertically partition the dimension keys out of the
fact table, or not?

However, this subject is awesome and interesting. Far out ... data
warehousing really seems to be continuous modelling, doesn't it! :-)

> Consider summary tables if you know what type of queries you'll hit.
At this stage, I can't predict it yet. But of course I need some sort of 
summary. I will keep it in mind.

> Especially here, MVCC is not your friend because it has extra work to do for
> aggregate functions.
Why does it have extra work? Do you mind being more precise, Aaron? It is 
really interesting. (thanks)

> Cluster helps if you bulk load.
Is it maybe because I can update or build them once the load operation has 
finished?

> In most warehouses, the data is downstream data from existing operational
> systems.
That's my case too.
> Because of that you're not able to use database features to
> preserve integrity. In most cases, the data goes through an
> extract/transform/load process - and the output is considered acceptable.
> So, no RI is correct for a star or snowflake design. Pretty much nothing else
> that adds intelligence either - no triggers, no objects, no constraints of
> any sort. Many designers try hard to avoid nulls.
That's another interesting point. Again, I had the space-efficiency principle
in mind, and I decided to use NULL IDs for dimensions when I don't have the
information. I noticed, though, that in those cases I can't use any index and
performance turns out very poor.

I have a dimension table 'categories' referenced through the 'id_category'
field in the fact table. I had decided to set it to NULL whenever there is no
category to associate with the row. I now believe it is better to use a '0'
value when there is no category, so that I can avoid a SELECT * FROM facts
WHERE id_category IS NULL, which does not use the index I had previously
created on that field.
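
A minimal sketch of that convention (the 'unknown' row and its columns are
hypothetical):

  -- reserve a real surrogate key for "no category" instead of using NULL
  INSERT INTO categories (id_category, description) VALUES (0, 'unknown');

  -- fact rows with no category reference id_category = 0, so the ordinary
  -- equality predicate can use the index on facts.id_category
  SELECT * FROM facts WHERE id_category = 0;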

> On the hardware side - RAID5 might work here because of the low volume if
> you can pay the write performance penalty. To size hardware you need to
> estimate load in terms of transaction type (I usually make bucket categories
> of small, medium, and large effort needs) and transaction rate. Then try to
> estimate how much CPU and I/O they'll use.
Thank you so much again Aaron. Your contribution has been really important 
to me.

Ciao,
-Gabriele
Let us not speak of them; but look, and pass on.
P.S.: Dante rules ... :-)
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Prato, Toscana, Italia
[EMAIL PROTECTED] | http://www.prato.linux.it/~gbartolini | ICQ#129221447
 "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno




Re: [PERFORM] Data warehousing requirements

2004-10-07 Thread Josh Berkus
Tom,

Well, I sit corrected.   Obviously I misread that.

> It's not so much that they are necessarily inefficient as that they
> constrain the planner's freedom of action.  You need to think a lot more
> carefully about the order of joining than when you use inner joins.

I've also found that OUTER JOINs constrain the types of joins that can/will be
used as well as the order.  Maybe you didn't intend it that way, but (for
example) OUTER JOINs seem much more likely to use expensive merge joins.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



[PERFORM] Data warehousing requirements

2004-10-06 Thread Gabriele Bartolini
Hi guys,
   I just discussed my problem on IRC. I am building a Web usage
mining system based on Linux, PostgreSQL and C++, made up of an OLTP
database which feeds several multi-purpose data warehouses about users'
behaviour on HTTP servers.

   I modelled every warehouse using the star schema, with a fact table and 
then 'n' dimension tables linked using a surrogate ID.

   Discussing with the guys of the chat, I came up with these conclusions, 
regarding the warehouse's performance:

1) don't use referential integrity in the facts table
2) use INTEGER and avoid SMALLINT and NUMERIC types for dimensions' IDs
3) use an index for every dimension's ID in the fact table
   As far as administration is concerned: run VACUUM ANALYSE daily and 
VACUUM FULL periodically.
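
For reference, a minimal sketch of points 2-3 and of the maintenance routine
above (index names and the id_time column are only illustrative):

  -- one INTEGER surrogate key per dimension, one index per key
  CREATE INDEX facts_id_time_idx     ON facts (id_time);
  CREATE INDEX facts_id_category_idx ON facts (id_category);

  -- routine maintenance
  VACUUM ANALYZE facts;    -- daily
  -- VACUUM FULL facts;    -- periodically, after large deletes or updates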

   Is there anything else I should keep in mind?
   Also, I was looking for advice regarding hardware requirements for a
data warehouse system that needs to satisfy online queries. I have no idea
at the moment. I can only predict about 4 million records a month in the
fact table; does that make sense, or is it too much?

   Data needs to be easily backed up and eventually replicated.
   Having this in mind, what hardware architecture should I look for? How
many hard disks do I need, what kind, and what RAID solution do you suggest
I adopt (5 or 10, I think)?

Thank you so much,
-Gabriele
--
Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check
maintainer
Current Location: Prato, Toscana, Italia
[EMAIL PROTECTED] | http://www.prato.linux.it/~gbartolini | ICQ#129221447
 "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The
Inferno

