Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-17 Thread Simon Riggs
 Iain
 Joe's example wasn't excluding partitions, as he didn't use a
 predicated UNION ALL view to select from. His queries use an indexed
 column that allows the various partitions to be probed at low cost,
 and he was satisfied with that.

Agreed - very very interesting design though.

 My point in my previous post was that you could still do all that if
 you wanted to, by building the predicated view with UNION ALL of
 each of the child tables.


AFAICS of all the designs proposed there is still only one design *using
current PostgreSQL* that allows partitions to be excluded from queries as a
way of speeding up queries against very large tables: UNION ALL with
appended constants.
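
For reference, this is the shape of the "UNION ALL with appended
constants" design (a minimal sketch; table and column names are
illustrative, following the BIGTABLE example later in this thread):

CREATE VIEW bigtable (idate, col1) AS
SELECT 200409130800, col1 FROM table200409130800
UNION ALL
SELECT 200409131000, col1 FROM table200409131000;

-- the constant comparison is folded per branch at plan time, reducing
-- non-matching branches to a one-time 'false' test instead of a scan
SELECT count(*) FROM bigtable WHERE idate = 200409131000;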

Best Regards, Simon Riggs




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-16 Thread Iain
Joe, Christopher,

Joe's example wasn't excluding partitions, as he didn't use a predicated UNION
ALL view to select from. His queries use an indexed column that allows the
various partitions to be probed at low cost, and he was satisfied with that.

My point in my previous post was that you could still do all that if you
wanted to, by building the predicated view with UNION ALL of each of the
child tables.

regards
Iain
- Original Message - 
From: Joe Conway [EMAIL PROTECTED]
To: Christopher Browne [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, September 16, 2004 2:17 PM
Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --


 Christopher Browne wrote:
  In the last exciting episode, [EMAIL PROTECTED] (Joe Conway) wrote:
 That's exactly what we're doing, but using inherited tables instead of
 a union view. With inheritance, there is no need to rebuild the view
 each time a table is added or removed. Basically, in our application,
 tables are partitioned by either month or week, depending on the type
 of data involved, and queries are normally date qualified.

  Where does the constraint come in that'll allow most of the data to be
  excluded?

 Not sure I follow this.

  Or is this just that the entries are all part of bigtable so that
  the self join is only 2-way?

 We don't have a need for self-joins in our application. We do use a
 crosstab function to materialize some transposed views of the data,
 however. That allows us to avoid self-joins in the cases where we might
 otherwise need them.

 Joe





Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-16 Thread Markus Schaber
Hi, Mischa,

On Tue, 14 Sep 2004 22:58:20 GMT
Mischa Sandberg [EMAIL PROTECTED] wrote:

 Googling 'upsert' (an Oraclism, I believe) will get you hits on Oracle 
 and DB2's implementation of MERGE, which does what AMOUNTS to what is
 described below (one mass UPDATE...FROM, one mass INSERT...WHERE NOT 
 EXISTS).
 
 No, you shouldn't iterate row-by-row through the temp table.
 Whenever possible, try to do updates in one single (mass) operation.
 Doing it that way gives the optimizer the best chance at amortizing
 fixed costs, and batching operations.

But when every updated row has a different value for the column(s) to be
updated, then I still have to use one update statement per row, which I
expect to be faster when done via a stored procedure than having the
whole client-server roundtrip including parsing every time. Or did I
miss some nice SQL statement?
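
(For reference: the single-statement idiom being hinted at is UPDATE ...
FROM, which lets one mass statement give every matched row a different
value. A minimal sketch, with hypothetical table and column names:)

-- each permanent row receives its own value from the staging table
UPDATE perm
   SET val = t.val
  FROM temp_batch t
 WHERE perm.id = t.id;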


Have a nice day,
Markus
-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-16 Thread Markus Schaber
Hi, Steve,

On Wed, 15 Sep 2004 21:17:03 -0700
Steve Atkins [EMAIL PROTECTED] wrote:

 On Wed, Sep 15, 2004 at 11:16:44AM +0200, Markus Schaber wrote:
  But you have to add table constraints restricting the time after adding
  the partition?
 
 Uhm... unless I'm confused that's not a meaningful thing in this context.
 There's no rule that's putting insertions into an inherited table - the
 decision of which inherited table to insert into is made at application
 level.

I thought of the query optimizer. I thought it could use the table
constraints to drop tables when creating the union. But now I think that
an index gives enough win, because the tree-based indices are rather
quick at returning zero rows when the queried value is out of the
indexed range.

Greetings,
Markus


-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-16 Thread Joe Conway
Iain wrote:
Joe's example wasn't excluding partitions, as he didn't use a predicated UNION
ALL view to select from. His queries use an indexed column that allows the
various partitions to be probed at low cost, and he was satisfied with that.
Right.
My point in my previous post was that you could still do all that if you
wanted to, by building the predicated view with UNION ALL of each of the
child tables.
Right. It doesn't look that much different:
create or replace view foo_vw as
select * from foo_2004_01 where f2 >= '2004-jan-01' and f2 <= '2004-jan-31'
union all
select * from foo_2004_02 where f2 >= '2004-feb-01' and f2 <= '2004-feb-29'
union all
select * from foo_2004_03 where f2 >= '2004-mar-01' and f2 <= '2004-mar-31'
;
 -- needed just for illustration since these are toy tables
set enable_seqscan to false;
explain analyze select * from foo_vw where f2 = '2004-feb-15';
 QUERY PLAN
--
 Subquery Scan foo_vw  (cost=0.00..14.54 rows=3 width=16) (actual 
time=0.022..0.027 rows=1 loops=1)
   -  Append  (cost=0.00..14.51 rows=3 width=16) (actual 
time=0.019..0.022 rows=1 loops=1)
 -  Subquery Scan *SELECT* 1  (cost=0.00..4.84 rows=1 
width=16) (actual time=0.004..0.004 rows=0 loops=1)
   -  Index Scan using foo_2004_01_idx2 on foo_2004_01 
(cost=0.00..4.83 rows=1 width=16) (actual time=0.003..0.003 rows=0 loops=1)
 Index Cond: ((f2 >= '2004-01-01'::date) AND (f2 <= 
'2004-01-31'::date) AND (f2 = '2004-02-15'::date))
 -  Subquery Scan *SELECT* 2  (cost=0.00..4.84 rows=1 
width=16) (actual time=0.013..0.015 rows=1 loops=1)
   -  Index Scan using foo_2004_02_idx2 on foo_2004_02 
(cost=0.00..4.83 rows=1 width=16) (actual time=0.009..0.010 rows=1 loops=1)
 Index Cond: ((f2 >= '2004-02-01'::date) AND (f2 <= 
'2004-02-29'::date) AND (f2 = '2004-02-15'::date))
 -  Subquery Scan *SELECT* 3  (cost=0.00..4.84 rows=1 
width=16) (actual time=0.001..0.001 rows=0 loops=1)
   -  Index Scan using foo_2004_03_idx2 on foo_2004_03 
(cost=0.00..4.83 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=1)
 Index Cond: ((f2 >= '2004-03-01'::date) AND (f2 <= 
'2004-03-31'::date) AND (f2 = '2004-02-15'::date))
 Total runtime: 0.188 ms
(12 rows)

regression=# explain analyze select * from foo where f2 = '2004-feb-15';
 QUERY PLAN
--
 Result  (cost=1.00..10073.70 rows=20 width=16) (actual 
time=0.059..0.091 rows=1 loops=1)
   -  Append  (cost=1.00..10073.70 rows=20 width=16) 
(actual time=0.055..0.086 rows=1 loops=1)
 -  Seq Scan on foo  (cost=1.00..10022.50 rows=5 
width=16) (actual time=0.001..0.001 rows=0 loops=1)
   Filter: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_01_idx2 on foo_2004_01 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.045..0.045 rows=0 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_02_idx2 on foo_2004_02 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.008..0.009 rows=1 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_03_idx2 on foo_2004_03 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.029..0.029 rows=0 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.191 ms
(11 rows)

The main difference being that the view needs to be recreated every time 
a table is added or dropped, whereas with the inherited tables method 
that isn't needed.

Joe


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Leeuw van der, Tim
Hi,

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Michael Kleiser
Sent: Tuesday, September 14, 2004 4:23 PM
To: Leeuw van der, Tim
Cc: Steinar H. Gunderson; PostgreSQL Performance List
Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --


 Which MySQL table type did you use?
 Was it MyISAM, which doesn't support transactions?
 Yes, I read that bulk inserts with this table type are very fast.
 In data warehousing one often doesn't need transactions.

Although totally beyond the scope of this thread, we used InnoDB tables with MySQL
because of the transaction support.

regards,

--Tim



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Christopher Browne
In the last exciting episode, [EMAIL PROTECTED] (Joe Conway) wrote:
 That's exactly what we're doing, but using inherited tables instead of
 a union view. With inheritance, there is no need to rebuild the view
 each time a table is added or removed. Basically, in our application,
 tables are partitioned by either month or week, depending on the type
 of data involved, and queries are normally date qualified.

Sounds interesting, and possibly usable.

Where does the constraint come in that'll allow most of the data to be
excluded?

Or is this just that the entries are all part of bigtable so that
the self join is only 2-way?
-- 
let name=cbbrowne and tld=acm.org in name ^ @ ^ tld;;
http://linuxfinances.info/info/advocacy.html
Be humble.   A lot happened before  you were born.   - Life's Little
Instruction Book



Re: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread simon

Joe Conway [EMAIL PROTECTED] wrote on 15.09.2004, 06:30:24:
 Chris Browne wrote:
  Might we set up the view as:
  
  create view combination_of_logs as
select * from table_1 where txn_date between 'this' and 'that' 
 union all
select * from table_2 where txn_date between 'this2' and 'that2' 
 union all
select * from table_3 where txn_date between 'this3' and 'that3' 
 union all
select * from table_4 where txn_date between 'this4' and 'that4' 
 union all
 ... ad infinitum
 union all
   select * from table_n where txn_date >= 'start_of_partition_n';
  
  and expect that to help, as long as the query that hooks up to this
  has date constraints?
  
  We'd have to regenerate the view with new fixed constants each time we
  set up the tables, but that sounds like it could work...
 
 That's exactly what we're doing, but using inherited tables instead of a 
 union view. With inheritance, there is no need to rebuild the view each 
 time a table is added or removed. Basically, in our application, tables 
 are partitioned by either month or week, depending on the type of data 
 involved, and queries are normally date qualified.
 
 We're not completely done with our data conversion (from a commercial 
 RDBMSi), but so far the results have been excellent. Similar to what 
 others have said in this thread, the conversion involved restructuring 
 the data to better suit Postgres, and the application (data 
 analysis/mining vs. the source system which is operational). As a result 
we've compressed a > 1TB database down to ~0.4TB, and seen at least one 
 typical query reduced from ~9 minutes down to ~40 seconds.

Sounds interesting.

The performance gain comes from partition elimination of the inherited
tables under the root?

I take it the compression comes from use of arrays, avoiding the need
for additional rows and key overhead?

Best Regards, Simon Riggs



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Markus Schaber
Hi,

On Tue, 14 Sep 2004 22:10:04 -0700
Steve Atkins [EMAIL PROTECTED] wrote:

  Is there by any chance a set of functions to manage adding and removing
  partitions? Certainly this can be done by hand, but having a set of
  tools would make life much easier. I just looked but didn't see anything
  on GBorg.
 
 I've done a similar thing with time-segregated data by inheriting
 all the partition tables from an (empty) parent table.
 
 Adding a new partition is just a create table tablefoo () inherits(bigtable)
 and removing a partition just drop table tablefoo.

But you have to add table constraints restricting the time after adding
the partition?

Thanks,
Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
[EMAIL PROTECTED] wrote:
Joe Conway [EMAIL PROTECTED] wrote on 15.09.2004, 06:30:24:
We're not completely done with our data conversion (from a commercial 
RDBMSi), but so far the results have been excellent. Similar to what 
others have said in this thread, the conversion involved restructuring 
the data to better suit Postgres, and the application (data 
analysis/mining vs. the source system which is operational). As a result 
we've compressed a > 1TB database down to ~0.4TB, and seen at least one 
typical query reduced from ~9 minutes down to ~40 seconds.
Sounds interesting.
The performance gain comes from partition elimination of the inherited
tables under the root?
I take it the compression comes from use of arrays, avoiding the need
for additional rows and key overhead?
Sorry, in trying to be concise I was not very clear. I'm using the term 
"compression" very generally here. I'll try to give a bit more background.

The original data source is a database schema designed for use by an 
operational application that my company sells to provide enhanced 
management of equipment that we also sell. The application needs to be 
very flexible in exactly what data it stores in order to be useful 
across a wide variety of equipment models and versions. In order to do 
that there is a very large central transaction table that stores 
name-value pairs in varchar columns. The name-value pairs come from 
parsed output of the equipment, and as such there is a fair amount of 
redundancy and unneeded data that ends up getting stored. At each 
installation in the field this table can get very large (> 1 billion 
rows). Additionally the application prematerializes a variety of 
summaries for use by the operators using the GUI.

We collect the data exported from each of the systems in the field and 
accumulate it in a single central database for data mining and analysis. 
This is the database that is actually being converted. By compression I 
really mean that unneeded and redundant data is being stripped out, and 
data known to be of a certain datatype is stored in that type instead of 
varchar (e.g. values known to be int are stored as int). Also the 
summaries are not being converted (although we do some post processing 
to create new materialized summaries).
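
(A rough illustration of that restructuring -- all names here are
hypothetical, not the actual schema:)

-- source style: generic name/value pairs, everything stored as text
create table txn_pairs (
    equip_id int,
    name     varchar(64),
    value    varchar(255)  -- even numeric readings stored as varchar
);

-- converted style: values known to be int are stored as int, and
-- redundant or unneeded pairs are dropped during conversion
create table txn_typed (
    equip_id   int,
    reading_ts timestamp,
    temp_c     int
);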

My points in telling this were:
 - the use of inherited tables to partition this huge number of rows and
   yet allow simple query access to it seems to work well, at least in
   early validation tests
 - had we simply taken the original database and slammed it into
   Postgres with no further thought, we would not have seen the big
   improvements, and thus the project might have been seen as a failure
   (even though it saves substantial $)
Hope that's a bit more clear. I'm hoping to write up a more detailed 
case study once we've cut the Postgres system into production and the 
dust settles a bit.

Joe


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Josh Berkus
Joe,

   - the use of inherited tables to partition this huge number of rows and
 yet allow simple query access to it seems to work well, at least in
 early validation tests
   - had we simply taken the original database and slammed it into
 Postgres with no further thought, we would not have seen the big
 improvements, and thus the project might have been seen as a failure
 (even though it saves substantial $)

Any further thoughts on developing this into true table partitioning?

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Josh Berkus wrote:
 - the use of inherited tables to partition this huge number of rows and
   yet allow simple query access to it seems to work well, at least in
   early validation tests
 - had we simply taken the original database and slammed it into
   Postgres with no further thought, we would not have seen the big
   improvements, and thus the project might have been seen as a failure
   (even though it saves substantial $)

Any further thoughts on developing this into true table partitioning?
Just that I'd love to see it happen ;-)
Maybe someday I'll be able to find the time to work on it myself, but 
for the moment I'm satisfied with the workarounds we've made.

Joe


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables?

2004-09-15 Thread Simon Riggs

Chris Browne [EMAIL PROTECTED] wrote on 15.09.2004, 04:34:53:
 [EMAIL PROTECTED] (Simon Riggs) writes:
  Well, it's fairly straightforward to auto-generate the UNION ALL view,
  and important as well, since it needs to be re-specified each time a new
  partition is loaded or an old one is cleared down. The main point is that
  the constant placed in front of each table must in some way relate to the
  data, to make it useful in querying. If it is just a unique constant,
  chosen at random, it won't do much for partition elimination. So, that
  tends to make the creation of the UNION ALL view an application/data
  specific thing.

 Ah, that's probably a good thought.

 When we used big UNION ALL views, it was with logging tables, where
 there wasn't really any meaningful distinction between partitions.

 So you say that if the VIEW contains, within it, meaningful constraint
 information, that can get applied to chop out irrelevant bits?

 That suggests a way of resurrecting the idea...

 Might we set up the view as:

 create view combination_of_logs as
   select * from table_1 where txn_date between 'this' and 'that'
union all
   select * from table_2 where txn_date between 'this2' and 'that2'
union all
   select * from table_3 where txn_date between 'this3' and 'that3'
union all
   select * from table_4 where txn_date between 'this4' and 'that4'
union all
... ad infinitum
union all
   select * from table_n where txn_date >= 'start_of_partition_n';

 and expect that to help, as long as the query that hooks up to this
 has date constraints?


That way of phrasing the view can give you the right answer to the
query, but does not exclude partitions.

With the above way of doing things, you end up with predicate phrases of
the form ((PARTLIMITLO <= partcol) AND (PARTLIMITHI >= partcol) AND
(partcol >= QUERYLIMITLO) AND (partcol <= QUERYLIMITHI))
...if the values in capitals are constants, then this should evaluate to
a true or false value for each partition table. The optimizer doesn't
yet do this.

If you specify the view the way I showed, then the predicate query
becomes a comparison of constants, which DOES get evaluated prior to
full execution...you will see this as a One-Time Filter: false in the
EXPLAIN output.

The way you've phrased the view is the more obvious way to phrase it,
and I'd spent a few days trying to work out how to solve the algebra
above in code...but that was wasted effort.

Anyway, if you use constants you can still specify ranges and betweens
and have them work... hence my example showed date-like integers - but
I don't think it just applies to one datatype.

Best Regards, Simon Riggs




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Simon Riggs

Joe,

Your application is very interesting. I've just read your OSCON paper. I'd
like to talk more about that. Very similar to Kalido.

...but back to partitioning momentarily: Does the performance gain come from
partition elimination of the inherited tables under the root?

Best Regards, Simon Riggs






Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Simon Riggs wrote:
Joe,
Your application is very interesting. I've just read your OSCON paper. I'd
like to talk more about that. Very similar to Kalido.
...but back to partitioning momentarily: Does the performance gain come from
partition elimination of the inherited tables under the root?
I think the major part of the performance gain comes from the fact that 
the source database has different needs in terms of partitioning 
criteria because of its different purpose. The data is basically 
partitioned by customer installation instead of by date. Our converted 
scheme partitions by date, which is in line with the analytical queries 
run at the corporate office. Again, this is an argument in favor of not 
simply porting what you're handed.

We might get similar query performance with a single large table and 
multiple partial indexes (e.g. one per month), but there would be one 
tradeoff and one disadvantage to that:
1) The indexes would need to be generated periodically -- this is a 
tradeoff since we currently need to create inherited tables at the same 
periodicity
2) It would be much more difficult to roll off a month's worth of data 
when needed. The general idea is that each month we create a new monthly 
table, then archive and drop the oldest monthly table. If all the data 
were in one big table we would have to delete many millions of rows from 
a (possibly) multibillion row table, and then vacuum that table -- no 
thanks ;-)

Joe


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Iain wrote:
That's exactly what we're doing, but using inherited tables instead of a
union view. With inheritance, there is no need to rebuild the view each
time a table is added or removed. Basically, in our application, tables
are partitioned by either month or week, depending on the type of data
involved, and queries are normally date qualified.
That sounds interesting. I have to admit that I haven't touched inheritance in
pg at all yet so I find it hard to imagine how this would work. If you have
a chance, would you mind elaborating on it just a little?
OK, see below:
=
create table foo(f1 int, f2 date, f3 float8);
create table foo_2004_01() inherits (foo);
create table foo_2004_02() inherits (foo);
create table foo_2004_03() inherits (foo);
create index foo_2004_01_idx1 on foo_2004_01(f2);
create index foo_2004_02_idx1 on foo_2004_02(f2);
create index foo_2004_03_idx1 on foo_2004_03(f2);
insert into foo_2004_02 values(1,'2004-feb-15',3.14);
 -- needed just for illustration since these are toy tables
set enable_seqscan to false;
explain analyze select * from foo where f2 = '2004-feb-15';
QUERY PLAN

 Result  (cost=1.00..10061.32 rows=16 width=16) (actual 
time=0.224..0.310 rows=1 loops=1)
   -  Append  (cost=1.00..10061.32 rows=16 width=16) 
(actual time=0.214..0.294 rows=1 loops=1)
 -  Seq Scan on foo  (cost=1.00..10022.50 rows=5 
width=16) (actual time=0.004..0.004 rows=0 loops=1)
   Filter: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.101..0.101 rows=0 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo 
(cost=0.00..4.68 rows=1 width=16) (actual time=0.095..0.101 rows=1 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.066..0.066 rows=0 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.582 ms
(11 rows)

create table foo_2004_04() inherits (foo);
create index foo_2004_04_idx1 on foo_2004_04(f2);
explain analyze select * from foo where f2 = '2004-feb-15';
QUERY PLAN

 Result  (cost=1.00..10078.38 rows=21 width=16) (actual 
time=0.052..0.176 rows=1 loops=1)
   -  Append  (cost=1.00..10078.38 rows=21 width=16) 
(actual time=0.041..0.159 rows=1 loops=1)
 -  Seq Scan on foo  (cost=1.00..10022.50 rows=5 
width=16) (actual time=0.004..0.004 rows=0 loops=1)
   Filter: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.012..0.012 rows=0 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo 
(cost=0.00..4.68 rows=1 width=16) (actual time=0.016..0.022 rows=1 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.008..0.008 rows=0 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 -  Index Scan using foo_2004_04_idx1 on foo_2004_04 foo 
(cost=0.00..17.07 rows=5 width=16) (actual time=0.095..0.095 rows=0 loops=1)
   Index Cond: (f2 = '2004-02-15'::date)
 Total runtime: 0.443 ms
(13 rows)

For loading data, we COPY into foo, and have a trigger that redirects 
the rows to the appropriate partition.
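
(A minimal sketch of that kind of redirect trigger -- the partition test
is illustrative, not the actual production code, and only one branch is
shown. Note that COPY fires per-row triggers, so this works for bulk
loads too:)

create or replace function foo_redirect() returns trigger as '
begin
    if new.f2 >= ''2004-02-01'' and new.f2 < ''2004-03-01'' then
        insert into foo_2004_02 values (new.f1, new.f2, new.f3);
    else
        raise exception ''no partition for date %'', new.f2;
    end if;
    return NULL;  -- suppress the insert into the parent table
end;
' language plpgsql;

create trigger foo_redirect_trig before insert on foo
    for each row execute procedure foo_redirect();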

Notice that the partitions which do not contain any data of interest are 
still probed for data, but since they have none it is very quick. In a 
real life example I got the following results just this afternoon:

 - aggregate row count = 471,849,665
 - total number inherited tables = 216
   (many are future dated and therefore contain no data)
 - select one month's worth of data for one piece of equipment by serial
   number (49,257 rows) = 526.015 ms
Not too bad -- quick enough for my needs. BTW, this is using NFS mounted 
storage (NetApp NAS).

Joe


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Steve Atkins
On Wed, Sep 15, 2004 at 11:16:44AM +0200, Markus Schaber wrote:
 Hi,
 
 On Tue, 14 Sep 2004 22:10:04 -0700
 Steve Atkins [EMAIL PROTECTED] wrote:
 
   Is there by any chance a set of functions to manage adding and removing
   partitions? Certainly this can be done by hand, but having a set of
   tools would make life much easier. I just looked but didn't see anything
   on GBorg.
  
  I've done a similar thing with time-segregated data by inheriting
  all the partition tables from an (empty) parent table.
  
  Adding a new partition is just a create table tablefoo () inherits(bigtable)
  and removing a partition just drop table tablefoo.
 
 But you have to add table constraints restricting the time after adding
 the partition?

Uhm... unless I'm confused that's not a meaningful thing in this context.
There's no rule that's putting insertions into an inherited table - the
decision of which inherited table to insert into is made at application
level.

As I was using it to segregate data based on creation timestamp the
application just inserts into the 'newest' inherited table until it's
full, then creates a new inherited table.

I've no doubt you could set up rules to scatter inserted data across
a number of tables, but that's not something that's been applicable
for the problems I tend to work with, so I've not looked at it.

Cheers,
  Steve




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-15 Thread Mischa Sandberg
Simon Riggs wrote:
Jim C. Nasby
On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
PostgreSQL's functionality is in many ways similar to Oracle
Partitioning.
Loading up your data in many similar tables, then creating a view like:
CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
SELECT 200409130800, col1, col2, col3... FROM table200409130800
UNION ALL
SELECT 200409131000, col1, col2, col3... FROM table200409131000
UNION ALL
SELECT 200409131200, col1, col2, col3... FROM table200409131200
...etc...
will allow the PostgreSQL optimizer to eliminate partitions
from the query
when you run queries which include a predicate on the
partitioning_col, e.g.
select count(*) from bigtable where idate >= 200409131000
The partitions are just tables, so no need for other management tools.
Oracle treats the partitions as sub-tables, so you need a range of commands
to add, swap etc the partitions of the main table.
A few years ago I wrote a federated query engine (wrapped as an ODBC 
driver) that had to handle thousands of contributors (partitions) to a 
pseudotable / VIEWofUNIONs. Joins did require some special handling in 
the optimizer, because of the huge number of crossproducts between 
different tables. It was definitely worth the effort at the time, 
because you need different strategies for: joining a partition to 
another partition on the same subserver; joining two large partitions on 
different servers; and joining a large partition on one server to a 
small one on another.

The differences may not be so great for a solitary server;
but they're still there, because of disparity in subtable sizes. The 
simplistic query plans tend to let you down, when you're dealing with 
honking warehouses.

I'm guessing that Oracle keeps per-subtable AND cross-all-subtables 
statistics, rather than building the latter from scratch in the course 
of evaluating the query plan. That's the one limitation I see in 
emulating their partitioned tables with Views.



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Mischa Sandberg
Googling 'upsert' (an Oraclism, I believe) will get you hits on Oracle 
and DB2's implementation of MERGE, which does what AMOUNTS to what is
described below (one mass UPDATE...FROM, one mass INSERT...WHERE NOT 
EXISTS).

No, you shouldn't iterate row-by-row through the temp table.
Whenever possible, try to do updates in one single (mass) operation.
Doing it that way gives the optimizer the best chance at amortizing
fixed costs, and batching operations.
-
In any database other than Postgres, I would recommend doing the
INSERT /followed by/ the UPDATE. That order looks wonky --- your update
ends up pointlessly operating on the rows just INSERTED. The trick is, 
UPDATE acquires and holds write locks (the rows were previously visible 
to other processes), while INSERT's write locks refer to rows that no 
other process could try to lock.

Stephen Frost wrote:
* Markus Schaber ([EMAIL PROTECTED]) wrote:
Generally, what is the fastest way for doing bulk processing of 
update-if-primary-key-matches-and-insert-otherwise operations?

This is a very good question, and I haven't seen much of an answer to it
yet.  I'm curious about the answer myself, actually.  In the more recent
SQL specs, from what I understand, this is essentially what the 'MERGE'
command is for.  This was recently added and unfortunately is not yet
supported in Postgres.  Hopefully it will be added soon.
Otherwise, what I've done is basically an update followed by an insert
using outer joins.  If there's something better, I'd love to hear about
it.  The statements looks something like:
update X
  set colA = a.colA,
  colB = a.colB
  from Y a
  where keyA = a.keyA and
keyB = a.keyB;
insert into X
  select a.keyA,
 a.keyB,
 a.colA,
 a.colB
  from Y a left join X b
   using (keyA, keyB)
  where b.keyA is NULL and
b.keyB is NULL;
With the appropriate indexes, this is pretty fast but I think a merge
would be much faster.
Thanks,
			Stephen


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Christopher Kings-Lynne
insert into X
  select a.keyA,
 a.keyB,
 a.colA,
 a.colB
  from Y a left join X b
   using (keyA, keyB)
  where b.keyA is NULL and
b.keyB is NULL;
With the appropriate indexes, this is pretty fast but I think a merge
would be much faster.
Problem is it's subject to race conditions if another process is 
inserting stuff at the same time...

Chris
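
(One way to close that race, at the cost of serializing writers, is an
explicit lock around the pair of statements -- a sketch using Stephen's
table names, simplified to a single key column:)

BEGIN;
-- blocks concurrent writers (but not readers) until commit, so no row
-- can slip in between the UPDATE and the INSERT
LOCK TABLE X IN SHARE ROW EXCLUSIVE MODE;
UPDATE X SET colA = a.colA FROM Y a WHERE X.keyA = a.keyA;
INSERT INTO X
  SELECT a.keyA, a.colA
    FROM Y a LEFT JOIN X b USING (keyA)
   WHERE b.keyA IS NULL;
COMMIT;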


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Iain
Hi Joe,

You went to quite a bit of effort, thanks I have the picture now.

Using inheritance seems to be a useful refinement on top of the earlier
outlined approach using the UNION ALL view with appropriate predicates on the
condition used to do the partitioning. Having the individual partitions
derived from a parent table makes a lot of sense.

regards
Iain


- Original Message - 
From: Joe Conway [EMAIL PROTECTED]
To: Iain [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, September 16, 2004 1:07 PM
Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --


 Iain wrote:
 That's exactly what we're doing, but using inherited tables instead of a
 union view. With inheritance, there is no need to rebuild the view each
 time a table is added or removed. Basically, in our application, tables
 are partitioned by either month or week, depending on the type of data
 involved, and queries are normally date qualified.
 
  That sounds interesting. I have to admit that I haven't touched
inheritance in
  pg at all yet so I find it hard to imagine how this would work. If you
have
  a chance, would you mind elaborating on it just a little?

 OK, see below:
 =

 create table foo(f1 int, f2 date, f3 float8);

 create table foo_2004_01() inherits (foo);
 create table foo_2004_02() inherits (foo);
 create table foo_2004_03() inherits (foo);

 create index foo_2004_01_idx1 on foo_2004_01(f2);
 create index foo_2004_02_idx1 on foo_2004_02(f2);
 create index foo_2004_03_idx1 on foo_2004_03(f2);

 insert into foo_2004_02 values(1,'2004-feb-15',3.14);


   -- needed just for illustration since these are toy tables
 set enable_seqscan to false;
 explain analyze select * from foo where f2 = '2004-feb-15';

 QUERY PLAN
 --
--
   Result  (cost=1.00..10061.32 rows=16 width=16) (actual
 time=0.224..0.310 rows=1 loops=1)
 -  Append  (cost=1.00..10061.32 rows=16 width=16)
 (actual time=0.214..0.294 rows=1 loops=1)
   -  Seq Scan on foo  (cost=1.00..10022.50 rows=5
 width=16) (actual time=0.004..0.004 rows=0 loops=1)
 Filter: (f2 = '2004-02-15'::date)
   -  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo
 (cost=0.00..17.07 rows=5 width=16) (actual time=0.101..0.101 rows=0
loops=1)
 Index Cond: (f2 = '2004-02-15'::date)
   -  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo
 (cost=0.00..4.68 rows=1 width=16) (actual time=0.095..0.101 rows=1
loops=1)
 Index Cond: (f2 = '2004-02-15'::date)
   -  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo
 (cost=0.00..17.07 rows=5 width=16) (actual time=0.066..0.066 rows=0
loops=1)
 Index Cond: (f2 = '2004-02-15'::date)
   Total runtime: 0.582 ms
 (11 rows)

 create table foo_2004_04() inherits (foo);
 create index foo_2004_04_idx1 on foo_2004_04(f2);

 explain analyze select * from foo where f2 = '2004-feb-15';

 QUERY PLAN
 --
--
   Result  (cost=1.00..10078.38 rows=21 width=16) (actual
 time=0.052..0.176 rows=1 loops=1)
 -  Append  (cost=1.00..10078.38 rows=21 width=16)
 (actual time=0.041..0.159 rows=1 loops=1)
   -  Seq Scan on foo  (cost=1.00..10022.50 rows=5
 width=16) (actual time=0.004..0.004 rows=0 loops=1)
 Filter: (f2 = '2004-02-15'::date)
   -  Index Scan using foo_2004_01_idx1 on foo_2004_01 foo
 (cost=0.00..17.07 rows=5 width=16) (actual time=0.012..0.012 rows=0
loops=1)
 Index Cond: (f2 = '2004-02-15'::date)
   -  Index Scan using foo_2004_02_idx1 on foo_2004_02 foo
 (cost=0.00..4.68 rows=1 width=16) (actual time=0.016..0.022 rows=1
loops=1)
 Index Cond: (f2 = '2004-02-15'::date)
   -  Index Scan using foo_2004_03_idx1 on foo_2004_03 foo
 (cost=0.00..17.07 rows=5 width=16) (actual time=0.008..0.008 rows=0
loops=1)
 Index Cond: (f2 = '2004-02-15'::date)
   -  Index Scan using foo_2004_04_idx1 on foo_2004_04 foo
 (cost=0.00..17.07 rows=5 width=16) (actual time=0.095..0.095 rows=0
loops=1)
 Index Cond: (f2 = '2004-02-15'::date)
   Total runtime: 0.443 ms
 (13 rows)

 For loading data, we COPY into foo, and have a trigger that redirects
 the rows to the appropriate partition.

 Notice that the partitions which do not contain any data of interest are
 still probed for data, but since they have none it is very quick. In a
 real life example I got the following results just this afternoon:

   - aggregate row count = 471,849,665
   - total number inherited tables = 216
 (many are future dated and therefore contain no data)
   - select one month's worth of data

Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-15 Thread Joe Conway
Christopher Browne wrote:
In the last exciting episode, [EMAIL PROTECTED] (Joe Conway) wrote:
That's exactly what we're doing, but using inherited tables instead of
a union view. With inheritance, there is no need to rebuild the view
each time a table is added or removed. Basically, in our application,
tables are partitioned by either month or week, depending on the type
of data involved, and queries are normally date qualified.

Where does the constraint come in that'll allow most of the data to be
excluded?
Not sure I follow this.
Or is this just that the entries are all part of bigtable so that
the self join is only 2-way?
We don't have a need for self-joins in our application. We do use a 
crosstab function to materialize some transposed views of the data, 
however. That allows us to avoid self-joins in the cases where we might 
otherwise need them.

Joe


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Pierre-Frédéric Caillaud

Performance hint :
	For static data, do not normalize too much.
	For instance if you have a row which can be linked to several other rows,  
you can do this :

create table parents (
id  serial primary key,
values... )
create table children (
id serial primary key,
parent_id integer references parents(id),
slave_value integer )
Or you can do this, using an array :
create table everything (
id  serial primary key,
children_values integer[],
values... )
	Pros :
	No Joins. Getting the list of children_values from table everything is  
just a select.
	On an application with several million rows, a query lasting 150 ms with  
a Join takes 30 ms with an array.
	You can build the arrays from normalized tables by using an aggregate  
function.
	You can index the array elements with a GIST index...

	Cons :
	No joins, thus your queries are a little bit limited ; problems if the  
array is too long ;
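
(For the GiST indexing mentioned above, a sketch using the contrib
intarray module -- it assumes the "everything" table from the example
and requires intarray to be installed:)

-- GiST index over the integer array, via intarray's opclass
create index everything_children_idx
    on everything using gist (children_values gist__int_ops);

-- rows whose children_values array contains the element 42
select id from everything where children_values @ '{42}';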





Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Markus Schaber
Hi, Mischa,

On Sun, 12 Sep 2004 20:47:17 GMT
Mischa Sandberg [EMAIL PROTECTED] wrote:

 On the other hand, if you do warehouse-style loading (Insert, or PG 
 COPY, into a temp table; and then 'upsert' into the perm table), I can 
 guarantee 2500 inserts/sec is no problem.

As we can foresee that we'll have similar insert rates to cope with in
the not-so-far future, what do you mean by 'upsert'? Do you mean a
stored procedure that iterates over the temp table?

Generally, what is the fastest way for doing bulk processing of 
update-if-primary-key-matches-and-insert-otherwise operations?

Thanks,
Markus Schaber

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Leeuw van der, Tim
Hi,

I found bulk inserts to perform very slowly when compared to MySQL / Oracle. All inserts 
were done in 1 transaction. However, mitigating factors here were:
- Application was a .Net application using ODBC drivers
- PostgreSQL 7.3 running on CYGWIN with cygipc daemon
- Probably very bad tuning in the config file, if any tuning done at all
- The application was issuing 'generic' SQL since it was generally used with Oracle 
and MySQL databases. So no tricks like using COPY or multiple rows with 1 INSERT 
statement. No stored procedures either.
- When doing queries, most of the time the results were comparable to or better than 
MySQL (the only other database that I tested with myself).


So what I can say is, that if you want fast INSERT performance from PostgreSQL then 
you'll probably have to do some trickery that you wouldn't have to do with a default 
MySQL installation.
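
(For reference, the main PostgreSQL-specific trick meant here is COPY,
which loads rows in bulk with far less per-statement overhead than
individual INSERTs -- a sketch with hypothetical names:)

-- tab-delimited input, one row per line
COPY mytable (id, val) FROM '/tmp/data.txt';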

regards,

--Tim


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Markus Schaber
Sent: Tuesday, September 14, 2004 2:15 PM
To: PostgreSQL Performance List
Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --


Hi, Mischa,

On Sun, 12 Sep 2004 20:47:17 GMT
Mischa Sandberg [EMAIL PROTECTED] wrote:

 On the other hand, if you do warehouse-style loading (Insert, or PG 
 COPY, into a temp table; and then 'upsert' into the perm table), I can 
 guarantee 2500 inserts/sec is no problem.

As we can foresee that we'll have similar insert rates to cope with in
the not-so-far future, what do you mean by 'upsert'? Do you mean a
stored procedure that iterates over the temp table?

Generally, what is the fastest way for doing bulk processing of 
update-if-primary-key-matches-and-insert-otherwise operations?

Thanks,
Markus Schaber

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Aaron Werman
 Mark Cotner wrote:

  The time has come to reevaluate/rearchitect an
  application which I built about 3 years ago.  There
  are no performance concerns with MySQL, but it would
  benefit greatly from stored procedures, views, etc.


From: Mischa Sandberg [EMAIL PROTECTED]

 If your company is currently happy with MySQL, there probably are other
 (nontechnical) reasons to stick with it. I'm impressed that you'd
 consider reconsidering PG.

I'd like to second Mischa on that issue. In general, if you migrate an
*existing* application from one RDBMS to another, you should expect
performance to decrease significantly. This is always true in a well
performing system even if the replacement technology is more sophisticated.
This is because of several factors.

Even if you try to develop in totally agnostic generic SQL, you are always
customizing to a feature set, namely the ones in the current system. Any
existing application has had substantial tuning and tweaking, and the new
one is at a disadvantage. Moreover, an existing system is a Skinnerian
reward/punishment system to the developers and DBAs, rewarding or punishing
them for very environment specific choices - resulting in an application,
dbms, OS, and platform that are both explicitly and unconsciously customized
to work together in a particular manner.

The net effect is a rule of thumb that I use:

NEVER reimplement an existing system unless the project includes substantial
functional improvement.

Every time I've broken that rule, I've found that users' expectations, based
on the application they are used to, are locked in. Any place where the new
system is slower, the users are dissatisfied; where it exceeds expectations
it isn't appreciated: the users are used to the old system's quirks, and the
improvements only leave them uncomfortable since the system acts
differently. (I've broken the rule on occasion for standardization
conversions.)

My expectation is that pg will not get a fair shake here. If you do it - I'd
like to see the results anyway.

/Aaron



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Steinar H. Gunderson
On Tue, Sep 14, 2004 at 02:42:20PM +0200, Leeuw van der, Tim wrote:
 - PostgreSQL 7.3 running on CYGWIN with cygipc daemon

Isn't this doomed to kill your performance anyhow?

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Leeuw van der, Tim

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Steinar H. Gunderson
Sent: Tuesday, September 14, 2004 3:33 PM
To: PostgreSQL Performance List
Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --


 On Tue, Sep 14, 2004 at 02:42:20PM +0200, Leeuw van der, Tim wrote:
  - PostgreSQL 7.3 running on CYGWIN with cygipc daemon
 
 Isn't this doomed to kill your performance anyhow?

Yes and no, therefore I mentioned it explicitly as one of the caveats. When doing 
selects I could get performance very comparable to MySQL, so I don't want to blame 
poor insert-performance on cygwin/cygipc per se.
I'm not working on this app. anymore and don't have a working test-environment for it 
anymore so I cannot retest now with more recent versions.

regards,

--Tim

 
 /* Steinar */
 -- 
 Homepage: http://www.sesse.net/





Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Harald Fuchs
In article [EMAIL PROTECTED],
Leeuw van der, Tim [EMAIL PROTECTED] writes:

 So what I can say is, that if you want fast INSERT performance from
 PostgreSQL then you'll probably have to do some trickery that you
 wouldn't have to do with a default MySQL installation.

I think the word INSERT is superfluous in the above sentence ;-)

Contrary to MySQL, you can't expect decent PostgreSQL performance on
decent hardware without some tuning.




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Michael Kleiser
Which MySQL table type did you use?
Was it MyISAM, which doesn't support transactions?
Yes, I read that bulk inserts with this table type are very fast.
In data warehousing one often doesn't need transactions.
Leeuw van der, Tim schrieb:
-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] Behalf Of Steinar H. 
Gunderson
Sent: Tuesday, September 14, 2004 3:33 PM
To: PostgreSQL Performance List
Subject: Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

  On Tue, Sep 14, 2004 at 02:42:20PM +0200, Leeuw van der, Tim wrote:
   - PostgreSQL 7.3 running on CYGWIN with cygipc daemon
 
  Isn't this doomed to kill your performance anyhow?
Yes and no, therefore I mentioned it explicitly as one of the caveats. 
When doing selects I could get performance very comparable to MySQL, so 
I don't want to blame poor insert-performance on cygwin/cygipc per se.
I'm not working on this app. anymore and don't have a working 
test-environment for it anymore so I cannot retest now with more recent 
versions.

regards,
--Tim
 
  /* Steinar */
  --
  Homepage: http://www.sesse.net/



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Stephen Frost
* Markus Schaber ([EMAIL PROTECTED]) wrote:
 Generally, what is the fastest way for doing bulk processing of 
 update-if-primary-key-matches-and-insert-otherwise operations?

This is a very good question, and I haven't seen much of an answer to it
yet.  I'm curious about the answer myself, actually.  In the more recent
SQL specs, from what I understand, this is essentially what the 'MERGE'
command is for.  This was recently added and unfortunately is not yet
supported in Postgres.  Hopefully it will be added soon.

Otherwise, what I've done is basically an update followed by an insert
using outer joins.  If there's something better, I'd love to hear about
it.  The statements looks something like:

update X
  set colA = a.colA,
  colB = a.colB
  from Y a
  where keyA = a.keyA and
keyB = a.keyB;

insert into X
  select a.keyA,
 a.keyB,
 a.colA,
 a.colB
  from Y a left join X b
   using (keyA, keyB)
  where b.keyA is NULL and
b.keyB is NULL;

With the appropriate indexes, this is pretty fast but I think a merge
would be much faster.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Harald Lau (Sector-X)

 From: Mischa Sandberg [EMAIL PROTECTED]
 
  If your company is currently happy with MySQL, there probably are
  other (nontechnical) reasons to stick with it. I'm impressed that
  you'd consider reconsidering PG.
 
 I'd like to second Mischa on that issue.

Though both of you are right from my point of view, I don't think
it's very useful to discuss this item here.

Having once migrated a MySQL DB to PG I can confirm that, in fact,
chances are good you will be unhappy if you adopt the MySQL
data model and the SQL 1:1.
Likewise, PG has to be configured and optimized much more than
MySQL.
And the client application will likely need to be modified to a
certain extent, particularly if you want to move some (or more)
business logic from the client to the database.

But, from what Mark has stated so far, I'm sure he is not going to migrate
his app just for fun, or without having considered this.

 NEVER reimplement an existing system unless the project includes
 substantial functional improvement.

... or monetary issues.
I know one big database that was migrated from Oracle to PG and
another from SQLServer to PG because of licence costs. Definitely
there are more.
That applies to MySQL, too; its licence policy is somewhat obscure to
me, but under certain circumstances you have to pay.

regards Harald





Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Vivek Khera
 MC == Mark Cotner [EMAIL PROTECTED] writes:

MC I've finished porting the schema and am importing the
MC data now.  My estimates for just two-thirds(60 of the
MC 90 days) of one of our 30 cable systems(MySQL dbs) is
MC estimated to take about 16 hours.  This may seem like
MC a lot, but I'm satisfied with the performance.  I've

be sure to load your data without indexes defined for your initial
import.

check your logs to see if increasing checkpoint_segments is
recommended.  I found that bumping it up to 50 helped speed up my
data loads (restore from dump) significantly.
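
(The corresponding postgresql.conf change -- the value is the one
reported above; tune it to your own load:)

# more WAL segments between checkpoints => fewer checkpoints during bulk loads
checkpoint_segments = 50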

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.Khera Communications, Inc.
Internet: [EMAIL PROTECTED]   Rockville, MD  +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread aaron werman
From: Harald Lau (Sector-X) [EMAIL PROTECTED]
...
 From: Mischa Sandberg [EMAIL PROTECTED]

  If your company is currently happy with MySQL, there probably are
  other (nontechnical) reasons to stick with it. I'm impressed that
  you'd consider reconsidering PG.

 I'd like to second Mischa on that issue.
Though both of you are right from my point of view, I don't think
it's very useful to discuss this item here.
It is kinda windy for the list, but the point is that a big part of 
performance is developer expectation and user expectation. I'd hope to lower 
expectations before we see an article in eWeek. Perhaps this thread should 
move to the advocacy list until the migration needs specific advice.




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Jim C. Nasby
On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
 PostgreSQL's functionality is in many ways similar to Oracle Partitioning.
 
 Loading up your data in many similar tables, then creating a view like:
 
 CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
 SELECT 200409130800, col1, col2, col3... FROM table200409130800
 UNION ALL
 SELECT 200409131000, col1, col2, col3... FROM table200409131000
 UNION ALL
 SELECT 200409131200, col1, col2, col3... FROM table200409131200
 ...etc...
 
 will allow the PostgreSQL optimizer to eliminate partitions from the query
 when you run queries which include a predicate on the partitioning_col, e.g.
 
 select count(*) from bigtable where idate >= 200409131000
 
 will scan the last two partitions only...
 
 There are a few other ways of creating the view that return the same answer,
 but only using constants in that way will allow the partitions to be
 eliminated from the query; the other forms run for much longer.

Is there by any chance a set of functions to manage adding and removing
partitions? Certainly this can be done by hand, but having a set of
tools would make life much easier. I just looked but didn't see anything
on GBorg.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Simon Riggs
 Jim C. Nasby
 On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
  PostgreSQL's functionality is in many ways similar to Oracle
 Partitioning.
 
  Loading up your data in many similar tables, then creating a view like:
 
  CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
  SELECT 200409130800, col1, col2, col3... FROM table200409130800
  UNION ALL
  SELECT 200409131000, col1, col2, col3... FROM table200409131000
  UNION ALL
  SELECT 200409131200, col1, col2, col3... FROM table200409131200
  ...etc...
 
  will allow the PostgreSQL optimizer to eliminate partitions
 from the query
  when you run queries which include a predicate on the
 partitioning_col, e.g.
 
  select count(*) from bigtable where idate >= 200409131000
 
  will scan the last two partitions only...
 
  There are a few other ways of creating the view that return the
 same answer,
  but only using constants in that way will allow the partitions to be
  eliminated from the query; other formulations will run for much longer.

 Is there by any chance a set of functions to manage adding and removing
 partitions? Certainly this can be done by hand, but having a set of
 tools would make life much easier. I just looked but didn't see anything
 on GBorg.

Well, it's fairly straightforward to auto-generate the UNION ALL view, and
important as well, since it needs to be re-specified each time a new
partition is loaded or an old one is cleared down. The main point is that
the constant placed in front of each table must in some way relate to the
data, to make it useful in querying. If it is just a unique constant, chosen
at random, it won't do much for partition elimination. So, that tends to
make the creation of the UNION ALL view an application/data specific thing.
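
As a rough sketch of such auto-generation (untested, assuming partitions
named tableYYYYMMDDHH with the constant embedded in the table name, and
using 8.0-style dollar quoting for readability):

  CREATE OR REPLACE FUNCTION rebuild_bigtable_view() RETURNS void AS $$
  DECLARE
      v_sql text := '';
      r record;
  BEGIN
      FOR r IN SELECT tablename FROM pg_tables
               WHERE tablename LIKE 'table20%' ORDER BY tablename LOOP
          IF v_sql <> '' THEN
              v_sql := v_sql || ' UNION ALL ';
          END IF;
          -- the constant prepended to each branch comes from the table name
          v_sql := v_sql || 'SELECT ' || substr(r.tablename, 6)
                         || ' AS idate, col1, col2, col3 FROM ' || r.tablename;
      END LOOP;
      EXECUTE 'CREATE OR REPLACE VIEW bigtable AS ' || v_sql;
  END;
  $$ LANGUAGE plpgsql;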

The partitions are just tables, so no need for other management tools.
Oracle treats the partitions as sub-tables, so you need a range of commands
to add, swap etc the partitions of the main table.

I guess a set of tools that emulates that functionality would be generically
a good thing, if you can see a way to do that.

Oracle partitions were restricted to allowing only a single load statement
into a given partition at any time, whereas in PostgreSQL multiple COPY
statements can load a single partition table concurrently.

BTW, multi-dimensional partitioning is also possible using the same general
scheme.

Best Regards, Simon Riggs




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Michael Glaesemann
On Sep 15, 2004, at 8:32 AM, Simon Riggs wrote:
 The partitions are just tables, so no need for other management tools.
 Oracle treats the partitions as sub-tables, so you need a range of
 commands to add, swap etc the partitions of the main table.

 I guess a set of tools that emulates that functionality would be
 generically a good thing, if you can see a way to do that.

 Oracle partitions were restricted to allowing only a single load statement
 into a given partition at any time, whereas in PostgreSQL multiple COPY
 statements can load a single partition table concurrently.
How does this compare to DB2 partitioning?
Michael Glaesemann
grzm myrealbox com


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Christopher Browne
[EMAIL PROTECTED] (Simon Riggs) wrote:
 The main point is that the constant placed in front of each table
 must in some way relate to the data, to make it useful in
 querying. If it is just a unique constant, chosen at random, it
 won't do much for partition elimination.

It just struck me - this is much the same notion as that of cutting
planes used in Integer Programming.

The approach, there, is that you take a linear program, which can give
fractional results, and throw on as many additional constraints as you
need in order to force the likelihood of particular variables falling
on integer values.  The constraints may appear redundant, but
declaring them allows the answers to be pushed in the right
directions.

In this particular case, the (arguably redundant) constraints let the
query optimizer have criteria for throwing out unnecessary tables.
Thanks for pointing this out; it may turn a fowl into a feature, when
I can get some round tuits :-).  That should allow me to turn an
81-way evil join into something that's 4-way at the worst.

Cheers!
-- 
cbbrowne,@,linuxfinances.info
http://linuxfinances.info/info/nonrdbms.html
Implementing systems is 95% boredom and 5% sheer terror.



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Chris Browne
[EMAIL PROTECTED] (Simon Riggs) writes:
 Well, its fairly straightforward to auto-generate the UNION ALL view, and
 important as well, since it needs to be re-specified each time a new
 partition is loaded or an old one is cleared down. The main point is that
 the constant placed in front of each table must in some way relate to the
 data, to make it useful in querying. If it is just a unique constant, chosen
 at random, it won't do much for partition elimination. So, that tends to
 make the creation of the UNION ALL view an application/data specific thing.

Ah, that's probably a good thought.

When we used big UNION ALL views, it was with logging tables, where
there wasn't really any meaningful distinction between partitions.

So you say that if the VIEW contains, within it, meaningful constraint
information, that can get applied to chop out irrelevant bits?  

That suggests a way of resurrecting the idea...

Might we set up the view as:

create view combination_of_logs as
  select * from table_1 where txn_date between 'this' and 'that' 
   union all
  select * from table_2 where txn_date between 'this2' and 'that2' 
   union all
  select * from table_3 where txn_date between 'this3' and 'that3' 
   union all
  select * from table_4 where txn_date between 'this4' and 'that4' 
   union all
   ... ad infinitum
   union all
  select * from table_n where txn_date > 'start_of_partition_n';

and expect that to help, as long as the query that hooks up to this
has date constraints?

We'd have to regenerate the view with new fixed constants each time we
set up the tables, but that sounds like it could work...
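
For instance, I'd expect a date-qualified query like

  select count(*) from combination_of_logs
   where txn_date between 'this2' and 'that2';

to give the optimizer grounds for discarding every branch whose embedded
range can't overlap the query's range.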
-- 
cbbrowne,@,acm.org
http://www3.sympatico.ca/cbbrowne/x.html
But  what can  you  do with  it?   -- ubiquitous  cry from  Linux-user
partner.  -- Andy Pearce, [EMAIL PROTECTED]



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Joe Conway
Chris Browne wrote:
Might we set up the view as:
create view combination_of_logs as
  select * from table_1 where txn_date between 'this' and 'that' 
   union all
  select * from table_2 where txn_date between 'this2' and 'that2' 
   union all
  select * from table_3 where txn_date between 'this3' and 'that3' 
   union all
  select * from table_4 where txn_date between 'this4' and 'that4' 
   union all
   ... ad infinitum
   union all
  select * from table_n where txn_date > 'start_of_partition_n';

and expect that to help, as long as the query that hooks up to this
has date constraints?
We'd have to regenerate the view with new fixed constants each time we
set up the tables, but that sounds like it could work...
That's exactly what we're doing, but using inherited tables instead of a 
union view. With inheritance, there is no need to rebuild the view each 
time a table is added or removed. Basically, in our application, tables 
are partitioned by either month or week, depending on the type of data 
involved, and queries are normally date qualified.

We're not completely done with our data conversion (from a commercial 
RDBMS), but so far the results have been excellent. Similar to what 
others have said in this thread, the conversion involved restructuring 
the data to better suit Postgres, and the application (data 
analysis/mining vs. the source system which is operational). As a result 
we've compressed a >1TB database down to ~0.4TB, and seen at least one 
typical query reduced from ~9 minutes down to ~40 seconds.

Joe


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --

2004-09-14 Thread Iain
Hi Joe,

 That's exactly what we're doing, but using inherited tables instead of a
 union view. With inheritance, there is no need to rebuild the view each
 time a table is added or removed. Basically, in our application, tables
 are partitioned by either month or week, depending on the type of data
 involved, and queries are normally date qualified.

That sounds interesting. I have to admit that I haven't touched inheritance in
pg at all yet, so I find it hard to imagine how this would work. If you have
a chance, would you mind elaborating on it just a little?

Regards
Iain




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-14 Thread Steve Atkins
On Tue, Sep 14, 2004 at 05:33:33PM -0500, Jim C. Nasby wrote:
 On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote:
  PostgreSQL's functionality is in many ways similar to Oracle Partitioning.
  
  Loading up your data in many similar tables, then creating a view like:
  
  CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
  SELECT 200409130800, col1, col2, col3... FROM table200409130800
  UNION ALL
  SELECT 200409131000, col1, col2, col3... FROM table200409131000
  UNION ALL
  SELECT 200409131200, col1, col2, col3... FROM table200409131200
  ...etc...
[...]
 
 Is there by any chance a set of functions to manage adding and removing
 partitions? Certainly this can be done by hand, but having a set of
 tools would make life much easier. I just looked but didn't see anything
 on GBorg.

I've done a similar thing with time-segregated data by inheriting
all the partition tables from an (empty) parent table.

Adding a new partition is just a create table tablefoo () inherits(bigtable),
and removing a partition is just a drop table tablefoo.
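
In sketch form (names made up for illustration):

  CREATE TABLE bigtable (txn_date timestamp, col1 integer);

  -- add a partition and load it
  CREATE TABLE table20040913 () INHERITS (bigtable);
  COPY table20040913 FROM '/data/20040913.copy';

  -- queries against the parent automatically include every child
  SELECT count(*) FROM bigtable WHERE txn_date >= '2004-09-13';

  -- retire a partition
  DROP TABLE table20040913;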

Cheers,
  Steve




Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Mark Cotner
See comments . . . thanks for the feedback.

'njoy,
Mark

--- Christopher Browne [EMAIL PROTECTED] wrote:

 The world rejoiced as Mischa Sandberg
 [EMAIL PROTECTED] wrote:
  Mark Cotner wrote:
  Requirements:
  Merge table definition equivalent.  We use these
  extensively.
 
  Looked all over mysql.com etc, and afaics merge
 table is indeed
  exactly a view of a union-all. Is that right?
 
  PG supports views, of course, as well (now) as
 tablespaces, allowing
  you to split tables/tablesets across multiple disk
 systems.  PG is
  also pretty efficient in query plans on such
 views, where (say) you
  make one column a constant (identifier, sort of)
 per input table.
 
 The thing that _doesn't_ work well with these sorts
 of UNION views is
 when you do self-joins.  Supposing you have 10
 members, a self-join
 leads to a 100-way join, which is not particularly
 pretty.
 
 I'm quite curious as to how MySQL(tm) copes with
 this, although it may
 not be able to take place; they may not support
 that...
 
  Um, gonna sound silly, but the web interface has
 to remain snappy
  under load.  I don't see this as a major concern
 since you don't
  require table locking.
 
  Agreed. It's more in your warehouse design, and
 intelligent bounding
  of queries. I'd say PG's query analyzer is a few
 years ahead of
  MySQL for large and complex queries.
 
 The challenge comes in if the application has had
 enormous amounts of
 effort put into it to attune it exactly to
 MySQL(tm)'s feature set.
 
 The guys working on RT/3 have found this a
 challenge; they had rather
 a lot of dependencies on its case-insensitive string
 comparisons,
 causing considerable grief.


Not so much, I've tried to be as agnostic as possible.
 Much of the more advanced mining that I've written is
kinda MySQL specific, but needs to be rewritten as
stored procedures anyway.
 
  On the other hand, if you do warehouse-style
 loading (Insert, or PG
  COPY, into a temp table; and then 'upsert' into
 the perm table), I
  can guarantee 2500 inserts/sec is no problem.
 
 The big wins are thus:
 
  1.  Group plenty of INSERTs into a single
 transaction.
 
  2.  Better still, use COPY to cut parsing costs
 plenty more.
 
  3.  Adding indexes _after_ the COPY is a further
 win.
 
 Another possibility is to do clever things with
 stored procs; load
 incoming data using the above optimizations, and
 then run stored
 procedures to use some more or less fancy logic to
 put the data where
 it's ultimately supposed to be.  Having the logic
 running inside the
 engine is the big optimization.

Agreed, I did some preliminary testing today and am
very impressed.  I wasn't used to running analyze
after a data load, but once I did that everything was
snappy.

My best results from MySQL bulk inserts were around 36k
rows per second on a fairly wide table.  Today I got
42k using the COPY command, but with the analyze post
insert the results were similar.  These are excellent
numbers.  It basically means we could have our
cake (great features) and eat it too (performance that's
good enough to run the app).

Queries from my test views were equally pleasing.  I
won't bore you with the details just yet, but
PostgreSQL is doing great.   Not that you all are
surprised.  ;)






Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Christopher Browne
A long time ago, in a galaxy far, far away, [EMAIL PROTECTED] (Mark Cotner) wrote:
 Agreed, I did some preliminary testing today and am very impressed.
 I wasn't used to running analyze after a data load, but once I did
 that everything was snappy.

Something worth observing is that this is true for _any_ of the
database systems supporting a cost-based optimization system,
including Oracle and DB2.

When working with SAP R/3 Payroll, on one project, we found that when
the system was empty of data, the first few employee creates were
quick enough, but it almost immediately got excruciatingly slow.  One
of the DBAs told the Oracle instance underneath to collect statistics
on the main table, and things _immediately_ got snappy again.  But it
didn't get snappy until the conversion folks had run the conversion
process for several minutes, to the point where it had become
painfully slow :-(.  There, with MILLIONS of dollars worth of license
fees being paid, across the various vendors, it still took a fair bit
of manual fiddling.

MySQL(tm) is just starting to get into cost-based optimization; in
that area, they're moving from where the big DBs were about 10 years
ago.  It was either version 7 or 8 where Oracle started moving to
cost-based optimization, and (as with the anecdote above) it took a
release or two for people to get accustomed to the need to 'feed' the
optimizer with statistics.  This is a growing pain that bites users
with any database where this optimization gets introduced.  It's
worthwhile, but is certainly not costless.

I expect some foreseeable surprises will be forthcoming for MySQL AB's
customers in this regard...

 My best results from MySQL bulk inserts was around 36k rows per
 second on a fairly wide table.  Today I got 42k using the COPY
 command, but with the analyze post insert the results were similar.
 These are excellent numbers.  It basically means we could have our
 cake(great features) and eat it too(performance that's good enough
 to run the app).

In the end, performance for inserts is always fundamentally based on
how much disk I/O there is, and so it should come as no shock that
when roughly the same amount of data is getting laid down on disk,
performance won't differ much on these sorts of essentials.

There are a few places where there's some need for cleverness; if you
see particular queries running unusually slowly, it's worth doing an
EXPLAIN or EXPLAIN ANALYZE on them, to see how the query plans are
being generated.  There's some collected wisdom out here on how to
encourage the right plans.
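
For instance (hypothetical table and column), comparing the planner's
estimates against reality is as simple as:

  EXPLAIN ANALYZE
  SELECT count(*) FROM bigtable WHERE txn_date >= '2004-09-01';

If the estimated row counts differ wildly from the actual ones, that's
usually the first hint that ANALYZE needs another run.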

There are also unexpected results that are OK.  We did a system
upgrade a few days ago that led to one of the tables starting out
totally empty.  A summary report that looks at that table wound up
with a pretty wacky looking query plan (compared to what's usual)
because the postmaster knew that the query would be reading in
essentially the entire table.  You'd normally expect an index scan,
looking for data for particular dates.  In this case, it did a "scan
the whole table; filter out a few irrelevant entries" plan.

It looked wacky, compared to what's usual, but it ran in about 2
seconds, which was way FASTER than what's usual.  So the plan was
exactly the right one.

Telling the difference between the right plan and a poor one is a bit
of an art; we quite regularly take a look at query plans on this list
to figure out what might be not quite right.  If you find slow ones,
make sure you have run ANALYZE on the tables recently, to be sure that
the plans are sane, and you may want to consider posting some of them
to see if others can point to improvements that can be made.
-- 
If this was helpful, http://svcs.affero.net/rm.php?r=cbbrowne rate me
http://linuxfinances.info/info/linuxdistributions.html
I can't believe my room doesn't have Ethernet!  Why wasn't it wired
when the house was built?
The house was built in 1576. 
-- Alex Kamilewicz on the Oxford breed of `conference American.'



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-13 Thread Mischa Sandberg
Mark Cotner wrote:
Hi all,
I had a difficult time deciding which list to post
this to, so please forgive me if this list doesn't
perfectly match my questions.  My decision will not
solely be based on performance, but it is the primary
concern.  I would be very appreciative if you all
could comment on my test plan.  Real world examples of
a Postgres implementation of >=600G with a web
front-end would be great, or any data warehouse with
some size to it.
I'm only in the 30GB range of database, in case that's a consideration 
for my comments that follow.

At this time, I'm working out the best ROLAP options for our PG 
transaction store. The transaction store is highly volatile -- the longest a 
txn stays in it is 15 days ... so you can imagine the need for historic 
summaries :-)

I've also implemented multiple data servers, including
a federated server that had to make the best of existing engines
(like MySQL, PG and everything from MSJet to RedBrick in the commercial 
world).

The time has come to reevaluate/rearchitect an
application which I built about 3 years ago.  There
are no performance concerns with MySQL, but it would
benefit greatly from stored procedures, views, etc. 
If your company is currently happy with MySQL, there probably are other 
(nontechnical) reasons to stick with it. I'm impressed that you'd 
consider reconsidering PG.

Some of the mining that we do could benefit from
stored procedures as well.  MySQL may have these in
the works, but we won't be able to move to a version
of MySQL that supports stored procs for another year
or two.
And PG lets you back-end with some powerful pattern- and 
aggregate-handling languages, like Perl. This was definitely a plus for 
data mining of web traffic, for example. The power of server-side 
extensibility for bailing you out of a design dead-end is not 
inconsequential.
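
For instance (assuming plperl has been installed with createlang; the
function itself is just illustrative):

  CREATE FUNCTION count_tokens(text) RETURNS integer AS $$
      my @tokens = split /\s+/, $_[0];
      return scalar @tokens;
  $$ LANGUAGE plperl;

  SELECT count_tokens('alpha beta gamma');  -- returns 3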

PG doesn't have PIVOT operators (qv Oracle and MSSQL), but it makes the 
translation from data to column fairly painless otherwise.
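
With contrib/tablefunc installed, a hypothetical pivot of a stats table
holding (node, metric, value) rows might look like:

  SELECT *
    FROM crosstab('SELECT node, metric, value FROM stats ORDER BY 1, 2')
      AS ct(node text, cpu numeric, mem numeric);

crosstab() folds the (row, category, value) triples so that each metric
becomes its own column.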

Requirements:
Merge table definition equivalent.  We use these
extensively.
Looked all over mysql.com etc, and afaics merge table
is indeed exactly a view of a union-all. Is that right?
PG supports views, of course, as well (now) as tablespaces, allowing you 
to split tables/tablesets across multiple disk systems.
PG is also pretty efficient in query plans on such views, where (say) 
you make one column a constant (identifier, sort of) per input table.

Merge table equivalent with all tables containing over
100M rows (and about 40 columns, some quite wide) will
need to do index scans in at most 5 seconds (MySQL
currently does 2, but we can live with 5) and return
~200 rows.
PG has TOAST for handling REALLY BIG columns, and the generic TEXT type 
is as efficient as any size-specific VARCHAR() type ... should make 
things easier for you.

Um, gonna sound silly, but the web interface has to
remain snappy under load.  I don't see this as a
major concern since you don't require table locking.
Agreed. It's more in your warehouse design, and intelligent bounding of 
queries. I'd say PG's query analyzer is a few years ahead of MySQL for 
large and complex queries.

If business logic is moved to the database (likely with
Postgres), performance for inserting with light logic
on each insert has to keep up with the 4.5M inserts
per 2 hours (which MySQL completes in ~35min
currently).  Acceptable numbers for this aggregation
would be 45-55min using stored procedures.
Again, it's a matter of pipeline design. The tools for creating an 
efficient pipeline are at least as good in PG as MySQL.

If you try to insert and postprocess information one row at a time,
procedures or no, there's no offhand way to guarantee your performance 
without a test/prototype.

On the other hand, if you do warehouse-style loading (Insert, or PG 
COPY, into a temp table; and then 'upsert' into the perm table), I can 
guarantee 2500 inserts/sec is no problem.
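
A minimal sketch of that upsert (hypothetical staging table stage_stats
and permanent table perm_stats, keyed on id):

  BEGIN;
  COPY stage_stats FROM '/tmp/batch.copy';
  -- mass update of the rows that already exist
  UPDATE perm_stats
     SET val = s.val
    FROM stage_stats s
   WHERE perm_stats.id = s.id;
  -- mass insert of the remainder
  INSERT INTO perm_stats (id, val)
  SELECT s.id, s.val
    FROM stage_stats s
   WHERE NOT EXISTS (SELECT 1 FROM perm_stats p WHERE p.id = s.id);
  COMMIT;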

Here's our case study if you're interested . . . 
http://www.mysql.com/customers/customer.php?id=16


Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres

2004-09-12 Thread Gaetano Mendola
Mark Cotner wrote:
Requirements:
Merge table definition equivalent.  We use these
extensively.
What do you mean by merge table definition equivalent?

Regards
Gaetano Mendola





[PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres

2004-09-11 Thread Mark Cotner
Hi all,
I had a difficult time deciding which list to post
this to, so please forgive me if this list doesn't
perfectly match my questions.  My decision will not
solely be based on performance, but it is the primary
concern.  I would be very appreciative if you all
could comment on my test plan.  Real world examples of
a Postgres implementation of >=600G with a web
front-end would be great, or any data warehouse with
some size to it.

The dilemma:
The time has come to reevaluate/rearchitect an
application which I built about 3 years ago.  There
are no performance concerns with MySQL, but it would
benefit greatly from stored procedures, views, etc. 
It is a very large rolling data warehouse that inserts
about 4.5 million rows every 2 hours and subsequently
rolls this data off the back end of a 90 day window. 
A web interface has been designed for querying the
data warehouse.  

Migration planning is much easier with views and
stored procedures and this is my primary reason for
evaluating Postgres once again.  As the application
grows I want to have the ability to provide backward
compatible views for those who are accustomed to the
current structure.  This is not possible in MySQL.  

Some of the mining that we do could benefit from
stored procedures as well.  MySQL may have these in
the works, but we won't be able to move to a version
of MySQL that supports stored procs for another year
or two.

Requirements:
Merge table definition equivalent.  We use these
extensively.

Merge table equivalent with all tables containing over
100M rows (and about 40 columns, some quite wide) will
need to do index scans in at most 5 seconds (MySQL
currently does 2, but we can live with 5) and return
~200 rows.

Um, gonna sound silly, but the web interface has to
remain snappy under load.  I don't see this as a
major concern since you don't require table locking.

If business logic is moved to the database (likely with
Postgres), performance for inserting with light logic
on each insert has to keep up with the 4.5M inserts
per 2 hours (which MySQL completes in ~35min
currently).  Acceptable numbers for this aggregation
would be 45-55min using stored procedures.

About 3 years ago I did some performance
characterizations of Postgres vs. MySQL and didn't
feel Postgres was the best solution.  3 years later
we've won runner-up for MySQL application of the
year (behind Sabre).  Oddly enough, this reevaluation of
database strategy comes right on the coattails of that
award.  I'll begin writing my business logic within
the next week and start migrating test data shortly
thereafter.  Case studies would be very beneficial as
I put together my analysis.

Also, this is for a Fortune 500 company that uses this
data warehouse extensively.  It is an internal
application that is widely used and gets about 4 hits
per employee per day.  Much of customer care, data
engineering, plant engineering(it's a cable company),
and marketing use the interface.  I've done a great
deal of press for MySQL and would be equally willing
to tout the benefits of Postgres to trade rags,
magazines, etc provided the results are favorable.

Here's our case study if you're interested . . . 
http://www.mysql.com/customers/customer.php?id=16

Thoughts, suggestions?

'njoy,
Mark



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres

2004-09-11 Thread Dennis Bjorklund
On Sat, 11 Sep 2004, Mark Cotner wrote:

 There are no performance concerns with MySQL, but it would benefit
 greatly from stored procedures, views, etc.  It is a very large rolling
 data warehouse that inserts about 4.5 million rows every 2 hours and
 subsequently rolls this data off the back end of a 90 day window.

While it is impossible to know without testing, postgresql has the benefit
of readers and writers that do not block each other. So in situations
where you do lots of concurrent inserts and selects postgresql should
behave well.

 Merge table definition equivalent.  We use these extensively.

As far as I can tell a merge table in mysql is the same as a view over a
number of unions of other tables, possibly together with a rule that defines
how inserts into the merged table will be handled.
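
Something like this, for example (hypothetical names, routing all inserts
to the newest partition):

  CREATE RULE merged_insert AS
      ON INSERT TO merged
      DO INSTEAD
      INSERT INTO part_2004_09 VALUES (NEW.id, NEW.txn_date, NEW.val);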

 Merge table equivalent with all tables containing over 100M rows(and
 about 40 columns, some quite wide) will need to do index scans in at
 least 5 seconds(MySQL currently does 2, but we can live with 5) and
 return ~200 rows.

Since each table that is merged will have its own index, the speed should 
be proportional to the number of tables. Index scans in themselves are very 
fast, and if you have 30 tables you need 30 index scans.

Also, are you sure you really need merge tables? With pg having row locks
and mvcc, maybe you could go for a simpler model with just one big table.  
Often you can also combine that with partial indexes to get a smaller
index to use for lots of your queries.
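
For example (hypothetical), if most queries only touch the current month:

  CREATE INDEX stats_sep_idx ON stats (sample_time)
      WHERE sample_time >= '2004-09-01';

The planner will pick the small index whenever a query's WHERE clause
implies the index predicate.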

 Thoughts, suggestions?

I see nothing in what you have written that indicates that pg can not do 
the job, and do it well. It's however very hard to know exactly what is 
the bottleneck before one tries. There are lots of cases where people have 
converted mysql applications to postgresql and have gotten a massive 
speedup. You could be lucky and have such a case, who knows..

I spend some time each day supporting people using postgresql in the
#postgresql irc channel (on the freenode.net network). There I talk to
people doing both small and big conversions and the majority is very happy
with the postgresql performance. Postgresql has gotten faster and faster 
with each release, and while speed was a fair argument a number of years 
ago, it's not like that today.

That said, in the end it depends on the application.

We are all interested in how it goes (well, at least me :-), so feel free
to send more mails keeping us posted. Good luck.

-- 
/Dennis Björklund

