[PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Ramon Bastiaans
Hi all,
I am doing research for a project of mine where I need to store several
billion values for a monitoring and historical tracking system for a big
computer system. My current estimate is that I have to store (somehow)
around 1 billion values each month (possibly more).

I was wondering if anyone has had any experience with this kind of data
volume in a PostgreSQL database and how it affects database design and
optimization.

What would be the important issues when setting up a database this big, and
is it at all doable? Or would it be insane to think about storing up
to 5-10 billion rows in a postgres database?

The database's performance is important. There would be no use in
storing the data if a query takes ages. Queries should be quite fast
if possible.

I would really like to hear people's thoughts and suggestions, or
"go see a shrink, you must be mad" statements ;)

Kind regards,
Ramon Bastiaans



Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Jeff
On Mar 1, 2005, at 4:34 AM, Ramon Bastiaans wrote:
What would be the important issues when setting up a database this big,
and is it at all doable? Or would it be insane to think about
storing up to 5-10 billion rows in a postgres database?

Buy a bunch of disks.
And then go out and buy more disks.
When you are done with that - go buy some more disks.
Then buy some ram.
Then buy more disks.
You want the fastest IO possible.
I'd also recommend the Opteron route, since you can also put heaping
gobs of RAM in there as well.

The database's performance is important. There would be no use in
storing the data if a query takes ages. Queries should be quite
fast if possible.

And make sure you tune your queries.
--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Markus Schaber
Hi, Ramon,

Ramon Bastiaans schrieb:

 The database's performance is important. There would be no use in
 storing the data if a query takes ages. Queries should be quite fast
 if possible.

Which kind of query do you want to run?

Queries that involve only a few rows should stay quite fast when you set
up the right indices.
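For example, a minimal sketch (the samples table and its columns here are
assumptions for illustration, not your actual schema):

-- hypothetical schema for the monitoring data
CREATE TABLE samples (
    ts       timestamp        NOT NULL,
    host_id  integer          NOT NULL,
    value    double precision
);

-- a composite index lets "one host, recent time window" queries avoid a seq scan
CREATE INDEX idx_samples_host_ts ON samples (host_id, ts);

-- a query like this can then use the index:
SELECT ts, value
FROM samples
WHERE host_id = 42
  AND ts >= '2005-02-01' AND ts < '2005-03-01';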

However, queries that involve sequential scans over your table (like
average computation) will take ages. Get fast I/O for this. Or, better,
use a multidimensional data warehouse engine. Those can precalculate
the aggregate functions and reports you need, but they require loads of
storage (because of very redundant data storage), and I don't know of
any open source or cheap software of that kind.

Markus
-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Markus Schaber
Hi, John,

John Arbash Meinel schrieb:

 I am doing research for a project of mine where I need to store
 several billion values for a monitoring and historical tracking system
 for a big computer system. My current estimate is that I have to store
 (somehow) around 1 billion values each month (possibly more).

 If you have that 1 billion perfectly distributed over all hours of the
 day, then you need 1e9/30/24/3600 = 385 transactions per second.

I hope that he does not use one transaction per inserted row.

In our in-house tests, we got a speedup factor of up to a few hundred
when bundling rows on insertion. The fastest runs used batches of a few
thousand rows per transaction, with about 5 processes running in
parallel.
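As a rough sketch of what that means in SQL (table and column names are
assumptions, not the actual schema):

-- slow: autocommit, one implicit transaction per inserted row
INSERT INTO samples (ts, host_id, value) VALUES ('2005-03-01 10:00', 42, 0.97);

-- much faster: bundle a few thousand inserts into one transaction
BEGIN;
INSERT INTO samples (ts, host_id, value) VALUES ('2005-03-01 10:00', 42, 0.97);
INSERT INTO samples (ts, host_id, value) VALUES ('2005-03-01 10:00', 43, 0.31);
-- ... a few thousand more rows ...
COMMIT;

-- fastest for bulk loading: COPY, from a server-side file
-- (or via COPY ... FROM STDIN through the client driver)
COPY samples (ts, host_id, value) FROM '/tmp/samples.dat';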

Mind the usual performance tips: use a small but fast-writing RAID for
the transaction log (no RAID-5 or RAID-6 variants), possibly a mirror of
two disk-backed SSDs. Use different disks for the actual data (here,
LVM2 with growing volumes can be very handy). Have enough RAM. Use a
fast file system.

BTW, as you read about the difficulties that you'll face with this
enormous amount of data: don't think that your task will be much easier
or cheaper using any other DBMS, commercial or open source. For all of
them, you'll need big-iron hardware and a skilled team of admins to set
up and maintain the database.

Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Inheritence versus delete from

2005-03-01 Thread Sven Willenberger
On Tue, 2005-03-01 at 09:48 -0600, John Arbash Meinel wrote:
 Sven Willenberger wrote:
 
 Trying to determine the best overall approach for the following
 scenario:
 
 Each month our primary table accumulates some 30 million rows (which
 could very well hit 60+ million rows per month by year's end). Basically
 there will end up being a lot of historical data with little value
 beyond archival.
 
 The question then arises as to the best approach, of which I have
 enumerated three:
 
 
 I just thought of another possibility. You could create each table
 month-by-month, and then use a view to combine them, and possibly a rule
 to keep things clean.
 
 So you would do something like:
 
 I will assume you already have the data in one big table to show the
 easiest way to create the small tables.
 
 create table tblname-2005-01 as select * from orig_tbl where day >=
 '2005-01-01' and day < '2005-02-01';
 create table tblname-2005-02 as select * from orig_tbl where day >=
 '2005-02-01' and day < '2005-03-01';
 create table tblname-2005-03 as select * from orig_tbl where day >=
 '2005-03-01' and day < '2005-04-01';
 -- create appropriate indicies, rules, constraints on these tables
 
 Then you create a view which includes all of these tables.
 
 create or replace view tblname as
 select * from tblname-2005-01
 union all select * from tblname-2005-02
 union all select * from tblname-2005-03
 ;
 
 Then create insert and update rules which determine which table gets the
 new data.
 
 create rule up_tblname as on update to tblname do instead
 update tblname-2005-03 set
col1 = NEW.col1,
col2 = NEW.col2,
...
 where id = NEW.id;
 -- This assumes that you have a unique id on your tables. This is just
 whatever your
 -- primary key is, so it should be a decent assumption.
 
 create rule ins_tblname as on insert to tblname do instead
 insert into tblname-2005-03 (col1, col2, ...)
 values (new.col1, new.col2, ...);
 
 Now the downside of this method, is that every month you need to create
 a new table, and then update the views and the rules. The update rules
 are pretty straightforward, though.
 
 The nice thing is that it keeps your data partitioned, and you don't
 ever have a large select/delete step. You probably will want a small one
 each month to keep the data exactly aligned by month. You don't really
 have to have exact alignments, but as humans, we tend to like that stuff. :)
 
 Probably this is more overhead than you would like to do. Especially if
 you know that you can get away with method 2 (keep 1 big table, and just
 remove old rows out of it every month.)
 
 But this method means that all of your data stays live, but queries with
 appropriate restrictions should stay fast. You also have the ability
 (with v8.0) to move the individual tables onto separate disks.
 
 One more time, though, if you can get away with removing old data and
 just archiving it, do so. But if you want to keep the data live, there
 are a couple of alternatives.
 

Actually, that was the thought behind my using inheritance; when querying
the bigtable, it basically does a union all. Also, I think it would be
quicker to insert directly into the child table (simply by modifying my
query once a month) rather than incur the overhead of the rule.

Since the child tables are individual tables, all the benefits you
cite above still hold.
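A minimal sketch of that inheritance setup, for comparison with the
view/rule approach above (column names are assumptions):

-- one child table per month; the parent itself stays empty
CREATE TABLE bigtable_2005_03 () INHERITS (bigtable);
CREATE INDEX idx_bigtable_2005_03_day ON bigtable_2005_03 (targetdate);

-- the loader writes straight into the current child (switched once a month)
INSERT INTO bigtable_2005_03 (targetdate, col1, col2)
VALUES ('2005-03-15', 'a', 'b');

-- selecting from the parent scans the parent plus all children,
-- much like the UNION ALL view
SELECT count(*) FROM bigtable WHERE targetdate >= '2005-03-01';

-- retiring a month: dump the child for archiving, then simply DROP TABLE it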

Thanks for the input on this ... will have to try a couple things to see
which is most manageable.

Sven




Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread John Arbash Meinel
Markus Schaber wrote:
Hi, John,
John Arbash Meinel schrieb:

I am doing research for a project of mine where I need to store
several billion values for a monitoring and historical tracking system
for a big computer system. My current estimate is that I have to store
(somehow) around 1 billion values each month (possibly more).

If you have that 1 billion perfectly distributed over all hours of the
day, then you need 1e9/30/24/3600 = 385 transactions per second.

I hope that he does not use one transaction per inserted row.
In our in-house tests, we got a speedup factor of up to a few hundred
when bundling rows on insertion. The fastest runs used batches of a few
thousand rows per transaction, with about 5 processes running in
parallel.

You're right. I guess it just depends on how the data comes in and what
you can do at the client end. That is kind of why I was saying to put a
machine in front which gathers up the information and then does a batch
update. If your client can do this directly, then you have the same
advantage.

John
=:-




[PERFORM] Confusion about locales and 'like' indexes

2005-03-01 Thread Dan Harris
Greetings,
I have been beating myself up today trying to optimize indices for a
query that uses LIKE.  In my research I have read that the locale
setting may affect PostgreSQL's choice of seq scan vs. index scan.  I am
running Fedora Core 2, and when I run locale it appears to be set
to 'en_US.UTF-8'.

Did I fall into a gotcha trap here about C vs non-C locales?  I'm not 
much of a C programmer so I have no idea what all this touches and 
everything has been left as default during PG compilation as well as 
Fedora install.   I can pg_dump and initdb again with --locale=C if 
this will allow my LIKE queries to use indexes, but I just wanted to 
know if there was some other place I needed to change locales in the 
system? e.g. postgresql.conf or env vars?  Or, would the initdb and 
reload alone fix it?

I'm running 8.0.1 if that matters.
Thanks


Re: [PERFORM] Inheritence versus delete from

2005-03-01 Thread Gaetano Mendola
Sven Willenberger wrote:
 Trying to determine the best overall approach for the following
 scenario:
 
 Each month our primary table accumulates some 30 million rows (which
 could very well hit 60+ million rows per month by year's end). Basically
 there will end up being a lot of historical data with little value
 beyond archival.
 
 The question then arises as to the best approach, of which I have
 enumerated three:
 
 1) Just allow the records to accumulate and maintain constant vacuuming,
 etc allowing for the fact that most queries will only be from a recent
 subset of data and should be mostly cached.
 
 2) Each month:
 SELECT * INTO 3monthsago_dynamically_named_table FROM bigtable WHERE
 targetdate < $3monthsago;
 DELETE FROM bigtable where targetdate < $3monthsago;
 VACUUM ANALYZE bigtable;
 pg_dump 3monthsago_dynamically_named_table for archiving;


In my experience, copying and deleting 60+ million rows in a single
transaction is not feasible, at least not on my 1 GB RAM, 2-way CPU box.
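One way around that (a sketch, reusing the targetdate column from the
original post) is to archive and delete in smaller date-range chunks,
each in its own transaction:

-- create the archive table once, with the same column layout
-- (indexes and constraints are not copied; add them if needed)
CREATE TABLE archive_2004_11 AS SELECT * FROM bigtable WHERE 1 = 0;

-- then repeat per week (or per day) of the month being archived
BEGIN;
INSERT INTO archive_2004_11
    SELECT * FROM bigtable
    WHERE targetdate >= '2004-11-01' AND targetdate < '2004-11-08';
DELETE FROM bigtable
    WHERE targetdate >= '2004-11-01' AND targetdate < '2004-11-08';
COMMIT;

-- once all chunks are moved:
VACUUM ANALYZE bigtable;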



Regards
Gaetano Mendola




Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Josh Berkus
Ramon,

 What would be important issues when setting up a database this big, and
 is it at all doable? Or would it be a insane to think about storing up
 to 5-10 billion rows in a postgres database.

What's your budget?  You're not going to do this on a Dell 2650.  Do you
have the kind of budget necessary to purchase/build a good SAN, a
quad-Opteron machine, etc.?  Or at least to hire some tuning help?


-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] Confusion about locales and 'like' indexes

2005-03-01 Thread Tom Lane
Dan Harris [EMAIL PROTECTED] writes:
 query that uses LIKE.  In my research I have read that the locale 
 setting may affect PostgreSQL's choice of seq scan vs index scan.

Non-C-locale indexes can't support LIKE because the sort ordering
isn't necessarily right.

 I am running Fedora Core 2, and when I run locale it appears to be
 set to 'en_US.UTF-8'.

This is not a definitive indication of the environment the database
sees, though.  Try show lc_collate.

 I can pg_dump and initdb again with --locale=C if 
 this will allow my LIKE queries to use indexes, but I just wanted to 
 know if there was some other place I needed to change locales in the 
 system? e.g. postgresql.conf or env vars?  Or, would the initdb and 
 reload alone fix it?

That would do it.  Alternatively you can create special-purpose indexes
with one of the xxx_pattern_ops operator classes to support LIKE.
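For example (a sketch; the table and column names are assumed, and the
column is assumed to be of type text):

-- an index LIKE can use for left-anchored patterns even in a non-C locale
CREATE INDEX idx_customer_name_pattern ON customer (name text_pattern_ops);

-- prefix searches like this can then use the index:
SELECT * FROM customer WHERE name LIKE 'Smith%';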

regards, tom lane



Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Alan Stange
Ramon Bastiaans wrote:
I am doing research for a project of mine where I need to store 
several billion values for a monitoring and historical tracking system 
for a big computer system. My current estimate is that I have to store
(somehow) around 1 billion values each month (possibly more).

I was wondering if anyone has had any experience with this kind of
data volume in a PostgreSQL database and how it affects database design
and optimization.

What would be the important issues when setting up a database this big,
and is it at all doable? Or would it be insane to think about
storing up to 5-10 billion rows in a postgres database?

The database's performance is important. There would be no use in
storing the data if a query takes ages. Queries should be quite
fast if possible.

I would really like to hear people's thoughts and suggestions, or
"go see a shrink, you must be mad" statements ;)
It just dawned on me that we're doing something that, while not the
same, might be relevant.  One of our tables has ~85M rows in it
according to the output of an explain select * from table.  I don't
plan on trying a select count(*) any time soon :)  We add and remove
about 25M rows a day to/from this table, which would be about 750M
rows/month total.  Given our current usage of the database, it could
handle a larger row/day rate without too much trouble.  (The problem
isn't adding rows but deleting rows.)

    Column    |   Type   | Modifiers
--------------+----------+-----------
 timeseriesid | bigint   |
 bindata      | bytea    |
 binsize      | integer  |
 rateid       | smallint |
 ownerid      | smallint |
Indexes:
    idx_timeseries btree (timeseriesid)

In this case, each bytea entry is typically about 2 KB of data, so the
total table size is about 150 GB, plus some index overhead.

A second table has ~100M rows according to explain select *.  Again it 
has about 30M rows added and removed / day.  

   Column   |         Type          | Modifiers
------------+-----------------------+-----------
 uniqid     | bigint                |
 type       | character varying(50) |
 memberid   | bigint                |
 tag        | character varying(50) |
 membertype | character varying(50) |
 ownerid    | smallint              |
Indexes:
    composite_memberid btree (memberid)
    composite_uniqid btree (uniqid)
There are some additional tables that have a few million rows / day of 
activity, so call it 60M rows/day added and removed.  We run a vacuum 
every day.

The box is a dual Opteron 248 from Sun, running Linux 2.6 with 8 GB of
memory.  We use reiserfs.  We started with XFS but had several instances
of file system corruption.  Obviously, no RAID 5.  The xlog is on a
2-drive mirror and the rest is on a separate mirrored volume.  The
drives are fibre channel, but that was a mistake, as the driver from IBM
wasn't very good.

So, while we don't have a billion rows, we do have ~200M total rows
across all the tables, and we're certainly running at the daily row
count you'd need to sustain.  But scaling this sort of thing up can be
tricky, and your mileage may vary.

In a prior career I ran a data intensive computing center and helped 
do some design work for a high energy physics experiment:  petabytes of 
data, big tape robots, etc., the usual Big Science toys.   You might 
take a look at ROOT and some of the activity from those folks if you 
don't need transactions and all the features of a general database like 
postgresql.

-- Alan


Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Thomas Ganss
Ramon Bastiaans schrieb:
My current estimate is that I have to store (somehow)
around 1 billion values each month (possibly more).
You should post the actual number or power of ten,
since billion is not always interpreted the same way...
rgds
thomas


[PERFORM] What is the postgres sql command for last_user_id ???

2005-03-01 Thread amrit
I would like to know whether there is a command by which the server will give
the record ID back to the client when the client inserts data and the server
generates an auto-increment ID for that record.
For example, many clients insert payment data into the server, and each record
gets its own record ID through the server's auto-increment process [x+1]; I
don't want to lock the record, since that would slow the system down. Each
client then wants to know which ID the server gave to its record, in order to
select that record and print the receipt [bill].
I know that in MySQL there is a command, last_insert_id(), which does what I
describe above. Does anybody know the PostgreSQL equivalent? Please give me
the details.

Amrit,Thailand
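The usual PostgreSQL counterpart to MySQL's last_insert_id() is currval()
on the column's underlying sequence. A minimal sketch, assuming a serial
primary key id on a hypothetical bills table; currval() returns the value
generated by the current session's insert, so no locking is needed:

-- hypothetical table with an auto-increment primary key
CREATE TABLE bills (
    id     serial PRIMARY KEY,
    amount numeric(10,2)
);

INSERT INTO bills (amount) VALUES (199.50);

-- returns the id generated by this session's insert,
-- unaffected by concurrent inserts from other clients
SELECT currval('bills_id_seq');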





[PERFORM] Help with tuning this query

2005-03-01 Thread Ken Egervari



I've tried to use Dan Tow's tuning method and created all the right
indexes from his diagramming method, but the query still performs quite
slowly, both inside the application and inside pgAdmin III.  Can anyone
be kind enough to help me tune it so that it performs better in
postgres?  I don't think it's using the right indexes, or maybe postgres
needs special treatment.

I've converted the query below to SQL from a Hibernate query, so the
syntax is probably not perfect, but its semantics are exactly the same.
I've done so by looking at the source code, but I can't run it to get
the exact SQL, since I don't have the database on my home machine.

select s.*
from shipment s
    inner join carrier_code cc on s.carrier_code_id = cc.id
    inner join carrier c on cc.carrier_id = c.id
    inner join carrier_to_person ctp on ctp.carrier_id = c.id
    inner join person p on p.id = ctp.person_id
    inner join shipment_status cs on s.current_status_id = cs.id
    inner join release_code rc on cs.release_code_id = rc.id
    left join shipment_status ss on ss.shipment_id = s.id
where p.id = :personId
    and s.is_purged = false
    and rc.number = '9'
    and cs is not null
    and cs.date >= current_date - 31
order by cs.date desc
Just assume I have no indexes for the moment, because while some of the
indexes I made do make it faster, it's still around 250 milliseconds,
and under heavy load the query performs very badly (6-7 seconds).

For your information:

shipment contains 40,000 rows
shipment_status contains 80,000 rows
release_code contains 8 rows
person contains 300 rows
carrier contains 60 rows
carrier_code contains 70 rows

The filter ratios are:

rc.number = '9' (0.125)
cs.date >= current_date - 31 (0.10)
p.id = ? (0.003)
s.is_purged = false (0.98)
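For reference, indexes that line up with those filters and joins might
look like the sketch below (index names are assumed; whether the planner
actually uses them depends on the data and statistics):

-- join and filter columns taken from the query above
CREATE INDEX idx_shipment_carrier_code   ON shipment (carrier_code_id);
CREATE INDEX idx_shipment_current_status ON shipment (current_status_id);
CREATE INDEX idx_shipment_status_ship    ON shipment_status (shipment_id);
CREATE INDEX idx_shipment_status_date    ON shipment_status (date);
CREATE INDEX idx_ctp_person              ON carrier_to_person (person_id, carrier_id);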

I really hope someone can help since I'm pretty 
much stuck.

Best regards and many thanks,
Ken