[PERFORM] multi billion row tables: possible or insane?
Hi all,

I am doing research for a project of mine where I need to store several billion values for a monitoring and historical tracking system for a big computer system. My current estimate is that I have to store (somehow) around 1 billion values each month (possibly more).

I was wondering if anyone has had any experience with data volumes this large in a PostgreSQL database, and how this affects database design and optimization. What would be the important issues when setting up a database this big, and is it at all doable? Or would it be insane to think about storing up to 5-10 billion rows in a postgres database?

The database's performance is important. There would be no use in storing the data if a query takes ages. Queries should be quite fast if possible.

I would really like to hear people's thoughts/suggestions, or "go see a shrink, you must be mad" statements ;)

Kind regards,
Ramon Bastiaans
Re: [PERFORM] multi billion row tables: possible or insane?
On Mar 1, 2005, at 4:34 AM, Ramon Bastiaans wrote:

> What would be the important issues when setting up a database this big,
> and is it at all doable? Or would it be insane to think about storing
> up to 5-10 billion rows in a postgres database?

Buy a bunch of disks. And then go out and buy more disks. When you are done with that, go buy some more disks. Then buy some RAM. Then buy more disks. You want the fastest I/O possible.

I'd also recommend the Opteron route, since you can put heaping gobs of RAM in there as well.

> The database's performance is important. There would be no use in
> storing the data if a query takes ages. Queries should be quite fast
> if possible.

And make sure you tune your queries.

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/
Re: [PERFORM] multi billion row tables: possible or insane?
Hi, Ramon,

Ramon Bastiaans schrieb:

> The database's performance is important. There would be no use in
> storing the data if a query takes ages. Queries should be quite fast
> if possible.

What kind of queries do you want to run? Queries that involve only a few rows should stay quite fast when you set up the right indexes. However, queries that involve sequential scans over your table (like average computation) will take years. Get fast I/O for this.

Or, better, use a multidimensional data warehouse engine. Those can precalculate the needed aggregate functions and reports. But they need loads of storage (because of very redundant data storage), and I don't know of any open source or cheap software.

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com
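The precalculation idea can also be approximated inside PostgreSQL itself with a summary table that a periodic job fills. A minimal sketch, assuming a hypothetical measurements table (all names here are illustrative, not from the thread):

    -- Hypothetical raw table holding the monitoring values:
    CREATE TABLE measurements (
        sensor_id   integer          NOT NULL,
        measured_at timestamp        NOT NULL,
        value       double precision NOT NULL
    );

    -- Precalculated daily aggregates; reports query this small table
    -- instead of sequentially scanning billions of raw rows:
    CREATE TABLE measurements_daily (
        sensor_id integer NOT NULL,
        day       date    NOT NULL,
        avg_value double precision,
        min_value double precision,
        max_value double precision,
        n_values  bigint,
        PRIMARY KEY (sensor_id, day)
    );

    -- Run once per day (e.g. from cron) for the previous day:
    INSERT INTO measurements_daily
    SELECT sensor_id, measured_at::date,
           avg(value), min(value), max(value), count(*)
    FROM measurements
    WHERE measured_at >= current_date - 1
      AND measured_at <  current_date
    GROUP BY sensor_id, measured_at::date;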
Re: [PERFORM] multi billion row tables: possible or insane?
Hi, John,

John Arbash Meinel schrieb:

>> I am doing research for a project of mine where I need to store
>> several billion values for a monitoring and historical tracking
>> system for a big computer system. My current estimate is that I have
>> to store (somehow) around 1 billion values each month (possibly more).
>
> If you have that 1 billion perfectly distributed over all hours of the
> day, then you need 1e9/30/24/3600 = 385 transactions per second.

I hope that he does not use one transaction per inserted row. In our in-house tests, we got a speedup factor of up to a few hundred when bundling rows on insertion. The fastest speed was with bunches of a few thousand rows per transaction, running about 5 processes in parallel.

Mind the usual performance tips: use a small but fast-writing RAID for the transaction log (no RAID-5 or RAID-6 variants), possibly a mirror of two disk-backed SSDs. Use different disks for the actual data (here, LVM2 with growing volumes could be very handy). Have enough RAM. Use a fast file system.

BTW, as you read about the difficulties you'll face with this enormous amount of data: don't think that your task will be much easier or cheaper using any other DBMS, commercial or open source. For all of them, you'll need big-iron hardware and a skilled team of admins to set up and maintain the database.

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com
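To make the bundling concrete, a minimal sketch (reusing the hypothetical measurements table from the sketch above; nothing here is from Markus's actual test setup). Instead of letting each row autocommit, wrap thousands of rows in one transaction, or better yet bulk-load with COPY:

    -- One transaction per batch instead of one per row:
    BEGIN;
    INSERT INTO measurements (sensor_id, measured_at, value) VALUES (1, now(), 42.0);
    INSERT INTO measurements (sensor_id, measured_at, value) VALUES (2, now(), 17.5);
    -- ... a few thousand rows per transaction ...
    COMMIT;

    -- COPY is faster still for bulk loads (the file path is illustrative):
    COPY measurements (sensor_id, measured_at, value)
    FROM '/var/tmp/batch.csv' WITH CSV;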
Re: [PERFORM] Inheritance versus delete from
On Tue, 2005-03-01 at 09:48 -0600, John Arbash Meinel wrote:

> Sven Willenberger wrote:
>
>> Trying to determine the best overall approach for the following
>> scenario: Each month our primary table accumulates some 30 million
>> rows (which could very well hit 60+ million rows per month by year's
>> end). Basically there will end up being a lot of historical data with
>> little value beyond archival. The question arises then as to the best
>> approach, of which I have enumerated three:
>
> I just thought of another possibility. You could create each table
> month-by-month, and then use a view to combine them, and possibly a
> rule to keep things clean. So you would do something like the
> following (I will assume you already have the data in one big table,
> to show the easiest way to create the small tables):
>
> create table tblname_2005_01 as
>     select * from orig_tbl
>     where day >= '2005-01-01' and day < '2005-02-01';
> create table tblname_2005_02 as
>     select * from orig_tbl
>     where day >= '2005-02-01' and day < '2005-03-01';
> create table tblname_2005_03 as
>     select * from orig_tbl
>     where day >= '2005-03-01' and day < '2005-04-01';
> -- create appropriate indexes, rules, constraints on these tables
>
> Then you create a view which includes all of these tables:
>
> create or replace view tblname as
>     select * from tblname_2005_01
>     union all select * from tblname_2005_02
>     union all select * from tblname_2005_03;
>
> Then create insert and update rules which fix which table gets the
> new data:
>
> create rule up_tblname as on update to tblname do instead
>     update tblname_2005_03
>     set col1 = NEW.col1, col2 = NEW.col2, ...
>     where id = NEW.id;
> -- This assumes that you have a unique id on your tables. This is
> -- just whatever your primary key is, so it should be a decent
> -- assumption.
>
> create rule ins_tblname as on insert to tblname do instead
>     insert into tblname_2005_03 (col1, col2, ...)
>     values (NEW.col1, NEW.col2, ...);
>
> Now the downside of this method is that every month you need to create
> a new table, and then update the view and the rules. The update rules
> are pretty straightforward, though.
>
> The nice thing is that it keeps your data partitioned, and you don't
> ever have a large select/delete step. You probably will want a small
> one each month to keep the data exactly aligned by month. You don't
> really have to have exact alignments, but as humans, we tend to like
> that stuff. :)
>
> Probably this is more overhead than you would like to do, especially
> if you know that you can get away with method 2 (keep 1 big table, and
> just remove old rows out of it every month). But this method means
> that all of your data stays live, while queries with appropriate
> restrictions should stay fast. You also have the ability (with v8.0)
> to move the individual tables onto separate disks.
>
> One more time, though: if you can get away with removing old data and
> just archiving it, do so. But if you want to keep the data live, there
> are a couple of alternatives.

Actually, that was the thought behind my using inheritance: when querying the bigtable, it basically does a union all; also, I think it would be quicker to insert directly into the child table (simply by modifying my query once a month) rather than sustain the overhead of the rule. Since the child tables are individual tables, all the benefits you cite above still hold.

Thanks for the input on this ... will have to try a couple of things to see which is most manageable.

Sven
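For reference, the inheritance approach Sven mentions looks roughly like this (a sketch with assumed column names; in PostgreSQL, scanning the parent table automatically includes all child tables, which gives the same effect as the UNION ALL view without needing rules):

    -- Parent table holds no rows itself; the children carry the data:
    CREATE TABLE bigtable (
        id         bigint NOT NULL,
        targetdate date   NOT NULL,
        payload    text
    );

    -- One child per month; the CHECK constraint documents the range
    -- (newer PostgreSQL versions can also use it to skip children):
    CREATE TABLE bigtable_2005_03 (
        CHECK (targetdate >= '2005-03-01' AND targetdate < '2005-04-01')
    ) INHERITS (bigtable);

    -- Insert directly into the current child; the application query is
    -- pointed at the new child once a month:
    INSERT INTO bigtable_2005_03 (id, targetdate, payload)
    VALUES (1, '2005-03-15', 'sample');

    -- Queries against the parent see all children:
    SELECT count(*) FROM bigtable WHERE targetdate >= '2005-03-01';

    -- Archiving a month is then a pg_dump of that child followed by a
    -- simple DROP TABLE, with no mass DELETE from a big table.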
Re: [PERFORM] multi billion row tables: possible or insane?
Markus Schaber wrote:

> Hi, John,
>
> John Arbash Meinel schrieb:
>
>>> I am doing research for a project of mine where I need to store
>>> several billion values for a monitoring and historical tracking
>>> system for a big computer system. My current estimate is that I have
>>> to store (somehow) around 1 billion values each month (possibly more).
>>
>> If you have that 1 billion perfectly distributed over all hours of the
>> day, then you need 1e9/30/24/3600 = 385 transactions per second.
>
> I hope that he does not use one transaction per inserted row. In our
> in-house tests, we got a speedup factor of up to a few hundred when
> bundling rows on insertion. The fastest speed was with bunches of a few
> thousand rows per transaction, running about 5 processes in parallel.

You're right. I guess it just depends on how the data comes in, and what you can do at the client end. That is kind of where I was saying put a machine in front which gathers up the information and then does a batch update. If your client can do this directly, then you have the same advantage.

John =:->
[PERFORM] Confusion about locales and 'like' indexes
Greetings,

I have been beating myself up today trying to optimize indexes for a query that uses LIKE. In my research I have read that the locale setting may affect PostgreSQL's choice of seq scan vs index scan. I am running Fedora Core 2, and when I run locale it appears to be set to 'en_US.UTF-8'. Did I fall into a gotcha trap here about C vs non-C locales? I'm not much of a C programmer, so I have no idea what all this touches, and everything has been left as default during PG compilation as well as the Fedora install.

I can pg_dump and initdb again with --locale=C if this will allow my LIKE queries to use indexes, but I just wanted to know if there was some other place I needed to change locales in the system, e.g. postgresql.conf or env vars? Or would the initdb and reload alone fix it?

I'm running 8.0.1 if that matters.

Thanks
Re: [PERFORM] Inheritance versus delete from
Sven Willenberger wrote:

> Trying to determine the best overall approach for the following
> scenario: Each month our primary table accumulates some 30 million rows
> (which could very well hit 60+ million rows per month by year's end).
> Basically there will end up being a lot of historical data with little
> value beyond archival. The question arises then as to the best approach,
> of which I have enumerated three:
>
> 1) Just allow the records to accumulate and maintain constant vacuuming,
> etc., allowing for the fact that most queries will only be from a recent
> subset of data and should be mostly cached.
>
> 2) Each month:
>     SELECT * INTO 3monthsago_dynamically_named_table FROM bigtable
>         WHERE targetdate < $3monthsago;
>     DELETE FROM bigtable WHERE targetdate < $3monthsago;
>     VACUUM ANALYZE bigtable;
>     pg_dump 3monthsago_dynamically_named_table for archiving;

In my experience, copying and deleting 60+ million rows in a single transaction is not feasible, at least on my 1 GB RAM, 2-way CPU box.

Regards
Gaetano Mendola
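One common workaround (my sketch, not Gaetano's suggestion) is to keep each transaction to a bounded slice: archive the month in one statement, but delete it day by day so no single transaction touches tens of millions of rows. Table and date names are illustrative:

    -- Archive one month into its own table:
    SELECT * INTO archive_2004_12 FROM bigtable
    WHERE targetdate >= '2004-12-01' AND targetdate < '2005-01-01';

    -- Delete in day-sized slices, one transaction each (driven by a script):
    DELETE FROM bigtable
    WHERE targetdate >= '2004-12-01' AND targetdate < '2004-12-02';
    -- ... repeat for each day of the month ...

    -- Reclaim space and refresh planner statistics afterwards:
    VACUUM ANALYZE bigtable;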
Re: [PERFORM] multi billion row tables: possible or insane?
Ramon,

> What would be the important issues when setting up a database this big,
> and is it at all doable? Or would it be insane to think about storing
> up to 5-10 billion rows in a postgres database?

What's your budget? You're not going to do this on a Dell 2650. Do you have the kind of budget necessary to purchase/build a good SAN, a quad-Opteron machine, etc.? Or at least to hire some tuning help?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] Confusion about locales and 'like' indexes
Dan Harris [EMAIL PROTECTED] writes:

> ... a query that uses LIKE. In my research I have read that the locale
> setting may affect PostgreSQL's choice of seq scan vs index scan.

Non-C-locale indexes can't support LIKE because the sort ordering isn't necessarily right.

> I am running Fedora Core 2, and when I run locale it appears to be set
> to 'en_US.UTF-8'.

This is not a definitive indication of the environment the database sees, though. Try "show lc_collate".

> I can pg_dump and initdb again with --locale=C if this will allow my
> LIKE queries to use indexes, but I just wanted to know if there was
> some other place I needed to change locales in the system, e.g.
> postgresql.conf or env vars? Or would the initdb and reload alone fix it?

That would do it. Alternatively, you can create special-purpose indexes with one of the xxx_pattern_ops operator classes to support LIKE.

regards, tom lane
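Concretely, the special-purpose index Tom mentions looks like this (the table and column names are assumed for illustration):

    -- Check which collation the server actually uses:
    SHOW lc_collate;

    -- In a non-C locale, a plain btree index cannot serve left-anchored
    -- LIKE patterns, but one built with a pattern operator class can:
    CREATE INDEX idx_customers_name_pattern
        ON customers (name text_pattern_ops);

    -- The planner can now use the index for prefix searches:
    SELECT * FROM customers WHERE name LIKE 'Smith%';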
Re: [PERFORM] multi billion row tables: possible or insane?
Ramon Bastiaans wrote:

> I am doing research for a project of mine where I need to store several
> billion values for a monitoring and historical tracking system for a
> big computer system. My current estimate is that I have to store
> (somehow) around 1 billion values each month (possibly more).
> [...]
> The database's performance is important. There would be no use in
> storing the data if a query takes ages. Queries should be quite fast
> if possible.

It just dawned on me that we're doing something that, while not the same, might be relevant. One of our tables has ~85M rows in it, according to the output from an "explain select * from table". I don't plan on trying a "select count(*)" any time soon :) We add and remove about 25M rows a day to/from this table, which would be about 750M rows/month total. Given our current usage of the database, it could handle a larger row/day rate without too much trouble. (The problem isn't adding rows but deleting rows.)

    Column    |   Type   | Modifiers
 -------------+----------+-----------
 timeseriesid | bigint   |
 bindata      | bytea    |
 binsize      | integer  |
 rateid       | smallint |
 ownerid      | smallint |
 Indexes: idx_timeseries btree (timeseriesid)

In this case, each bytea entry is typically about 2KB of data, so the total table size is about 150GB, plus some index overhead.

A second table has ~100M rows according to "explain select *". Again, it has about 30M rows added and removed per day.

   Column   |         Type          | Modifiers
 -----------+-----------------------+-----------
 uniqid     | bigint                |
 type       | character varying(50) |
 memberid   | bigint                |
 tag        | character varying(50) |
 membertype | character varying(50) |
 ownerid    | smallint              |
 Indexes: composite_memberid btree (memberid)
          composite_uniqid btree (uniqid)

There are some additional tables that have a few million rows/day of activity, so call it 60M rows/day added and removed. We run a vacuum every day.

The box is a dual Opteron 248 from Sun: Linux 2.6, 8GB of memory. We use reiserfs. We started with XFS but had several instances of file system corruption. Obviously, no RAID 5. The xlog is on a two-drive mirror, and the rest is on a separate mirrored volume. The drives are fibre channel, but that was a mistake, as the driver from IBM wasn't very good.

So, while we don't have a billion rows, we do have ~200M total rows in all the tables, and we're certainly running the daily row count that you'd need to sustain. But scaling this sort of thing up can be tricky, and your mileage may vary.

In a prior career I ran a data-intensive computing center and helped do some design work for a high energy physics experiment: petabytes of data, big tape robots, etc., the usual Big Science toys. You might take a look at ROOT and some of the activity from those folks if you don't need transactions and all the features of a general database like PostgreSQL.

-- Alan
Re: [PERFORM] multi billion row tables: possible or insane?
Ramon Bastiaans schrieb:

> My current estimate is that I have to store (somehow) around 1 billion
> values each month (possibly more).

You should post the actual number or power of ten, since "billion" is not always interpreted the same way (10^9 in American usage, but traditionally 10^12 in much of Europe)...

rgds
thomas
[PERFORM] What is the postgres sql command for last_user_id ???
I would like to know whether there is any command by which the server will give the record ID back to the client when the client inserts data and the server generates an autoincrement ID for that record. For example, many clients may put money data to the server, and each record from each client gets its own record ID via the server's autoincrement process [x+1]; I don't want to lock the record, since that would slow the system down. Each client will then want to know which ID the server gave to its record, in order to select that record to print the receipt [bill]. I know that in MySQL there is a command, last_insert_id(), which acts the same as I describe above. Does anybody know the details?

Amrit, Thailand
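For reference, the PostgreSQL equivalent is currval() on the sequence behind a serial column. It is session-local, so concurrent clients each see only their own last-generated value and no locking is needed. A minimal sketch (the table and sequence names are assumptions for illustration):

    -- A serial column creates an implicit sequence named payments_id_seq:
    CREATE TABLE payments (
        id     serial        PRIMARY KEY,
        amount numeric(12,2) NOT NULL
    );

    INSERT INTO payments (amount) VALUES (199.50);

    -- Returns the id generated by THIS session's last nextval(), so
    -- concurrent inserts from other clients cannot interfere:
    SELECT currval('payments_id_seq');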
[PERFORM] Help with tuning this query
I've tried to use Dan Tow's tuning method and created all the right indexes from his diagramming method, but the query still performs quite slowly, both inside the application and inside pgAdmin III. Can anyone be kind enough to help me tune it so that it performs better in postgres? I don't think it's using the right indexes, or maybe postgres needs special treatment.

I've converted the query below to SQL from a Hibernate query, so the syntax is probably not perfect, but its semantics are exactly the same. I've done so by looking at the source code, but I can't run it to get the exact SQL since I don't have the database on my home machine.

select s.*
from shipment s
    inner join carrier_code cc on s.carrier_code_id = cc.id
    inner join carrier c on cc.carrier_id = c.id
    inner join carrier_to_person ctp on ctp.carrier_id = c.id
    inner join person p on p.id = ctp.person_id
    inner join shipment_status cs on s.current_status_id = cs.id
    inner join release_code rc on cs.release_code_id = rc.id
    left join shipment_status ss on ss.shipment_id = s.id
where
    p.id = :personId
    and s.is_purged = false
    and rc.number = '9'
    and cs is not null
    and cs.date >= current_date - 31
order by cs.date desc

Just assume I have no indexes for the moment, because while some of the indexes I made make it work faster, it's still around 250 milliseconds, and under heavy load the query performs very badly (6-7 seconds).

For your information:

shipment contains 40,000 rows
shipment_status contains 80,000 rows
release_code contains 8 rows
person contains 300 rows
carrier contains 60 rows
carrier_code contains 70 rows

The filter ratios are:

rc.number = '9' (0.125)
cs.date >= current_date - 31 (0.10)
p.id = :personId (0.003)
s.is_purged = false (0.98)

I really hope someone can help since I'm pretty much stuck.

Best regards and many thanks,
Ken
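For what it's worth, a sketch of indexes that match the filters and joins above (built purely from the column names in the posted query, not a verified tuning of Ken's actual schema):

    -- Join columns used by the query:
    CREATE INDEX idx_ss_shipment_id  ON shipment_status (shipment_id);
    CREATE INDEX idx_ctp_person      ON carrier_to_person (person_id, carrier_id);
    CREATE INDEX idx_s_carrier_code  ON shipment (carrier_code_id);

    -- Composite index for the selective status filters: the equality
    -- column first, the range column second:
    CREATE INDEX idx_cs_release_date ON shipment_status (release_code_id, date);

    -- Then verify what the planner actually picks:
    EXPLAIN ANALYZE select s.* from shipment s ...;  -- the query above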