On Thu, Mar 16, 2006 at 09:53:27PM +0100, Daniel Franke wrote:
>
> > That would be an excellent question to add to the FAQ:
> > "How do I estimate the resource requirements for a database?"
>
> I spent some time to create 3GB of sample data (just zeros, about half the
> size of the actual data set I have to deal with). I'm currently importing it
> into the database. As far as I already can tell: the main index will be
> approx 4GB in size. AFAIK, 32bit systems never grant more than 2GB RAM per
> process ...
>
> This may take a while, about 20 hours maybe. The partition has approx 10GB, I
> can't afford more. Let's hope that this is sufficient.
20 hours seems rather long. Even if you have to worry about uniqueness
constraints, there are ways to deal with it that should be much faster
(deal with the data in chunks, load into temp tables, check for dupes, etc.).

I've not used SQLite, so I can't really speak to its capabilities. I can tell
you that even 750M rows wouldn't be a huge deal for PostgreSQL, and 20G of
data is nothing, though your table would take somewhere around 30G due to the
higher per-row overhead in PostgreSQL; I'm not really sure how large the
indexes would be.

As for performance, I haven't seen a single mention of any kind of metrics
you'd like to hit, so it's impossible to guess whether SQLite, PostgreSQL, or
anything else would suffice. I can give you this metric, though: the main
table behind http://stats.distributed.net has 134M rows and I can do a
group-by count on it in 188 seconds (and that's after blowing out the cache).
This table is wider than yours:

             Table "public.email_contrib"
   Column   |  Type   | Modifiers
------------+---------+-----------
 project_id | integer | not null
 id         | integer | not null
 date       | date    | not null
 team_id    | integer |
 work_units | bigint  | not null

select project_id, count(*) from email_contrib group by project_id;
 project_id |  count
------------+----------
          5 | 56427141
        205 | 58512516
          3 |   336550
         25 |  6756695
          8 | 11975041
         24 |   626317

It takes up about 8G on disk.

I can also do index scans fairly fast.
http://cvs.distributed.net/viewcvs.cgi/stats-proc/daily/audit.sql?annotate=1.45
is an auditing script that (among other things) scans through every record in
email_contrib for a particular project_id while joining to another table (see
the code starting at line 170). It takes about 70 seconds to do this for
project_id 8 or 25.

All this is happening on a dual Opteron (242, I think) with the OS and
transaction logs on a two-drive SATA mirror and the data stored on a
six-drive SATA RAID 10. The machine has 4G of memory. This certainly isn't
what I'd consider to be 'big expensive hardware'.

As for partitioning, you might still have a win if you can identify some
common groupings and partition based on those. Even if you can't, you could
at least get a win on single-person queries.
-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
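
P.S. To make the chunked-load idea above a bit more concrete, here's a rough
sketch of what it could look like in PostgreSQL. The table and column names
are just made up for illustration; adjust them to your actual schema:

-- Hypothetical target table with a uniqueness constraint.
CREATE TABLE observations (
    person_id integer NOT NULL,
    marker_id integer NOT NULL,
    value     smallint,
    PRIMARY KEY (person_id, marker_id)
);

-- Load one chunk of the raw data into a temp table first...
CREATE TEMP TABLE observations_load (LIKE observations);
COPY observations_load FROM '/path/to/chunk0001.tab';

-- ...then insert only the rows that aren't already present, so the
-- unique constraint never aborts the bulk load. (This assumes a single
-- chunk has no internal duplicates; if it might, dedupe the temp table
-- first, e.g. with SELECT DISTINCT.)
INSERT INTO observations (person_id, marker_id, value)
SELECT l.person_id, l.marker_id, l.value
  FROM observations_load l
 WHERE NOT EXISTS (SELECT 1
                     FROM observations o
                    WHERE o.person_id = l.person_id
                      AND o.marker_id = l.marker_id);

TRUNCATE observations_load;  -- ready for the next chunk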
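
And since partitioning came up: in PostgreSQL (8.1 and up) you'd do that with
table inheritance, CHECK constraints and constraint_exclusion. A minimal
sketch, again with made-up names, assuming the "common grouping" is an
integer group_id:

-- Parent table; it holds no data itself.
CREATE TABLE samples (
    group_id  integer NOT NULL,
    person_id integer NOT NULL,
    value     real
);

-- One child table per grouping, each with a CHECK constraint so the
-- planner (with constraint_exclusion on) can skip partitions that a
-- query can't possibly match.
CREATE TABLE samples_g1 (CHECK (group_id = 1)) INHERITS (samples);
CREATE TABLE samples_g2 (CHECK (group_id = 2)) INHERITS (samples);

CREATE INDEX samples_g1_person ON samples_g1 (person_id);
CREATE INDEX samples_g2_person ON samples_g2 (person_id);

SET constraint_exclusion = on;

-- Only samples_g2 gets scanned:
SELECT * FROM samples WHERE group_id = 2 AND person_id = 42;

Even a query that only specifies person_id would still hit the smaller
per-partition indexes, which is where the win on single-person queries
comes from.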