On Thu, Mar 16, 2006 at 09:53:27PM +0100, Daniel Franke wrote:
> 
> > That would be an excellent question to add to the FAQ:
> > "How do I estimate the resource requirements for a database?"
> 
> I spent some time creating 3GB of sample data (just zeros, about half the
> size of the actual data set I have to deal with). I'm currently importing it
> into the database. As far as I can tell already, the main index will be
> approx 4GB in size. AFAIK, 32-bit systems never grant more than 2GB of RAM
> per process ...
> 
> This may take a while, maybe 20 hours. The partition has approx 10GB; I
> can't afford more. Let's hope that's sufficient.

20 hours seems rather long. Even if you have to worry about uniqueness
constraints, there are ways to handle that which should be much faster
(deal with the data in chunks, load it into temp tables, check for
duplicates, etc).
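
Something along these lines is what I mean. This is just a sketch, since I
don't know your schema; "main_table" and "id" are placeholders for your
actual table and unique key:

-- Load one chunk into an unindexed staging table, then move over only
-- the rows that aren't already in the real table. "main_table" and "id"
-- are placeholders, not your actual names.
CREATE TEMP TABLE staging (LIKE main_table);

COPY staging FROM '/tmp/chunk_0001.csv' WITH CSV;

INSERT INTO main_table
SELECT s.*
  FROM staging s
  LEFT JOIN main_table m ON m.id = s.id
 WHERE m.id IS NULL;

TRUNCATE staging;  -- ready for the next chunk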

I've not used SQLite, so I can't really speak to its capabilities. I can
tell you that even 750M rows wouldn't be a huge deal for PostgreSQL, and
20G of data is nothing. Your table would take somewhere around 30G,
though, due to the higher per-row overhead in PostgreSQL; I'm not really
sure how large the indexes would be.

As for performance, I haven't seen a single mention of any kind of
metrics you'd like to hit, so it's impossible to guess whether SQLite,
PostgreSQL, or anything else would suffice. I can give you this metric,
though: the main table behind http://stats.distributed.net has 134M rows,
and I can do a group-by count on it in 188 seconds (and that's after
blowing out the cache). This table is wider than yours:

   Table "public.email_contrib"
   Column   |  Type   | Modifiers 
------------+---------+-----------
 project_id | integer | not null
 id         | integer | not null
 date       | date    | not null
 team_id    | integer | 
 work_units | bigint  | not null

select project_id, count(*) from email_contrib group by project_id;
 project_id |  count
------------+----------
          5 | 56427141
        205 | 58512516
          3 |   336550
         25 |  6756695
          8 | 11975041
         24 |   626317

It takes up about 8G on disk.
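
If you want to check sizes like that yourself, PostgreSQL 8.1 has size
functions that make it easy. A quick sketch, using the table above:

-- Heap size alone, and heap plus indexes (PostgreSQL 8.1+).
SELECT pg_size_pretty(pg_relation_size('email_contrib'))       AS table_only,
       pg_size_pretty(pg_total_relation_size('email_contrib')) AS with_indexes;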

I can also do index scans fairly fast.
http://cvs.distributed.net/viewcvs.cgi/stats-proc/daily/audit.sql?annotate=1.45
is an auditing script that (among other things) scans through every
record in email_contrib for a particular project_id while joining to
another table (see the code starting at line 170). It takes about 70
seconds to do this for project_id 8 or 25. All of this is happening on a
dual Opteron (242, I think) with the OS and transaction logs on a
two-drive SATA mirror and the data on a six-drive SATA RAID 10. The
machine has 4G of memory. This certainly isn't what I'd consider
'big expensive hardware'.
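
The shape of that scan is roughly the following. This is a paraphrase,
not the actual code from audit.sql; the joined table ("participant") and
the aggregate are just placeholders:

-- Scan every email_contrib row for one project and join each row to a
-- second table. "participant" and its columns are placeholders, not the
-- real audit.sql code.
SELECT e.id, p.team_id, sum(e.work_units) AS work_units
  FROM email_contrib e
  JOIN participant p ON p.id = e.id
 WHERE e.project_id = 8
 GROUP BY e.id, p.team_id;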

As for partitioning, you might still have a win if you can identify some
common groupings and partition based on those. Even if you can't, you
could at least get a win on single-person queries.
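
In PostgreSQL terms that would mean inheritance-based partitioning with
constraint exclusion (8.1 and up). A minimal sketch, with the parent
table, the grouping column and the person column all made up:

-- One child table per grouping; the CHECK constraint is what lets the
-- planner skip the other children. "main_table", "group_id" and
-- "person_id" are placeholders.
CREATE TABLE main_table_grp1 (CHECK (group_id = 1)) INHERITS (main_table);
CREATE TABLE main_table_grp2 (CHECK (group_id = 2)) INHERITS (main_table);

SET constraint_exclusion = on;

-- With constraint exclusion on, this only touches main_table_grp1.
SELECT * FROM main_table WHERE group_id = 1 AND person_id = 12345;
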
-- 
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
