Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
You should search the archives for Luke Lonergan's posting about how I/O in PostgreSQL is significantly bottlenecked because it's not async. A 12-disk array is going to max out PostgreSQL's theoretical write capacity to disk, and therefore BigRDBMS is always going to win in such a config. You can also look toward Bizgres, which allegedly eliminates some of these problems and is cheaper than most BigRDBMS products. Alex.

On 12/28/06, Guy Rouillier [EMAIL PROTECTED] wrote: I don't want to violate any license agreement by discussing performance, so I'll refer to a large, commercial PostgreSQL-compatible DBMS only as BigDBMS here. I'm trying to convince my employer to replace BigDBMS with PostgreSQL for at least some of our Java applications. As a proof of concept, I started with a high-volume (but conceptually simple) network data collection application. This application collects files of 5-minute usage statistics from our network devices, and stores a raw form of these stats in one table and a normalized form in a second table. We are currently storing about 12 million rows a day in the normalized table, and each month we start new tables. For the normalized data, the app inserts rows initialized to zero for the entire current day first thing in the morning, then executes updates against existing rows throughout the day as stats are received. So the app has very high update activity.

In my test environment, I have a dual-x86 Linux platform running the application, and an old 4-CPU Sun Enterprise 4500 running BigDBMS and PostgreSQL 8.2.0 (only one at a time). The Sun box has 4 disk arrays attached, each with 12 SCSI hard disks (a D1000 and 3 A1000s, for those familiar with these devices). The arrays are set up with RAID 5. So I'm working with a consistent hardware platform for this comparison. I'm only processing a small subset of files (144). BigDBMS processed this set of data in 2 seconds, with all foreign keys in place.
With all foreign keys in place, PG took 54000 seconds to complete the same job. I've tried various approaches to autovacuum (none, 30 seconds) and it doesn't seem to make much difference. What does seem to make a difference is eliminating all the foreign keys; in that configuration, PG takes about 3 seconds. Better, but BigDBMS still has it beat significantly.

I've got PG configured so that the system database is on disk array 2, as are the transaction log files. The default tablespace for the test database is disk array 3. I've got all the reference tables (the tables to which the foreign keys in the stats tables refer) on this array. I also store the stats tables on this array. Finally, I put the indexes for the stats tables on disk array 4. I don't use disk array 1 because I believe it is a software array.

I'm out of ideas on how to improve this picture any further. I'd appreciate some suggestions. Thanks. -- Guy Rouillier ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
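As a rough sense of scale, the sustained update rate the workload above implies can be sketched with a quick calculation (Python used purely as a calculator; the 12 million rows/day figure is from the post, and the assumption that updates are spread evenly across the day is mine):

```python
# Average update rate implied by 12 million normalized rows per day,
# assuming (my assumption) the updates are spread evenly across the day.
rows_per_day = 12_000_000
seconds_per_day = 24 * 60 * 60
avg_updates_per_second = rows_per_day / seconds_per_day
print(round(avg_updates_per_second))  # ~139 updates/second on average
```

Real bursts will be much higher than the average, which is why the foreign-key checks on every update matter so much here.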
Re: [PERFORM] Bad iostat numbers
The problem I see with software RAID is the issue of a battery-backed unit: if the computer loses power, then the 'cache', which is held in system memory, goes away and fubars your RAID. Alex

On 12/5/06, Michael Stone [EMAIL PROTECTED] wrote: On Tue, Dec 05, 2006 at 01:21:38AM -0500, Alex Turner wrote: My other and most important point is that I can't find any solid recommendations for a SCSI card that can perform optimally in Linux or *BSD. Off by a factor of 3x is pretty sad IMHO. (And yes, we know the Adaptec cards suck worse; that doesn't bring us to a _good_ card.) This gets back to my point about terminology. As a SCSI HBA the Adaptec is decent: I can sustain about 300MB/s off a single channel of the 39320A using an external RAID controller. As a RAID controller I can't even imagine using the Adaptec; I'm fairly certain they put that functionality on there just so they could charge more for the card. It may be that there's not much market for on-board SCSI RAID controllers; between SATA on the low end and SAS/FC on the high end, there isn't a whole lotta space left for SCSI. I definitely don't think much R&D is going into SCSI controllers any more, compared to other solutions like SATA or SAS RAID (the 39320 hasn't changed in at least 3 years, IIRC). Anyway, since the Adaptec part is a decent SCSI controller and a lousy RAID controller, have you tried just using software RAID? Mike Stone
Re: [PERFORM] Hardware advice
The test that I did - which was somewhat limited - showed no benefit from splitting disks into separate partitions for large bulk loads. The program read from one very large file and wrote the input out to two other large files. The total throughput on a single partition was close to the maximum theoretical for that logical drive, even though the process was reading and writing to three separate places on the disk. I don't know what this means for PostgreSQL setups directly, but I would postulate that the benefit from splitting pg_xlog onto a separate spindle is not as great as it might once have been for large bulk transactions. I am therefore going to go with a single 6-drive RAID 5 for my data warehouse application because I want the read speed to be available. I can benefit from fast reads when I want to do large data scans, at the expense of slightly slower insert speed. Alex.

On 12/5/06, Alexandru Coseru [EMAIL PROTECTED] wrote: Hello.. Thanks for the advice.. Actually, I'm waiting for the Clovertown to show up on the market... Regards Alex - Original Message - From: Sven Geisler [EMAIL PROTECTED] To: Alexandru Coseru [EMAIL PROTECTED] Cc: pgsql-performance@postgresql.org Sent: Tuesday, December 05, 2006 11:57 AM Subject: Re: [PERFORM] Hardware advice

Hi Alexandru, Alexandru Coseru schrieb: [...] Question 1: The RAID layout should be: a) 2 hdd in RAID 1 for system and pg_xlog and 6 hdd in RAID 10 for data? b) 8 hdd in RAID 10 for all? c) 2 hdd in RAID 1 for system, 2 hdd in RAID 1 for pg_xlog, 4 hdd in RAID 10 for data? Obs: I'm going for setup a), but I want to hear your thoughts as well. This depends on your data size. I think options a and c are good. The potential bottleneck may be the RAID 1 for pg_xlog if you have a huge amount of updates and inserts. What about another setup: 4 hdd in RAID 10 for system and pg_xlog - system partitions are normally not in heavy use, and pg_xlog should be fast for writing; 4 hdd in RAID 10 for data.
Question 2: (Don't want to start a flame here, but here it goes.) What filesystem should I run for data? ext3 or xfs? The tables have ~15,000 rel_pages each. The biggest table now has over 30,000 pages. We have a database running with 60,000+ tables. The table sizes are between a few kBytes for the small tables and up to 30 GB for the largest one. We had no issues with ext3 in the past.

Question 3: The block size in PostgreSQL is 8 kB. The stripe size in the RAID controller is 64 kB. Should I increase the PostgreSQL block size to 16 or 32 or even 64 kB? You should keep in mind that the filesystem also has a block size. Ext3 has a maximum of 4 kB. I would set up the partitions aligned to the stripe size to prevent unaligned reads. I guess you can imagine that a larger PostgreSQL block size may also end up in unaligned reads, because the filesystem has a smaller block size.

RAID volume and filesystem setup: 1. Make all partitions aligned to the RAID stripe size. The first partition should start at 128 kByte. You can do this with fdisk: after you create the partition, switch to expert mode (type x) and modify the beginning of the partition (type b). You should change this value to 128 (default is 63). All other partitions should also start on a multiple of 128 kByte. 2. Give the filesystem a hint that you work with larger block sizes. Ext3: mke2fs -b 4096 -j -R stride=2 /dev/sda1 -L LABEL. I made an I/O test with PostgreSQL on a RAID system with a stripe size of 64 kByte and a block size of 8 kByte in the RAID system. stride=2 was the best value.

PS: You should have a second XEON in your budget plan. Sven. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
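The alignment rule above can be sanity-checked with a few lines (a sketch; the 64 kB stripe size, the suggested 128 kByte partition start, and fdisk's default start at sector 63 are all taken from Sven's description):

```python
# Check whether a partition's byte offset is a multiple of the RAID
# stripe size, per the partitioning advice above.
STRIPE_BYTES = 64 * 1024  # 64 kB stripe size on the RAID controller

def is_stripe_aligned(offset_bytes: int) -> bool:
    return offset_bytes % STRIPE_BYTES == 0

print(is_stripe_aligned(128 * 1024))  # True: the suggested 128 kByte start
print(is_stripe_aligned(63 * 512))    # False: fdisk's default start (sector 63)
```

With the default start at sector 63 (31.5 kB), every stripe-sized read straddles two stripes; starting at a 128 kByte boundary avoids that.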
Re: [PERFORM] Bad iostat numbers
The RAID 10 was in there merely for filling in, not really as a comparison; indeed it would be ludicrous to compare a RAID 1 to a 6-drive RAID 10!! How do I find out if it has version 2 of the driver? This discussion I think is important, as I think it would be useful for this list to have a list of RAID cards that _do_ work well under Linux/BSD as recommended hardware for PostgreSQL. So far, all I can recommend is what I've found to be good, which is 3ware 9500 series cards with 10k SATA drives. Throughput was great until you reached higher levels of RAID 10 (the bonnie++ mark I posted showed write speed is a bit slow). But that doesn't solve the problem for SCSI. What cards in the SCSI arena solve the problem optimally? Why should we settle for sub-optimal performance in SCSI when there are a number of almost optimally performing cards in the SATA world (Areca, 3ware/AMCC, LSI)? Thanks, Alex

On 12/4/06, Scott Marlowe [EMAIL PROTECTED] wrote: On Mon, 2006-12-04 at 01:17, Alex Turner wrote: People recommend LSI MegaRAID controllers on here regularly, but I have found that they do not work that well. I have bonnie++ numbers that show the controller is not performing anywhere near the disks' saturation level in a simple RAID 1 on Red Hat Linux EL4, on two separate machines provided by two different hosting companies. In one case I asked them to replace the card, and the numbers got a bit better, but still not optimal. LSI MegaRAID has proved to be a bit of a disappointment. I have seen better numbers from the HP SmartArray 6i, and from 3ware cards with 7200RPM SATA drives. For the output: http://www.infoconinc.com/test/bonnie++.html (the first line is a six-drive RAID 10 on a 3ware 9500S; the next three are all RAID 1s on LSI MegaRAID controllers, verified by lspci). Wait, you're comparing a MegaRAID running a RAID 1 against another controller running a 6-disk RAID 10? That's hardly fair.
My experience with the LSI was that with the 1.18 series drivers, they were slow but stable. With the version 2.x drivers, I found that the performance was very good with RAID-5 and fair with RAID-1 and that layered RAID was not any better than unlayered (i.e. layering RAID0 over RAID1 resulted in basic RAID-1 performance). OTOH, with the choice at my last place of employment being LSI or Adaptec, LSI was a much better choice. :) I'd ask which LSI megaraid you've tested, and what driver was used. Does RHEL4 have the megaraid 2 driver?
Re: [PERFORM] Bad iostat numbers
http://en.wikipedia.org/wiki/RAID_controller Alex On 12/4/06, Michael Stone [EMAIL PROTECTED] wrote: On Mon, Dec 04, 2006 at 12:37:29PM -0500, Alex Turner wrote: This discussion I think is important, as I think it would be useful for this list to have a list of RAID cards that _do_ work well under Linux/BSD for people as recommended hardware for Postgresql. So far, all I can recommend is what I've found to be good, which is 3ware 9500 series cards with 10k SATA drives. Throughput was great until you reached higher levels of RAID 10 (the bonnie++ mark I posted showed write speed is a bit slow). But that doesn't solve the problem for SCSI. What cards in the SCSI arena solve the problem optimally? Why should we settle for sub-optimal performance in SCSI when there are a number of almost optimally performing cards in the SATA world (Areca, 3Ware/AMCC, LSI). Well, one factor is to be more precise about what you're looking for; a HBA != RAID controller, and you may be comparing apples and oranges. (If you have an external array with an onboard controller you probably want a simple HBA rather than a RAID controller.) Mike Stone ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Bad iostat numbers
I agree that the MegaRAID is very stable, and it's very appealing from that perspective. And two years ago I would have never even mentioned cciss-based cards on this list, because they sucked wind big time, but I believe some people have started seeing better numbers from the 6i. 20MB/sec write, when the number should be closer to 60 - that's off by a factor of 3. For my data warehouse application, that's a big difference, and if I can get a better number from 7200RPM drives and a good SATA controller, I'm gonna do that, because my data isn't OLTP and I don't care if the whole system shits itself and I have to restore from backup one day. My other and most important point is that I can't find any solid recommendations for a SCSI card that can perform optimally in Linux or *BSD. Off by a factor of 3x is pretty sad IMHO. (And yes, we know the Adaptec cards suck worse; that doesn't bring us to a _good_ card.) Alex.

On 12/4/06, Greg Smith [EMAIL PROTECTED] wrote: On Mon, 4 Dec 2006, Alex Turner wrote: People recommend LSI MegaRAID controllers on here regularly, but I have found that they do not work that well. I have bonnie++ numbers that show the controller is not performing anywhere near the disks' saturation level in a simple RAID 1 on Red Hat Linux EL4, on two separate machines provided by two different hosting companies. http://www.infoconinc.com/test/bonnie++.html I don't know what's going on with your www-september-06 machine, but the other two are giving 32-40MB/s writes and 53-68MB/s reads. For a RAID-1 volume, these aren't awful numbers, but I agree they're not great. My results are no better. For your comparison, here's a snippet of bonnie++ results from one of my servers: RHEL 4, P4 3GHz, MegaRAID firmware 1L37, write-thru cache setup, RAID 1; I think the drives are 10K RPM Seagate Cheetahs.
This is from the end of the drive where performance is the worst (I partitioned the important stuff at the beginning where it's fastest and don't have enough free space to run bonnie there):

    --Sequential Output--             --Sequential Input-    --Random-
    -Per Chr- --Block-- -Rewrite-     -Per Chr- --Block--    --Seeks--
    K/sec %CP K/sec %CP K/sec %CP     K/sec %CP K/sec %CP    /sec  %CP
    20708  50 21473   9  9603   3     34419  72 55799   7    467.1   1

21MB/s writes, 56MB/s reads. Not too different from yours (especially if your results were from the beginning of the disk), and certainly nothing special. I might be able to tune the write performance higher if I cared; the battery-backed cache sits unused and everything is tuned for paranoia rather than performance. On this machine it doesn't matter. The thing is, even though it's rarely the top-performing card even when set up perfectly, the LSI SCSI MegaRAID just works. The driver is stable, caching behavior is well defined, and it's a pleasure to administer. I'm never concerned that it's lying to me or doing anything to put data at risk. The command-line tools for Linux work perfectly, let me look at or control whatever I want, and it was straightforward for me to make my own customized monitoring script using them. LSI MegaRAID has proved to be a bit of a disappointment. I have seen better numbers from the HP SmartArray 6i, and from 3ware cards with 7200RPM SATA drives. Whereas although I use 7200RPM SATA drives, I always try to keep an eye on them because I never really trust them. The performance list archives here also have plenty of comments about people having issues with the SmartArray controllers; search the archives for cciss and you'll see what I'm talking about. The MegaRAID controller is very boring. That's why I like it. As a Linux distribution, Red Hat has similar characteristics. If I were going for a performance setup, I'd dump that, too, for something sexier with a newish kernel.
It all depends on which side of the performance/stability tradeoff you're aiming at. On Mon, 4 Dec 2006, Scott Marlowe wrote: Does RHEL4 have the megaraid 2 driver? This is from the moderately current RHEL4 installation I had results from above. Redhat has probably done a kernel rev since I last updated back in September, haven't needed or wanted to reboot since then: megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005) megaraid: 2.20.4.6-rh2 (Release Date: Wed Jun 28 12:27:22 EST 2006) -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Bad iostat numbers
People recommend LSI MegaRAID controllers on here regularly, but I have found that they do not work that well. I have bonnie++ numbers that show the controller is not performing anywhere near the disks' saturation level in a simple RAID 1 on Red Hat Linux EL4, on two separate machines provided by two different hosting companies. In one case I asked them to replace the card, and the numbers got a bit better, but still not optimal. LSI MegaRAID has proved to be a bit of a disappointment. I have seen better numbers from the HP SmartArray 6i, and from 3ware cards with 7200RPM SATA drives. For the output: http://www.infoconinc.com/test/bonnie++.html (the first line is a six-drive RAID 10 on a 3ware 9500S; the next three are all RAID 1s on LSI MegaRAID controllers, verified by lspci). Alex.

On 12/4/06, Greg Smith [EMAIL PROTECTED] wrote: On Thu, 30 Nov 2006, Carlos H. Reimer wrote: I would like to discover how much cache is present in the controller; how can I find this value from Linux? As far as I know there is no cache on an Adaptec 39320. The write-back cache Linux was reporting on was the one in the drives, which is 8MB; see http://www.seagate.com/cda/products/discsales/enterprise/tech/1,1593,541,00.html Be warned that running your database with the combination of an uncached controller plus disks with write caching is dangerous to your database integrity. There is a common problem with the Linux driver for this card (aic7902) where it enters what they're calling an Infinite Interrupt Loop. That seems to match your readings. Here is a typical iostat -x:

    Device: rrqm/s wrqm/s  r/s  w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz  avgqu-sz await  svctm  %util
    sda       0.00   7.80 0.40 6.40  41.60 113.60 20.80 56.80    22.82 570697.50 10.59 147.06 100.00

An avgqu-sz of 570697.50 is extremely large. That explains why the utilization is 100%: there's a massive number of I/O operations queued up that aren't getting flushed out.
The read and write data says these drives are barely doing anything, as 20kB/s and 57kB/s are practically idle; they're not even remotely close to saturated. See http://lkml.org/lkml/2005/10/1/47 for a suggested workaround that may reduce the magnitude of this issue; lowering the card's speed to U160 in the BIOS was also listed as a useful workaround. You might get better results by upgrading to a newer Linux kernel, and just rebooting to clear out the garbage might help if you haven't tried that yet. On the pessimistic side, other people reporting issues with this controller are: http://lkml.org/lkml/2005/12/17/55 http://www.ussg.iu.edu/hypermail/linux/kernel/0512.2/0390.html http://www.linuxforums.org/forum/peripherals-hardware/59306-scsi-hangs-boot.html and even under FreeBSD at http://lists.freebsd.org/pipermail/aic7xxx/2003-August/003973.html This Adaptec card just barely works under Linux, which happens regularly with their controllers, and my guess is that you've run into one of the ways it goes crazy sometimes. I just chuckled when checking http://linux.adaptec.com/ again and noticing they can't even be bothered to keep that server up at all. According to http://www.adaptec.com/en-US/downloads/linux_source/linux_source_code?productId=ASC-39320-R&dn=Adaptec+SCSI+Card+39320-R the driver for your card is *minimally tested* for Linux Kernel v2.6 on all platforms. Adaptec doesn't care about Linux support on their products; if you want a SCSI controller that actually works under Linux, get an LSI MegaRAID. If this were really a Postgres problem, I wouldn't expect %iowait=1.10. Were the database engine waiting to read/write data, that number would be dramatically higher. Whatever is generating all these I/O requests, it's not waiting for them to complete like the database would be. Besides the driver problems that I'm very suspicious of, I'd suspect a runaway process writing garbage to the disks might also cause this behavior.
I've taken a look in /var/log/messages and found some temperature messages about the disk drives: Nov 30 11:08:07 totall smartd[1620]: Device: /dev/sda, Temperature changed 2 Celsius to 51 Celsius since last report. Can this temperature influence the performance? That's close to the upper tolerance for this drive (55 degrees), which means the drive is being cooked and will likely wear out quickly. But that won't slow it down, and you'd get much scarier messages out of smartd if the drives had a real problem. You should improve cooling in this case if you want the drives to have a healthy life; odds are low this is relevant to your performance issue, though. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
[PERFORM] Performance of Perc 5i
Does anyone have any performance experience with the Dell Perc 5i controllers in RAID 10/RAID 5? Thanks, Alex
[PERFORM] Confusion and Questions about blocks read
The query explain analyze looks like this:

click-counter=# explain analyze select count(*) as count, to_char(date_trunc('day',c.datestamp),'DD-Mon') as day from impression c, url u, handle h where c.url_id=u.url_id and c.handle_id=h.handle_id and h.handle like '1.19%' group by date_trunc('day',c.datestamp) order by date_trunc('day',c.datestamp);

    QUERY PLAN
    ------------------------------------------------------------------
    Sort (cost=530282.76..530283.04 rows=113 width=8) (actual time=191887.059..191887.131 rows=114 loops=1)
      Sort Key: date_trunc('day'::text, c.datestamp)
      -> HashAggregate (cost=530276.65..530278.91 rows=113 width=8) (actual time=191886.081..191886.509 rows=114 loops=1)
         -> Hash Join (cost=128.41..518482.04 rows=2358921 width=8) (actual time=17353.281..190568.890 rows=625212 loops=1)
            Hash Cond: ("outer".handle_id = "inner".handle_id)
            -> Merge Join (cost=0.00..444641.52 rows=5896746 width=12) (actual time=34.582..183154.561 rows=5896746 loops=1)
               Merge Cond: ("outer".url_id = "inner".url_id)
               -> Index Scan using url_pkey on url u (cost=0.00..106821.10 rows=692556 width=8) (actual time=0.078..83432.380 rows=692646 loops=1)
               -> Index Scan using impression_url_i on impression c (cost=0.00..262546.95 rows=5896746 width=16) (actual time=34.473..86701.410 rows=5896746 loops=1)
            -> Hash (cost=123.13..123.13 rows=2115 width=4) (actual time=40.159..40.159 rows=2706 loops=1)
               -> Bitmap Heap Scan on handle h (cost=24.69..123.13 rows=2115 width=4) (actual time=20.362..36.819 rows=2706 loops=1)
                  Filter: (handle ~~ '1.19%'::text)
                  -> Bitmap Index Scan on handles_i (cost=0.00..24.69 rows=2115 width=0) (actual time=20.264..20.264 rows=2706 loops=1)
                     Index Cond: ((handle >= '1.19'::text) AND (handle < '1.1:'::text))
    Total runtime: 191901.868 ms

(Looks like it sped up a bit the second time I did it.)

When I query relpages for the tables involved:

click-counter=# select relpages from pg_class where relname='impression';
 relpages
----------
    56869
(1 row)

click-counter=# select relpages from pg_class where relname='url';
 relpages
----------
    66027
(1 row)

click-counter=# select relpages from pg_class where relname='handle';
 relpages
----------
       72
(1 row)

they only total 122968. How come the query statistics showed that 229066 blocks were read, given that all the blocks in all the tables put together only total 122968?

LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 218.630786 elapsed 24.16 user 13.93 system sec
! [261.00 user 85.61 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 65/47 [20176/99752] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [0/0] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks: 229066 read, 2 written, buffer hit rate = 55.61%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written

Alex.
Re: [PERFORM] Confusion and Questions about blocks read
Ahh, good point. Thanks.

On 9/22/06, Tom Lane [EMAIL PROTECTED] wrote: Alex Turner [EMAIL PROTECTED] writes: How come the query statistics showed that 229066 blocks were read given that all the blocks in all the tables put together only total 122968? You forgot to count the indexes. Also, the use of index scans in the merge joins probably causes multiple re-reads of some table blocks, depending on just what the physical ordering of the rows is. regards, tom lane
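The arithmetic behind Tom's answer, spelled out (the numbers are taken from the earlier message; attributing the entire remainder to index pages and re-reads is an approximation):

```python
# Heap pages alone (relpages from pg_class) cannot account for all the
# blocks read, so the remainder must come from index pages plus re-reads
# of heap blocks during the index scans.
heap_pages = 56869 + 66027 + 72  # impression + url + handle
blocks_read = 229066             # from the QUERY STATISTICS output
print(heap_pages)                # 122968
print(blocks_read - heap_pages)  # 106098 blocks from indexes and re-reads
```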
Re: [PERFORM] Confusion and Questions about blocks read
Ok - so I have another mystery: I insert virtually the same rows into two different tables:

trend=# insert into fish select 2, nextval('result_entry_order_seq'), property_id from property;
INSERT 0 59913
trend=# insert into result_entry select 0, nextval('result_entry_order_seq'), property_id from property;
INSERT 0 59913

but the stats show one as having written 20x as many blocks:

LOG: statement: insert into fish select 2, nextval('result_entry_order_seq'), property_id from property;
LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 2.098067 elapsed 0.807877 user 1.098833 system sec
! [23.875370 user 27.789775 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 0/1 [5/62269] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 72/6 [18464/1126] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks: 79106 read, 420 written, buffer hit rate = 79.39%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written

LOG: statement: insert into result_entry select 0, nextval('result_entry_order_seq'), property_id from property;
LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 16.963729 elapsed 3.533463 user 1.706740 system sec
! [27.408833 user 29.497515 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 0/1186 [5/63455] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 59/139 [18525/1265] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks: 100744 read, 7352 written, buffer hit rate = 89.71%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written

I understand the read blocks difference; the second had to check indexes matching the foreign keys. The table definitions are given below:

trend=# \d fish
Table "public.fish"
       Column       |  Type   | Modifiers
--------------------+---------+-----------
 result_id          | bigint  |
 result_entry_order | bigint  |
 property_id        | integer |
Indexes: fish_pkey UNIQUE, btree (result_id, result_entry_order)

trend=# \d result_entry
Table "public.result_entry"
       Column       |  Type   | Modifiers
--------------------+---------+-----------
 result_id          | bigint  |
 result_entry_order | bigint  |
 property_id        | integer |
Indexes: fish_pkey UNIQUE, btree (result_id, result_entry_order)

The explain analyzes are kind of interesting:

trend=# explain analyze insert into fish select 2, nextval('result_entry_order_seq'), property_id from property;
    QUERY PLAN
    ------------------------------------------------------------------
    Seq Scan on property (cost=0.00..79295.70 rows=59913 width=8) (actual time=0.275..1478.681 rows=59913 loops=1)
    Total runtime: 2178.600 ms
(2 rows)

trend=# explain analyze insert into result_entry select 0, nextval('result_entry_order_seq'), property_id from property;
    QUERY PLAN
    ------------------------------------------------------------------
    Seq Scan on property (cost=0.00..79295.70 rows=59913 width=8) (actual time=0.118..1473.352 rows=59913 loops=1)
    Trigger for constraint result_entry_result_fk: time=2037.351 calls=59913
    Trigger for constraint result_entry_property_fk: time=8622.260 calls=59913
    Total runtime: 12959.716 ms
(4 rows)

I don't understand the time for the FK checks given the size of the tables they are checking against (and I understand it's the indexes, not the tables, that the actual check is made against):

trend=# select count(*) from result_cache;
 count
-------
     8
(1 row)

trend=# select count(*) from property;
 count
-------
 59913
(1 row)

The database was just re-indexed, no changes beyond this insert were made in that time, and result_entry has recently been vacuumed. Any insight would be greatly appreciated. Alex

On 9/22/06, Alex Turner [EMAIL PROTECTED] wrote: ahh good point. Thanks. On 9/22/06, Tom Lane [EMAIL PROTECTED] wrote: Alex Turner [EMAIL PROTECTED] writes: How come the query
statistics showed that 229066 blocks were read given that all the blocks in all the tables put together only total 122968? You forgot to count the indexes. Also, the use of index scans in the merge joins probably causes multiple re-reads of some table blocks, depending on just what the physical ordering of the rows is. regards, tom lane
Re: [PERFORM] Large tables (was: RAID 0 not as fast as expected)
Do the basic math: if you have a table with 100 million records, each of which is 200 bytes long, that gives you roughly 20 gig of data (assuming it was all written neatly and hasn't been updated much). If you have to do a full table scan, then it will take roughly 400 seconds with a single 10k RPM SCSI drive with an average read speed of 50MB/sec. If you are going to read indexes, figure out how big your index is, how many blocks will be returned, and how many blocks this will require transferring from the main table; make an estimate of the seeks, add in the transfer total, and you have a time to get your data. A big array with a good controller can pass 1000MB/sec transfer on the right bus if you buy the right technologies. But be warned: if you buy the wrong ones, your big array can end up being slower than a single drive for sequential transfer. At 1000MB/sec your scan would take 20 seconds. Be warned, the tech specs page http://www.sun.com/servers/x64/x4500/specs.xml#anchor3 doesn't mention RAID 10 as a possibility, and this is probably what most would recommend for fast data access if you are doing both read and write operations. If you are doing mostly reads, then RAID 5 is passable, but its redundancy with large numbers of drives is not so great. Alex.

On 9/18/06, Bucky Jordan [EMAIL PROTECTED] wrote: good normalization skills are really important for large databases, along with materialization strategies for 'denormalized sets'. Good points - thanks. I'm especially curious what others have done for the materialization. The matview project on gborg appears dead, and I've only found a smattering of references on google. My guess is, you roll your own for optimal performance...
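The "basic math" above works out as follows (a sketch using the post's own assumed figures; real scans add seek and controller overhead on top):

```python
# Full-table-scan time estimate from the figures in the post above.
rows = 100_000_000
bytes_per_row = 200
table_mb = rows * bytes_per_row / 1_000_000  # ~20,000 MB, i.e. about 20 GB

single_drive_mb_s = 50   # one 10k RPM SCSI drive, average read speed
big_array_mb_s = 1000    # well-chosen array on the right bus

print(table_mb / single_drive_mb_s)  # 400.0 seconds
print(table_mb / big_array_mb_s)     # 20.0 seconds
```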
regarding the number of rows, there is no limit to how much pg can handle per se, just some practical limitations, especially vacuum and reindex times. these are important because they are required to keep a handle on mvcc bloat, and it's very nice to be able to vacuum bits of your database at a time. I was hoping for some actual numbers on "practical". Hardware isn't too much of an issue (within reason - we're not talking an Amazon or Google here... the SunFire X4500 looks interesting...) - if a customer wants to store that much data, and pay for it, we'll figure out how to do it. I'd just rather not have to re-design the database. Say the requirement is to keep 12 months of data accessible, each scan produces 100M records, and I run one per month. What happens if the customer wants to run it once a week? I was more trying to figure out at what point (ballpark) I'm going to have to look into archive tables and things of that nature (or at Bizgres/MPP). It's easier for us to add more/bigger hardware, but not so easy to redesign/add history tables... just another fyi, if you have a really big database, you can forget about doing pg_dump for backups (unless you really don't care about being x day or days behind)... you simply have to do some type of replication/failover strategy. i would start with pitr. merlin I was originally thinking replication, but I did notice some nice PITR features in 8.x - I'll have to look into that some more. Thanks for the pointers though... - Bucky ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Large tables (was: RAID 0 not as fast as expected)
Sweet - that's good - RAID 10 support seems like an odd thing to leave out. Alex On 9/18/06, Luke Lonergan [EMAIL PROTECTED] wrote: Alex, On 9/18/06 4:14 PM, Alex Turner [EMAIL PROTECTED] wrote: Be warned, the tech specs page: http://www.sun.com/servers/x64/x4500/specs.xml#anchor3 doesn't mention RAID 10 as a possibility, and this is probably what most would recommend for fast data access if you are doing both read and write operations. If you are doing mostly reads, then RAID 5 is passable, but its redundancy with large numbers of drives is not so great. RAID10 works great on the X4500 - we get 1.6GB/s+ per X4500 using RAID10 in ZFS. We worked with the Sun Solaris kernel team to make that happen and the patches are part of Solaris 10 Update 3 due out in November. - Luke
Re: [PERFORM] How to get higher tps
First things first, run a bonnie++ benchmark, and post the numbers. That will give a good indication of raw IO performance, and is often the first indication of problems separate from the DB. We have seen pretty bad performance from SANs in the past. How many FC lines do you have running to your server? Remember each line is limited to about 200MB/sec; to get good throughput, you will need multiple connections. When you run pgbench, run iostat also and see what the numbers say. Alex. On 8/22/06, Mark Lewis [EMAIL PROTECTED] wrote: Well, at least on my test machines running gnome-terminal, my pgbench runs tend to get throttled by gnome-terminal's lousy performance to no more than 300 tps or so. Running with 2>/dev/null to throw away all the detailed logging gives me 2-3x improvement in scores. Caveat: in my case the db is on the local machine, so who knows what all the interactions are. Also, when you initialized the pgbench db what scaling factor did you use? And does running pgbench with -v improve performance at all? -- Mark On Tue, 2006-08-22 at 09:19 -0400, Marty Jia wrote: Joshua, Here is shared_buffers = 8 fsync = on max_fsm_pages = 35 max_connections = 1000 work_mem = 65536 effective_cache_size = 61 random_page_cost = 3 Here is pgbench I used: pgbench -c 10 -t 1 -d HQDB Thanks Marty -Original Message- From: Joshua D. Drake [mailto:[EMAIL PROTECTED]] Sent: Monday, August 21, 2006 6:09 PM To: Marty Jia Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to get higher tps Marty Jia wrote: I'm exhausted after trying all the performance tuning ideas, like the following parameters: shared_buffers fsync max_fsm_pages max_connections shared_buffers work_mem max_fsm_pages effective_cache_size random_page_cost I believe all the above have the right sizes and values, but I just can not get tps higher than 300 as tested by pgbench. What values did you use?
Here is our hardware: Dual Intel Xeon 2.8GHz, 6GB RAM, Linux 2.4 kernel, RedHat Enterprise Linux AS 3, 200GB for PGDATA on 3Par, ext3, 50GB for WAL on 3Par, ext3, with PostgreSql 8.1.4. We don't have an i/o bottleneck. Are you sure? What does iostat say during a pgbench? What parameters are you passing to pgbench? Well in theory, upgrading to the 2.6 kernel will help, as well as making your WAL ext2 instead of ext3. What else can I try to get better tps? Someone told me I should get tps over 1500; it is hard to believe. 1500? Hmmm... I don't know about that, I can get 470tps or so on my measly dual core 3800 with 2gig of ram though. Joshua D. Drake Thanks Marty ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
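For reference, the "about 200MB/sec per FC line" figure quoted in this thread falls out of the encoding: 2Gb Fibre Channel signals at roughly 2.125 Gbaud and uses 8b/10b encoding, i.e. 10 line bits per data byte. A sketch of that arithmetic (the 800 MB/s target is just an illustrative number, not from the thread):

```python
# Usable bandwidth of one 2Gb Fibre Channel link, before protocol overhead.
line_rate_baud = 2.125e9                 # 2Gb FC signaling rate
bytes_per_sec = line_rate_baud / 10      # 8b/10b: 10 wire bits per payload byte
mb_per_sec = bytes_per_sec / 1e6         # ~212 MB/s raw, ~200 in practice

target_mb = 800                          # hypothetical throughput goal
links_needed = -(-target_mb // 200)      # ceiling division at ~200 MB/s per link
print(f"per link: ~{mb_per_sec:.0f} MB/s")
print(f"links for {target_mb} MB/s: {links_needed}")
```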
Re: [PERFORM] How to get higher tps
Oh - and it's useful to know if you are CPU bound or IO bound. Check top or vmstat to get an idea of that. Alex On 8/22/06, Alex Turner [EMAIL PROTECTED] wrote: First things first, run a bonnie++ benchmark, and post the numbers. [snip - rest of quote identical to the message above]
Re: [PERFORM] Hardware upgraded but performance still ain't good enough
First off - very few third party tools support debian. Debian is a sure-fire way to have an unsupported system. Use RedHat or SuSe (flame me all you want, it doesn't make it less true). Second, run a bonnie++ benchmark against your disk array(s) to see what performance you are getting, and make sure it's reasonable. Single drives for stuff is not a great way to go for either speed or reliability; highly not recommended for a production system. Use SAS or SATA for the best speed for your $$s; don't buy a SAN, they are overpriced and often don't perform. RAM could be more, to be honest, too. Diagnosing the bottleneck can be done with combinations of top, iostat and vmstat. If you have high iowait numbers then your system is waiting on the disks. If you have high system CPU usage, then disks are also worth a look, but not as bad as high iowait. If you have high user CPU with little iowait and little system CPU, and very little io activity in iostat, then you are CPU bound. If you are IO bound, you need to figure out if it's reads or writes. If it's reads, then more RAM will help. If it's writes, then you need more spindles and more controller cache with RAID (please think carefully before using RAID 5 in a write intensive environment, it's not ideal). The other thing is you will probably want to turn on stats in postgres to figure out which queries are the bad ones (does anyone have good docs posted for this?). Once you have identified the bad queries, you can explain analyze them, and figure out why they suck. Alex. On 8/7/06, Alvaro Nunes Melo [EMAIL PROTECTED] wrote: Hi, First of all I must tell that my reality in a southern brazilian city is way different than what we read in the list. I was looking for ways to find the HW bottleneck and saw a configuration like: we recently upgraded our dual Xeon Dell to a brand new Sun v40z with 4 opterons, 16GB of memory and MegaRAID with enough disks. OS is Debian Sarge amd64, PostgreSQL is 8.0.3.
on (http://archives.postgresql.org/pgsql-performance/2005-07/msg00431.php) Our old server was a very modest Dell Xeon 2.8 (512 Kb Cache), with 1 GB RAM and one SCSI disc. This server runs PostgreSQL (8.1.4), Apache (PHP) and other minor services. We managed to get a test machine, a HP Xeon 3.2 (2 MB cache), also with 1 GB RAM but 4 SCSI discs (on one single array controller). They're organized in the following way: disk 0: Linux root; disk 1: database cluster; disk 2: pg_xlog; disk 3: a dir that suffers constant read/write operations. The database size stands around 10 GB. The new server has better performance than the old one, but sometimes it still gets stuck. We tried to use a HP proprietary tool to monitor the server, and find out what is the bottleneck, but it's been difficult to install it on Debian. The tool is only certified for SuSe and RedHat. So we tried to use some Linux tools to see what's going on, like vmstat and iostat. Are these tools (vmstat and iostat) enough? Should we use something else? Is there any specific material about finding bottlenecks in Linux/PostgreSQL machines? Is our disk layout proper? I really apologize for my lack of knowledge in this area, and for the excessive number of questions in a single e-mail. Best regards, Alvaro ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
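For the "turn on stats" suggestion above, here is a hedged sketch of the relevant postgresql.conf knobs as they were named in the 8.0/8.1 era (the 100 ms threshold is illustrative, not a recommendation):

```
# Log any statement running longer than 100 ms, so slow queries surface
log_min_duration_statement = 100

# Feed the pg_stat_* views (pre-8.3 parameter names)
stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on
```

With those on, the server log and the pg_stat views point at the queries worth an explain analyze.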
Re: [PERFORM] Postgresql Performance on an HP DL385 and
These numbers are pretty darn good for a four disk RAID 10, pretty close to perfect in fact. Nice advert for the 642 - I guess we have a hardware RAID controller that will read independently from mirrors. Alex On 8/8/06, Steve Poe [EMAIL PROTECTED] wrote: Luke, Here are the results of two runs of 16GB file tests on XFS. scsi disc array xfs ,16G,81024,99,153016,24,73422,10,82092,97,243210,17,1043.1,0,16,3172,7,+,+++,2957,9,3197,10,+,+++,2484,8 scsi disc array xfs ,16G,83320,99,155641,25,73662,10,81756,96,243352,18,1029.1,0,16,3119,10,+,+++,2789,7,3263,11,+,+++,2014,6 Thanks. Steve Can you run bonnie++ version 1.03a on the machine and report the results here? It could be OK if you have the latest Linux driver for cciss; someone has reported good results to this list with the latest, bleeding edge version of Linux (2.6.17). - Luke ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Performance with 2 AMD/Opteron 2.6Ghz and 8gig
Although I for one have yet to see a controller that actually does this (I believe software RAID on linux doesn't either). Alex. On 8/7/06, Markus Schaber [EMAIL PROTECTED] wrote: Hi, Charles, Charles Sprickman wrote: I've also got a 1U with a 9500SX-4 and 4 drives. I like how the 3Ware card scales there - started with 2 drives and got drive speed mirroring. Added two more and most of the bonnie numbers doubled. This is not what I'm used to with the Adaptec SCSI junk. Well, for sequential reading, you should be able to get double drive speed on a 2-disk mirror with a good controller, as it can balance the reads among the drives. Markus -- Markus Schaber | Logical Tracking & Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
Re: [PERFORM] RAID stripe size question
This is a great testament to the fact that very often software RAID will seriously outperform hardware RAID because the OS guys who implemented it took the time to do it right, as compared with some controller manufacturers who seem to think it's okay to provide sub-standard performance. Based on the bonnie++ numbers coming back from your array, I would also encourage you to evaluate software RAID, as you might see significantly better performance as a result. RAID 10 is also a good candidate as it's not so heavy on the cache and CPU as RAID 5. Alex. On 7/18/06, Luke Lonergan [EMAIL PROTECTED] wrote: Mikael, On 7/18/06 6:34 AM, Mikael Carneholm [EMAIL PROTECTED] wrote: However, what's more important is the seeks/s - ~530/s on a 28 disk array is quite lousy compared to the 1400/s on a 12 x 15K disk array. I'm getting 2500 seeks/second on a 36 disk SATA software RAID (ZFS, Solaris 10) on a Sun X4500: === Single Stream With a very recent update to the zfs module that improves I/O scheduling and prefetching, I get the following bonnie++ 1.03a results with a 36 drive RAID10, Solaris 10 U2 on an X4500 with 500GB Hitachi drives (zfs checksumming is off): Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP thumperdw-i-1 32G 120453 99 467814 98 290391 58 109371 99 993344 94 1801 4 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 + +++ + +++ + +++ 30850 99 + +++ + +++ === Two Streams Bumping up the number of concurrent processes to 2, we get about 1.5x speed reads of RAID10 with a concurrent workload (you have to add the rates together): Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
/sec %CP thumperdw-i-1 32G 111441 95 212536 54 171798 51 106184 98 719472 88 1233 2 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 26085 90 + +++ 5700 98 21448 97 + +++ 4381 97 Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP thumperdw-i-1 32G 116355 99 212509 54 171647 50 106112 98 715030 87 1274 3 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 26082 99 + +++ 5588 98 21399 88 + +++ 4272 97 So that's 2500 seeks per second, 1440MB/s sequential block read, 212MB/s per character sequential read. === - Luke
Re: [PERFORM] RAID stripe size question
With 18 disks dedicated to data, you could make 1000/7*9 seeks/second (7ms average seek time, 9 independent mirror pairs), which is ~1285 seeks/second, writing on average 64KB of data per seek, which is roughly 80MB/sec random-write throughput worst case; coalesced sequential writes should do considerably better. You might want to take more disks for your data and less for your WAL. Someone check my math here... And as always - run benchmarks with your app to verify. Alex. On 7/16/06, Mikael Carneholm [EMAIL PROTECTED] wrote: I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146Gb drives, currently grouped as 10 (for wal) + 18 (for data). There's only one controller (an emulex), but I hope performance won't suffer too much from that. Raid level is 0+1, filesystem is ext3. Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? In theory, a smaller stripe size (8-32K) should increase sequential write throughput at the cost of decreased positioning performance, which sounds good for WAL (assuming WAL is never searched during normal operation). And for disks holding the data, a larger stripe size (32K) should provide for more concurrent (small) reads/writes at the cost of decreased raw throughput. This is with an OLTP type application in mind, so I'd rather have high transaction throughput than high sequential read speed. The interface is a 2Gb FC so I'm throttled to (theoretically) 192MB/s, anyway. So, does this make sense? Has anyone tried it and seen any performance gains from it? Regards, Mikael.
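Checking the math above (as invited): under the same assumptions - 7 ms average seek, 9 independent mirror pairs, ~64 KB written per seek - a quick sketch gives:

```python
# Worst-case random-write throughput for the 18-disk data array:
# each RAID 0+1 mirror pair acts as one independent seek unit.
avg_seek_ms = 7.0
pairs = 9                                   # 18 data disks -> 9 mirror pairs
write_kb = 64

seeks_per_pair = 1000 / avg_seek_ms         # ~143 seeks/s per pair
total_seeks = seeks_per_pair * pairs        # ~1286 seeks/s across the array
mb_per_sec = total_seeks * write_kb / 1024  # ~80 MB/s worst case

print(f"~{total_seeks:.0f} seeks/s, ~{mb_per_sec:.0f} MB/s random writes")
```

This is the seek-bound floor; sequential or cache-coalesced writes will do considerably better.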
Re: [PERFORM] RAID stripe size question
On 7/17/06, Mikael Carneholm [EMAIL PROTECTED] wrote: This is something I'd also like to test, as a common best practice these days is to go for a SAME (stripe all, mirror everything) setup. From a development perspective it's easier to use SAME as the developers won't have to think about physical location for new tables/indices, so if there's no performance penalty with SAME I'll gladly keep it that way. Usually, it's not the developer's task to care about that, but the DBA's responsibility. As we don't have a full-time dedicated DBA (although I'm the one who does most DBA related tasks) I would aim for making physical location as transparent as possible, otherwise I'm afraid I won't be doing anything else than supporting developers with that - and I *do* have other things to do as well :) In a previous test, using cd=5000 and cs=20 increased transaction throughput by ~20%, so I'll definitely fiddle with that in the coming tests as well. How many parallel transactions do you have? That was when running BenchmarkSQL (http://sourceforge.net/projects/benchmarksql) with 100 concurrent users (terminals), which I assume means 100 parallel transactions at most. The target application for this DB has 3-4 times as many concurrent connections so it's possible that one would have to find other cs/cd numbers better suited for that scenario.
Tweaking bgwriter is another task I'll look into as well.. Btw, here's the bonnie++ results from two different array sets (10+18, 4+24) on the MSA1500, columns as in the standard bonnie++ 1.03 report:

Version 1.03   --Sequential Output-- --Sequential Input- --Random-
               -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine  Size  K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
               --Sequential Create-- --------Random Create--------
         files -Create- --Read- -Delete- -Create- --Read- -Delete-

LUN: WAL, 10 disks, stripe size 32K
sesell01  32G  56139 93  73250 22  16530  3  30488 45  57489  5  477.3  1
           16   2458 90    + +++    + +++   3121 99    + +++  10469 98

LUN: WAL, 4 disks, stripe size 8K
sesell01  32G  49170 82  60108 19  13325  2  15778 24  21489  2  266.4  0
           16   2432 86    + +++    + +++   3106 99    + +++  10248 98

LUN: DATA, 18 disks, stripe size 32K
sesell01  32G  59990 97  87341 28  19158  4  30200 46  57556  6  495.4  1
           16   1640 92    + +++    + +++   1736 99    + +++  10919 99

LUN: DATA, 24 disks, stripe size 64K
sesell01  32G  59443 97 118515 39  25023  5  30926 49  60835  6  531.8  1
           16   2499 90    + +++    + +++   2817 99    + +++  10971 100

These bonnie++ numbers are very worrying. Your controller should easily max out your FC interface on these tests, passing 192MB/sec with ease on anything more than a 6 drive RAID 10. This is a bad omen if you want high performance... Each mirror pair can do 60-80MB/sec. A 24 disk RAID 10 can do 12*60MB/sec which is 720MB/sec - I have seen this performance, it's not unreachable, but time and again we see these bad perf numbers from FC and SCSI systems alike. Consider a different controller, because this one is not up to snuff. A single drive would get better numbers than your 4 disk RAID 10; 21MB/sec read speed is really pretty sorry, it should be closer to 120MB/sec. If you can't swap out, software RAID may turn out to be your friend. The only saving grace is that this is OLTP, and perhaps, just maybe, the controller will be better at ordering IOs, but I highly doubt it. Please people, do the numbers, benchmark before you buy; many many HBAs really suck under Linux/FreeBSD, and you may end up paying vast sums of money for very sub-optimal performance (I'd say sub-standard, but alas, it seems that this kind of poor performance is tolerated, even though it's way off where it should be). There's no point having a 40 disk cab, if your controller can't handle
Re: [PERFORM] RAID stripe size question
On 7/17/06, Ron Peacetree [EMAIL PROTECTED] wrote: -Original Message- From: Mikael Carneholm [EMAIL PROTECTED] Sent: Jul 17, 2006 5:16 PM To: Ron Peacetree [EMAIL PROTECTED], pgsql-performance@postgresql.org Subject: RE: [PERFORM] RAID stripe size question 15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of 7-8ms. Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic? Ah, the games vendors play. Average seek time for a 10Krpm HD may very well be 4.9ms. However, what matters to you the user is average =access= time. The 1st is how long it takes to position the heads to the correct track. The 2nd is how long it takes to actually find and get data from a specified HD sector. 28 HDs as above set up as 2 RAID 10's = ~75MBps*5 = ~375MB/s, ~75*9 = ~675MB/s. I guess it's still limited by the 2Gbit FC (192MB/s), right? No. A decent HBA has multiple IO channels on it. So for instance Areca's ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA II Controller) has 2 4Gbps FCs in it (...and can support up to 4GB of BB cache!). Nominally, this card can push 8Gbps = 800MBps. ~600-700MBps is the RW number.
Assuming ~75MBps ASTR per HD, that's ~enough bandwidth for a 16 HD RAID 10 set per ARC-6080. Very, very few RAID controllers can do >= 1GBps. One thing that helps greatly with bursty IO patterns is to up your battery backed RAID cache as high as you possibly can. Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-) I use 90% of the raid cache for writes, don't think I could go higher than that. Too bad the emulex only has 256Mb though :/ If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater. I've definitely seen access patterns that benefited from increased RAID cache for any size I could actually install. For those access patterns, no amount of RAID cache commercially available was enough to find the flattening point of the cache percentage curve. 256MB of BB RAID cache per HBA is just not that much for many IO patterns. 90% as in 90% of the RAM, not 90% hit rate, I'm imagining. The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID= ), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet. This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on it is going to be ~320MBps. Or ~enough for an 8 HD RAID 10 set made of 75MBps ASTR HDs. 28 such HDs are =definitely= IO choked on this HBA. No, they aren't. This is OLTP, not data warehousing. I already posted math for OLTP throughput, which is on the order of 8-80MB/second actual data throughput based on maximum theoretical seeks/second. The arithmetic suggests you need a better HBA or more HBAs or both.
WALs are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs. Unless of course you are running OLTP, in which case a big stripe isn't necessary; spend the disks on your data partition, because your WAL activity is going to be small compared with your random IO. According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? (As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.) I guess I'll have to find out which theory holds by good ol' trial and error... :) IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal for most access patterns + SW + FS + OS + HW. New records will be posted at the end of a file, and will only increase the file by the number of blocks in the transactions posted at write time. Updated records are modified in place unless they have grown too big to be in place. If you are updating multiple tables on each transaction, a 64kb stripe size or lower is probably going to be best as block sizes are just 8kb. How much data does your average transaction write? How many xacts per second? This will help determine how many writes your cache will queue up before it flushes, and therefore what the optimal stripe size will be. Of course, the fastest and most accurate way is probably just to try different settings and see how it works. Alas some controllers seem to handle some stripe sizes more efficiently in defiance of any logic. Work out how big your xacts are, how many xacts/second you can post, and you will figure out how fast WAL will be writing.
Allocate enough disk for peak load plus planned expansion on WAL and then put the rest to tablespace. You may well find that a single RAID 1 is enough for WAL (if you achieve theoretical performance levels,
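Ron's seek-time vs access-time distinction is easy to quantify: average access time is average seek time plus average rotational latency, which is half a revolution. A sketch using commonly quoted figures:

```python
# Average access time = seek time + rotational latency (half a revolution).
def avg_access_ms(seek_ms: float, rpm: int) -> float:
    half_rev_ms = 60_000 / rpm / 2      # ms per half revolution
    return seek_ms + half_rev_ms

# The vendor's "4.9 ms" seek figure lands in the 7-8 ms access range
# quoted at the top of the thread once rotation is included.
print(f"10Krpm, 4.9 ms seek: {avg_access_ms(4.9, 10_000):.1f} ms access")
print(f"15Krpm, 3.5 ms seek: {avg_access_ms(3.5, 15_000):.1f} ms access")
```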
[PERFORM] Quick question about top...
I have a really stupid question about top, what exactly is iowait CPU time?Alex
Re: [PERFORM] SAN performance mystery
Given the fact that most SATA drives have only an 8MB cache, and your RAID controller should have at least 64MB, I would argue that the system with the RAID controller should always be faster. If it's not, you're getting short-changed somewhere, which is typical on linux, because the drivers just aren't there for a great many controllers that are out there. Alex. On 6/15/06, Mark Lewis [EMAIL PROTECTED] wrote: On Thu, 2006-06-15 at 18:24 -0400, Tom Lane wrote: I agree with Brian's suspicion that the SATA drive isn't properly fsync'ing to disk, resulting in bogusly high throughput. However, ISTM a well-configured SAN ought to be able to match even the bogus throughput, because it should be able to rely on battery-backed cache to hold written blocks across a power failure, and hence should be able to report write-complete as soon as it's got the page in cache rather than having to wait till it's really down on magnetic platter. Which is what the SATA drive is doing ... only it can't keep the promise it's making for lack of any battery backup on its on-board cache. It really depends on your SAN RAID controller. We have an HP SAN; I don't remember the model number exactly, but we ran some tests and with the battery-backed write cache enabled, we got some improvement in write performance, but it wasn't NEARLY as fast as an SATA drive which lied about write completion. The write-and-fsync latency was only about 2-3 times better than with no write cache at all. So I wouldn't assume that just because you've got a write cache on your SAN, you're getting the same speed as fsync=off, at least for some cheap controllers. -- Mark Lewis
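One crude way to spot the "lying about write completion" behavior discussed above is to time fsync'd writes yourself. A lone 7200rpm disk that honestly flushes is bounded by rotation (~120 rev/s), so rates of many thousands of fsyncs per second imply a write-back cache somewhere, battery-backed or not. A hedged sketch, not from the thread:

```python
# Probe for write caches that acknowledge before data is on the platter:
# time small write+fsync cycles and report the achieved rate.
import os
import tempfile
import time

def fsync_rate(iterations: int = 200) -> float:
    """Return achieved write+fsync cycles per second on a temp file."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, b"x" * 512)
            os.fsync(fd)
        return iterations / (time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)

if __name__ == "__main__":
    # ~100-200/s is plausible for an honest single spindle;
    # thousands/s means something is caching the flush.
    print(f"~{fsync_rate():.0f} fsyncs/sec")
```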
Re: [PERFORM] 64-bit vs 32-bit performance ... backwards?
Anyone who has tried x86-64 linux knows what a royal pain in the ass it is. They didn't do anything sensible, like just make the whole OS 64 bit; no, they had to split it up, and put 64-bit libs in a new directory /lib64. This means that a great many applications don't know to check in there for libs, and don't compile pleasantly; php is one among them. I forget what others, it's been awhile now. Of course if you actually want to use more than 4gig RAM in a pleasant way, it's pretty much essential. Alex. On 6/12/06, Steve Atkins [EMAIL PROTECTED] wrote: On Jun 12, 2006, at 6:15 PM, Joshua D. Drake wrote: Empirically... postgresql built for 64 bits is marginally slower than that built for a 32 bit api on sparc. None of my customers have found 64 bit x86 systems to be suitable for production use, yet, so I've not tested on any of those architectures. Really? All of our customers are migrating to Opteron and I have many that have been using Opteron for over 12 months happily. An Opteron is 64 bit capable; that doesn't mean you have to run 64 bit code on it. Mine're mostly reasonably conservative users, with hundreds of machines to support. Using 64 bit capable hardware, such as Opterons, is one thing, but using an entirely different linux installation and userspace code, say, is a much bigger change in support terms. In the extreme case it makes no sense to double your OS support overheads to get a single digit percentage performance improvement on one database system. That's not to say that linux/x86-64 isn't production ready for some users, just that it's not necessarily a good operational decision for my customers. Given my internal workloads aren't really stressing the hardware they're on, I don't have much incentive to benchmark x86-64 yet - by the time the numbers might be useful to me we'll be on a different postgresql, likely a different gcc/icc and so on. Cheers, Steve
Re: [PERFORM] 3WARE Card performance boost?
He's talking about RAID 1 here, not a gargantuan RAID 6. Onboard RAM on the controller card is going to make very little difference. All it will do is allow the card to re-order writes to a point (not all cards even do this). Alex. On 1/18/06, William Yu [EMAIL PROTECTED] wrote: Benjamin Arai wrote: Obviously, I have done this to improve write performance for the update each week. My question is if I install a 3ware or similar card to replace my current software RAID 1 configuration, am I going to see a very large improvement? If so, what would be a ball park figure? The key is getting a card with the ability to upgrade the onboard ram. Our previous setup was a LSI MegaRAID 320-1 (128MB), 4xRAID10, fsync=off. Replaced it with a ARC-1170 (1GB) w/ 24x7200RPM SATA2 drives (split into 3 8-drive RAID6 arrays) and performance for us is through the ceiling. For OLTP type updates, we've gotten about +80% increase. For massive 1-statement updates, performance increase is in the +triple digits.
Re: [PERFORM] 3WARE Card performance boost?
http://www.3ware.com/products/serial_ata2-9000.asp Check their data sheet - the cards are BBU ready - all you have to do is order a BBU, which you can do from here: http://www.newegg.com/Product/Product.asp?Item=N82E16815999601 Alex. On 1/18/06, Joshua D. Drake [EMAIL PROTECTED] wrote: Obviously, I have done this to improve write performance for the update each week. My question is if I install a 3ware or similar card to replace my current software RAID 1 configuration, am I going to see a very large improvement? If so, what would be a ball park figure? Well that entirely depends on what level... 1. I would suggest LSI 150-6 not 3ware Why? Because 3ware does not make a midrange card that has a battery backed cache :). That is the only reason. 3ware makes good stuff. So anyway... LSI150-6 with Battery Backed cache option. Put 6 drives on it with a RAID 10 array, turn on write cache and you should have a hauling drive. Joshua D. Drake *Benjamin Arai* [EMAIL PROTECTED] http://www.benjaminarai.com -- The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564 PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] 3WARE Card performance boost?
A 3ware card will re-order your writes to put them more in disk order, which will probably improve performance a bit, but just going from a software RAID 1 to a hardware RAID 1, I would not imagine that you will see much of a performance boost. Really, to get better performance you will need to add more drives, or faster drives. If you are currently running 7200 RPM consumer drives, going to a 10,000 RPM WD Raptor drive will probably increase performance by about 30%, again not all that much. Alex On 1/18/06, Benjamin Arai [EMAIL PROTECTED] wrote: Hi, I am currently doing large weekly updates with fsync=off. My updates involve SELECT, UPDATE, DELETE, etc. Setting fsync=off works for me since I take a complete backup before the weekly update and run a sync and CHECKPOINT after each weekly update has completed to ensure the data is all written to disk. Obviously, I have done this to improve write performance for the update each week. My question is if I install a 3ware or similar card to replace my current software RAID 1 configuration, am I going to see a very large improvement? If so, what would be a ball park figure? Benjamin Arai [EMAIL PROTECTED] http://www.benjaminarai.com
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
It doesn't matter which controller; you still have to actually write the parity blocks, which slows down your write speed because you have to write n+n/2 blocks instead of just n blocks, making the system write 50% more data. RAID 5 must write 50% more data to disk, therefore it will always be slower. Alex. On 12/25/05, Michael Stone [EMAIL PROTECTED] wrote: On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: Caches help, and the bigger the cache the better, but once you are doing enough writes fast enough (and that doesn't take much even with a few GBs of cache) the recalculate-checksums-and-write-new-ones overhead will decrease the write speed of real data. Bear in mind that the HD's _raw_ write speed hasn't been decreased. Those HD's are pounding away as fast as they can for you. Your _effective_ or _data level_ write speed is what decreases due to overhead. You're overgeneralizing. Assuming a large cache and a sequential write, there need be no penalty for raid 5. (For random writes you may need to read unrelated blocks in order to calculate parity, but for large sequential writes the parity blocks should all be read from cache.) A modern cpu can calculate parity for raid 5 on the order of gigabytes per second, and even crummy embedded processors can do hundreds of megabytes per second. You may have run into some lousy implementations, but you should be much more specific about what hardware you're talking about instead of making sweeping generalizations. Side Note: people often forget the other big reason to use RAID 10 over RAID 5. RAID 5 is always only 2 HD failures from data loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss unless you get unlucky and lose both members of a RAID 1 set. IOW, your RAID 10 is only 2 HD failures from data loss also. If that's an issue you need to go with RAID 6 or add another disk to each mirror.
Mike Stone
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
Yes - they work excellently. I have several medium and large servers running 3ware 9500S series cards with great success. We have rebuilt many failed RAID 10s over the years with no problems. Alex On 12/26/05, Benjamin Arai [EMAIL PROTECTED] wrote: Have you had any experience rebuilding arrays in linux using the 3Ware utilities? If so, did it work well? Luke Lonergan wrote: Benjamin, Have you done any benchmarking of the 9550SX against a software raid configuration? Interesting - no, not on SATA, mostly because I've had awful luck with Linux drivers and SATA. The popular manufacturers of SATA to PCI bridge chipsets are Silicon Image and Highpoint, and I've not seen Linux work with them at any reasonable performance yet. I've also had problems with Adaptec's cards - I think they manufacture their own SATA to PCI chipset as well. So far, I've only had good luck with the on-chipset Intel SATA implementation. I think the problems I've had could be entirely driver-related, but in the end it doesn't matter if you can't find drivers that work for Linux. The other problem is getting enough SATA connections for the number of disks we want. I do have two new Areca SATA RAID cards and I'm going to benchmark those against the 3Ware 9550SX with 2 x 8 = 16 disks on one host. I guess we could run the HW RAID controllers in JBOD mode to get a good driver / chipset configuration for software RAID, but frankly I prefer HW RAID if it performs well. So far the SATA host-based RAID is blowing the doors off of every other HW RAID solution I've tested. - Luke -- Benjamin Arai [EMAIL PROTECTED] http://www.benjaminarai.com
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
Yes, but those blocks in RAID 10 are largely irrelevant as they go to independent disks. In RAID 5 you have to write parity to an 'active' drive that is part of the stripe. (They are irrelevant unless of course you are maxing out your SCSI bus - yet another reason why SATA can be faster than SCSI, particularly in RAID 10, where every channel is independent.) Sorry - my math for RAID 5 was a bit off - I don't know why I was considering only a three-drive situation, which is the worst case. It's n+1; you are right. Still, for small arrays that's a big penalty, and there is definitely a penalty, contrary to the assertion of the original poster. I agree totally that the read+parity-calc+write in the worst case is totally bad, which is why I always recommend people should _never ever_ use RAID 5. In this day and age of large-capacity chassis and large-capacity SATA drives, RAID 5 is totally inappropriate IMHO for _any_ application, least of all databases. In reality I have yet to benchmark a system where RAID 5 with 8 drives or less in a single array beat a RAID 10 with the same number of drives. I would definitely be interested in a SCSI card that could actually achieve the theoretical performance of RAID 5, especially under Linux. With RAID 5 you get to watch your system crumble and fail when a drive fails and the array goes into a degraded state. It's just not worth it. Alex. On 12/26/05, David Lang [EMAIL PROTECTED] wrote: On Mon, 26 Dec 2005, Alex Turner wrote: It's irrelevant what controller, you still have to actually write the parity blocks, which slows down your write speed because you have to write n+n/2 blocks instead of just n blocks, making the system write 50% more data. RAID 5 must write 50% more data to disk, therefore it will always be slower. raid5 writes n+1 blocks, not n+n/2 (unless n=2 for a 3-disk raid). you can have a 15+1 disk raid5 array, for example. however raid1 (and raid10) have to write 2*n blocks to disk.
so if you are talking about pure I/O needed, raid5 wins hands down. (the same 16 drives would be a 8+8 array) what slows down raid 5 is that to modify a block you have to read blocks from all your drives to re-calculate the parity. this interleaving of reads and writes when all you are logically doing is writes can really hurt. (this is why I asked the question that got us off on this tangent: when doing new writes to an array you don't have to read the blocks as they are blank, assuming your caching is enough so that you can write blocksize*n before the system starts actually writing the data) David Lang Alex. On 12/25/05, Michael Stone [EMAIL PROTECTED] wrote: On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: Caches help, and the bigger the cache the better, but once you are doing enough writes fast enough (and that doesn't take much even with a few GBs of cache) the recalculate-checksums-and-write-new-ones overhead will decrease the write speed of real data. Bear in mind that the HD's _raw_ write speed hasn't been decreased. Those HD's are pounding away as fast as they can for you. Your _effective_ or _data level_ write speed is what decreases due to overhead. You're overgeneralizing. Assuming a large cache and a sequential write, there need be no penalty for raid 5. (For random writes you may need to read unrelated blocks in order to calculate parity, but for large sequential writes the parity blocks should all be read from cache.) A modern cpu can calculate parity for raid 5 on the order of gigabytes per second, and even crummy embedded processors can do hundreds of megabytes per second. You may have run into some lousy implementations, but you should be much more specific about what hardware you're talking about instead of making sweeping generalizations. Side Note: people often forget the other big reason to use RAID 10 over RAID 5. RAID 5 is always only 2 HD failures from data loss.
RAID 10 can lose up to 1/2 the HD's in the array w/o data loss unless you get unlucky and lose both members of a RAID 1 set. IOW, your RAID 10 is only 2 HD failures from data loss also. If that's an issue you need to go with RAID 6 or add another disk to each mirror. Mike Stone
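The block-count arithmetic that this exchange settles on (n+1 blocks per RAID 5 stripe versus 2n blocks for RAID 10) can be sketched in a few lines. This is a hedged illustration of the simple model only: it ignores caching and the read-modify-write cycles that, as David Lang notes, dominate in practice.

```python
# Write-amplification model from the thread, for full-stripe writes only:
# RAID 5 writes n data blocks + 1 parity block per stripe of width n;
# RAID 1/10 writes every data block twice (mirroring).

def raid5_blocks_written(data_blocks: int, stripe_width: int) -> int:
    """Blocks hitting disk for full-stripe RAID 5 writes (n data + 1 parity per stripe)."""
    stripes = -(-data_blocks // stripe_width)  # ceiling division
    return data_blocks + stripes               # one parity block per stripe

def raid10_blocks_written(data_blocks: int) -> int:
    """RAID 10 mirrors every block, so 2n blocks hit disk."""
    return 2 * data_blocks

if __name__ == "__main__":
    # 15+1 RAID 5 vs 8+8 RAID 10, as in David Lang's 16-drive example
    data = 1500
    print(raid5_blocks_written(data, 15))  # 1600 blocks
    print(raid10_blocks_written(data))     # 3000 blocks
```

On pure block counts RAID 5 does win, exactly as the thread says; the performance loss comes from the reads interleaved into the write stream, not from the parity blocks themselves.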
Re: [PERFORM] Is RAID10 the best choice?
Personally I would split into two RAID 1s: one for pg_xlog, one for the rest. This gives probably the best performance/reliability combination. Alex. On 12/10/05, Carlos Benkendorf [EMAIL PROTECTED] wrote: Hello, I would like to know which is the best configuration to use 4 SCSI drives with a pg 8.1 server. Configuring them as a RAID10 set seems a good choice but now I'm considering another configuration: SCSI drive 1: operating system SCSI drive 2: pg_xlog SCSI drive 3: data SCSI drive 4: index I know the difference between them when you analyze the risks of losing data, but how about performance? Which should be better? Obs.: Our system always uses an index for every access. (enable_seqscan(false)) Thanks in advance! Benkendorf
Re: [PERFORM] LVM and Postgres
I would argue that you almost certainly won't gain anything by doing that, as you will create a new place even further away for the disk head to seek to, instead of just another file on the same FS that is probably closer to the current head position. Alex On 12/6/05, Michael Stone [EMAIL PROTECTED] wrote: On Tue, Dec 06, 2005 at 09:36:23PM +0000, Rory Campbell-Lange wrote: Yes, we don't have any spare disks unfortunately. We have enabled the BBU write, so we are hoping for good performance. Even if you don't use separate disks you'll probably get better performance by putting the WAL on a separate ext2 partition. xfs gives good performance for the table data, but is not particularly good for the WAL. Mike Stone
Re: [PERFORM] Hardware/OS recommendations for large databases (
Ok - so I ran the same test on my system and get a total speed of 113MB/sec. Why is this? Why is the system so limited to around just 110MB/sec? I tuned read ahead up a bit, and my results improve a bit.. Alex On 11/18/05, Luke Lonergan [EMAIL PROTECTED] wrote: Dave, On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote: Now there's an interesting line drawn in the sand. I presume you have numbers to back this up? This should draw some interesting posts. Part 2: The answer System A: This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel. On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that's 66 MB/s. Not the efficiency I'd hoped for from the onboard SATA controller; I would have expected to get 85% of the 100MB/s raw read performance. So that's $1,200 / 66 MB/s (without adjusting for 2003 price versus now) = 18.2 $/MB/s Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c "select count(*) from ivp.bigtable1" dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout1 count -- 1000 (1 row) real 0m32.565s user 0m0.002s sys 0m0.003s Size of the table data: [EMAIL PROTECTED] IVP]$ du -sk dgtestdb/base 2121648 dgtestdb/base System B: This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. It's running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel. Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting; let's try it with the default (I think) setting of 256KB - AHA! Now we get 171.4 seconds or 99.3MB/s. So, using the tuned setting of blockdev --setra 16384 we get $6,000 / 244MB/s = 24.6 $/MB/s If we use the default Linux setting it's 2.5x worse.
Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c "select count(*) from ivp.bigtable1" dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout3 count -- 8000 (1 row) real 1m9.875s user 0m0.000s sys 0m0.004s [EMAIL PROTECTED] IVP]$ !du du -sk dgtestdb/base 17021260 dgtestdb/base Summary: cough, cough OK - you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here - this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn't go any faster than 244MB/s. - Luke
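Luke's price-performance arithmetic generalizes to a one-liner. A minimal sketch, using the scan times and `du -sk` sizes from his raw data (the dollar figures are his; his 18.2 result comes from rounding the scan rate to 66 MB/s first):

```python
def dollars_per_mb_per_sec(system_cost: float, table_mb: float, scan_seconds: float) -> float:
    """Cost per MB/s of sequential-scan bandwidth, as in Luke's comparison."""
    return system_cost / (table_mb / scan_seconds)

# System A: $1,200, 2,121,648 KB table scanned in 32.565 s
print(round(dollars_per_mb_per_sec(1200, 2121.648, 32.565), 1))   # ~18.4 $/MB/s
# System B: $6,000, 17,021,260 KB table scanned in 69.875 s
print(round(dollars_per_mb_per_sec(6000, 17021.260, 69.875), 1))  # ~24.6 $/MB/s
```

The metric makes the trade-off explicit: System B is nearly 4x faster in absolute terms but costs more per unit of scan bandwidth.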
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/16/05, William Yu [EMAIL PROTECTED] wrote: Alex Turner wrote: Spend a fortune on dual core CPUs and then buy crappy disks... I bet for most applications this system will be IO bound, and you will see a nice lot of drive failures in the first year of operation with consumer grade drives. Spend your money on better Disks, and don't bother with Dual Core IMHO unless you can prove the need for it. I would say the opposite -- you always want Dual Core nowadays. DC Opterons simply give you better bang for the buck than single core Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because those mega-CPU motherboards are astronomically expensive. Opteron 242 - $178.00 Opteron 242 - $178.00 Tyan S2882 - $377.50 Total: $733.50 Opteron 265 - $719.00 Tyan K8E - $169.00 Total: $888.00 The Tyan K8E doesn't have any PCI-X, so it's also not apples to apples. In fact I couldn't find a single-CPU board that did, so you pretty much have to buy a dual CPU board to get PCI-X. 1xDC is _not_ cheaper. Our DB application does about 5 queries/second peak, plus a heavy insert job once per day. We only _need_ two CPUs, which is true for a great many DB applications. Unless you like EJB of course, which will thrash the crap out of your system. Consider the two most used regions for DBs: a) OLTP - probably IO bound, large number of queries/sec updating info on _disks_, not requiring much CPU activity except to retrieve item information which is well indexed and normalized. b) Data warehouse - needs CPU, but probably still IO bound, large data set that won't fit in RAM will require large amounts of disk reads. CPU can easily keep up with disk reads. I have yet to come across a DB system that wasn't IO bound. DC also gives you a better upgrade path. Let's say you do testing and figure 2x246 is the right setup to handle the load.
Well instead of getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a DC/270. Now you have a server that can be upgraded to +80% more CPU by popping in another DC/270 versus throwing out the entire thing to get a 4x1P setup. No argument there. But it's pointless if you are IO bound. The only questions would be: (1) Do you need a SMP server at all? I'd claim yes -- you always need 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other processes from running. At least 2 CPUs is always good for precisely those reasons. More than 2 CPUs gives diminishing returns. (2) Does a DC system perform better than its Nx1P cousin? My experience is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244 and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP, etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups. Maybe true, but the 265 does have a 25% faster FSB than the 244, which might perhaps play a role. (3) Do you need an insane amount of memory? Well here's the case where the more expensive motherboard will serve you better since each CPU slot has its own bank of memory. Spend more money on memory, get cheaper single-core CPUs. Remember - a large DB is going to be IO bound. Memory will get thrashed for file block buffers; even if you have large amounts, it's all gonna be cycled in and out again. Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon DCs, while cheaper than their corresponding single-core SMPs, don't have the same performance profile as Opteron DCs. Basically, you're paying a bit extra so your server can generate a ton more heat. Dell/Xeon/Postgres is just a bad combination any day of the week ;) Alex.
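The price totals Alex quotes against William's claim check out as simple sums; a sketch using the 2005 list prices from the thread (prices are the posters' figures, not current ones):

```python
# 2x single-core Opteron build vs 1x dual-core build, prices from the thread.
dual_single_core = {
    "Opteron 242 (first)":  178.00,
    "Opteron 242 (second)": 178.00,
    "Tyan S2882":           377.50,
}
one_dual_core = {
    "Opteron 265": 719.00,
    "Tyan K8E":    169.00,
}

print(sum(dual_single_core.values()))  # 733.5 -> 2x1P total
print(sum(one_dual_core.values()))     # 888.0 -> 1xDC total
```

Which supports Alex's point at these particular price points: the single dual-core build is about $155 more, before even considering the PCI-X mismatch he raises.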
Re: [PERFORM] Hardware/OS recommendations for large databases (
Just pick up a SCSI drive and a consumer ATA drive. Feel their weight. You don't have to look inside to tell the difference. Alex On 11/16/05, David Boreham [EMAIL PROTECTED] wrote: I suggest you read this on the difference between enterprise/SCSI and desktop/IDE drives: http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf This is exactly the kind of vendor propaganda I was talking about, and it proves my point quite well: there's nothing specific relating to reliability that is different between SCSI and SATA drives cited in that paper. It does have a bunch of FUD such as 'oh yeah, we do a lot more drive characterization during manufacturing'.
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/16/05, Joshua D. Drake [EMAIL PROTECTED] wrote: The only questions would be: (1) Do you need a SMP server at all? I'd claim yes -- you always need 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other processes from running. I would back this up. Even for smaller installations (single RAID 1, 1 gig of RAM). Why? Well because many applications are going to be CPU bound. For example we have a PHP application that is a CMS. On a single CPU machine, RAID 1, it takes about 300ms to deliver a single page, point to point. We are not IO bound. So what happens is that under reasonable load we are actually waiting for the CPU to process the code. This is the performance profile for PHP, not for PostgreSQL. This is the PostgreSQL mailing list. A simple upgrade to an SMP machine literally doubles our performance because we are still not IO bound. I strongly suggest that everyone use at least a single dual core because of this experience. Performance of PHP, not PostgreSQL. (3) Do you need an insane amount of memory? Well here's the case where the more expensive motherboard will serve you better since each CPU slot has its own bank of memory. Spend more money on memory, get cheaper single-core CPUs. Agreed. A lot of times the slowest dual-core is 5x what you actually need. So get the slowest, and bulk up on memory. If nothing else memory is cheap today and it might not be tomorrow. [snip] Running PostgreSQL on a single-drive RAID 1 with PHP on the same machine is not a typical installation. 300ms for PHP in CPU time? wow dude - that's quite a page. PHP can typically handle up to 30-50 pages per second for a typical OLTP application on a single CPU box. Something is really wrong with that system if it takes 300ms per page. Alex.
Re: [PERFORM] Hardware/OS recommendations for large databases
Yes - that very benchmark shows that for a MySQL data drive in RAID 10, the 3ware controllers beat the Areca card. Alex. On 11/16/05, Ron [EMAIL PROTECTED] wrote: Got some hard numbers to back your statement up? IME, the Areca 1160's with >= 1GB of cache beat any other commodity RAID controller. This seems to be in agreement with at least one independent testing source: http://print.tweakers.net/?reviews/557 RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any commodity HW solution, but their price point is considerably higher. ...on another note, I completely agree with the poster who says we need more cache on RAID controllers. We should all be beating on the RAID HW manufacturers to use standard DIMMs for their caches and to provide 2 standard DIMM slots in their full-height cards (allowing for up to 8GB of cache using 2 4GB DIMMs as of this writing). It should also be noted that 64-drive chassis are going to become possible once 2.5 inch 10Krpm SATA II and FC HDs become the standard next year (48's are the TOTL now). We need controller technology to keep up. Ron At 12:16 AM 11/16/2005, Alex Turner wrote: Not at random access in RAID 10 they aren't, and anyone with their head screwed on right is using RAID 10. The 9500S will still beat the Areca cards at the RAID 10 database access pattern. Alex. On 11/15/05, Dave Cramer [EMAIL PROTECTED] wrote: Luke, Have you tried the Areca cards? They are slightly faster yet. Dave On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: I agree - you can get a very good one from www.acmemicro.com or www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read performance on these (with tuning) on Linux using the xfs filesystem, which is one of the most critical factors for large databases.
Note that you want to have your DBMS use all of the CPU and disk channel bandwidth you have on each query, which takes a parallel database like Bizgres MPP to achieve. Regards,
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/15/05, Luke Lonergan [EMAIL PROTECTED] wrote: Adam, -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Claus Guttesen Sent: Tuesday, November 15, 2005 12:29 AM To: Adam Weisberg Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Hardware/OS recommendations for large databases ( 5TB) Does anyone have recommendations for hardware and/or OS to work with around 5TB datasets? Hardware-wise I'd say dual core Opterons. One dual-core Opteron performs better than two single-cores at the same speed. Tyan makes some boards that have four sockets, thereby giving you 8 CPUs (if you need that many). Sun and HP also make nice hardware, although the Tyan board is more competitively priced. OS-wise I would choose the FreeBSD amd64 port, but partitions larger than 2 TB need some special care (using gpt rather than disklabel etc.); tools like fsck may not be able to completely check partitions larger than 2 TB. Linux or Solaris with either LVM or Veritas FS sound like candidates. I agree - you can get a very good one from www.acmemicro.com or www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read performance on these (with tuning) on Linux using the xfs filesystem, which is one of the most critical factors for large databases. Spend a fortune on dual core CPUs and then buy crappy disks... I bet for most applications this system will be IO bound, and you will see a nice lot of drive failures in the first year of operation with consumer grade drives. Spend your money on better Disks, and don't bother with Dual Core IMHO unless you can prove the need for it. Alex
Re: [PERFORM] Hardware/OS recommendations for large databases (
Not at random access in RAID 10 they aren't, and anyone with their head screwed on right is using RAID 10. The 9500S will still beat the Areca cards at the RAID 10 database access pattern. Alex. On 11/15/05, Dave Cramer [EMAIL PROTECTED] wrote: Luke, Have you tried the Areca cards? They are slightly faster yet. Dave On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: I agree - you can get a very good one from www.acmemicro.com or www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read performance on these (with tuning) on Linux using the xfs filesystem, which is one of the most critical factors for large databases. Note that you want to have your DBMS use all of the CPU and disk channel bandwidth you have on each query, which takes a parallel database like Bizgres MPP to achieve. Regards,
Re: [PERFORM] Sort performance on large tables
We use this memory in all our servers (well - the 512 sticks). 0 problems to date: http://www.newegg.com/Product/Product.asp?Item=N82E16820145513 $163 for 1GB. This stuff is probably better than the Samsung RAM Dell is selling you for 3 times the price. Alex On 11/10/05, Ron Peacetree [EMAIL PROTECTED] wrote: My original post did not take into account VAT, I apologize for that oversight. However, unless you are naive, or made of gold, or have some sort of special relationship that requires you to, _NEVER_ buy RAM from your computer HW OEM. For at least two decades it's been a provable fact that OEMs like DEC, Sun, HP, Compaq, Dell, etc, etc charge far more per GB for the RAM they sell. Same goes for HDs. Buy your memory and HDs direct from reputable manufacturers, you'll get at least the same quality and pay considerably less. Your Dell example is evidence that supports my point. As of this writing, decent RAM should cost $75-$150 per GB (not including VAT ;-) ). Don't let yourself be conned into paying more. I'm talking about decent RAM from reputable direct suppliers like Corsair and Kingston (_not_ their Value RAM, the actual Kingston branded stuff), OCZ, etc. Such companies sell via multiple channels, including reputable websites like dealtime.com, pricewatch.com, newegg.com, etc, etc. You are quite correct that there's poor quality junk out there. I was not talking about it, only reasonable quality components. Ron -Original Message- From: Kurt De Grave [EMAIL PROTECTED] Sent: Nov 10, 2005 5:40 AM To: Ron Peacetree [EMAIL PROTECTED] Cc: Charlie Savage [EMAIL PROTECTED], pgsql-performance@postgresql.org Subject: Re: [PERFORM] Sort performance on large tables On Wed, 9 Nov 2005, Ron Peacetree wrote: At this writing, 4 1GB DIMMs (4GB) should set you back ~$300 or less. 4 2GB DIMMs (8GB) should cost ~$600. As of now, very few mainboards support 4GB DIMMs and I doubt the D3000 has such a mainboard.
If you can use them, 4 4GB DIMMs (16GB) will currently set you back ~$1600-$2400. Sorry, but every time the buy-more-memory argument passes by, I see unrealistic memory prices quoted. What kind of memory are you buying for your servers? Non-ECC no-name memory that doesn't even pass a one-hour memtest86 for 20% of the items you buy? Just checked at Dell's web page: adding 4 1GB DIMMs to a PowerEdge 2850 sets you back _1280 EURO_ excluding VAT. And that's after they already charged you 140 euro for replacing the obsolete standard 4 512MB DIMMs with the same capacity in 1GB DIMMs. So the 4GB upgrade actually costs 1420 euro plus VAT, which is quite a bit more than $300. Okay, few people will happily buy at those prices. You can get the exact same goods much cheaper elsewhere, but it'll still cost you way more than the number you gave, plus you'll have to drive to the server's location, open up the box yourself, and risk incompatibilities and support problems if there's ever something wrong with that memory. Disclaimers: I know that you're talking about a desktop in this particular case. I wouldn't see a need for ECC in a development box either. I know a Dell hasn't been the smartest choice for a database box lately (but politics...). kurt.
Re: [PERFORM] Performance PG 8.0 on dual opteron / 4GB / 3ware Raid5 / Debian??
Where are the pg_xlog and data directories with respect to each other? From this iostat it looks like they might be on the same partition, which is not ideal, and it's actually surprising that throughput is this good. You need to separate the pg_xlog and data directories to get any kind of reasonable performance. Also don't use RAID 5 - RAID 5 bites, no really - it bites. Use multiple RAID 1s, or RAID 10s; you will get better performance. 50MB/70MB is about the same as you get from a single disk or a RAID 1. We use 2x 9506S8MI controllers, and have maintained excellent performance with 2x RAID 10 and 2x RAID 1. Make sure you get the firmware update if you have these controllers though. Alex Turner NetEconomist On 11/6/05, Joost Kraaijeveld [EMAIL PROTECTED] wrote: Hi, I am experiencing very long update queries and I want to know if it is reasonable to expect them to perform better. The query below has been running for more than 1.5 hours (5500 seconds) now, while the rest of the system does nothing (I don't even type or move a mouse...). - Is that to be expected? - Is 180-200 tps with ~9000 KB (see output of iostat below) not low, given the fact that fsync is off? (Note: with bonnie++ I get write performance of 50 MB/sec and read performance of 70 MB/sec, with 2000 read/write ops/sec.) - Does anyone else have any experience with the 3Ware RAID controller (which is my suspect)? - Any good idea how to determine the real bottleneck if this is not the performance I can expect?
My hard- and software: - PostgreSQL 8.0.3 - Debian 3.1 (Sarge) AMD64 - Dual Opteron - 4GB RAM - 3ware RAID 5 with 5 disks Pieces of my postgresql.conf (all else is default): shared_buffers = 7500 work_mem = 260096 fsync=false effective_cache_size = 32768 The query with explain (amount and orderbedrag_valuta are float8, ordernummer and ordernumber int4): explain update prototype.orders set amount = odbc.orders.orderbedrag_valuta from odbc.orders where ordernumber = odbc.orders.ordernummer; QUERY PLAN - Hash Join (cost=50994.74..230038.17 rows=1104379 width=466) Hash Cond: (outer.ordernumber = inner.ordernummer) - Seq Scan on orders (cost=0.00..105360.68 rows=3991868 width=455) - Hash (cost=48233.79..48233.79 rows=1104379 width=15) - Seq Scan on orders (cost=0.00..48233.79 rows=1104379 width=15) Sample output from iostat during the query (about average): Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn hdc 0.00 0.00 0.00 0 0 sda 0.00 0.00 0.00 0 0 sdb 187.13 23.76 8764.36 24 8852 -- Regards, Joost Kraaijeveld Askesis B.V. Molukkenstraat 14 6524NB Nijmegen tel: 024-3888063 / 06-51855277 fax: 024-3608416 e-mail: [EMAIL PROTECTED] web: www.askesis.nl
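One quick sanity check on the iostat sample above: dividing write throughput by tps shows how much data each I/O transaction moves, which hints at whether the workload is small random writes (where RAID 5 hurts most) rather than sequential streaming. A rough back-of-envelope, assuming the sampled figures are representative:

```python
def kb_per_transaction(kb_written_per_sec: float, tps: float) -> float:
    """Average KB written per I/O transaction, from iostat's tps and kB_wrtn/s columns."""
    return kb_written_per_sec / tps

# From the iostat sample for sdb: 187.13 tps at 8764.36 kB/s written
print(round(kb_per_transaction(8764.36, 187.13), 1))  # ~46.8 KB per op
```

At roughly 47 KB per operation these are not large sequential writes, consistent with the diagnosis in the reply: a big hash-join update scattering small writes across a RAID 5 array, with pg_xlog competing for the same spindles.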
Re: [PERFORM] browsing table with 2 million records
You could also create your own index, so to speak: a table that simply contains a list of primary keys and an order value field that you can use as your offset. This can be kept in sync with the master table using triggers pretty easily. 2 million is not very much; if you only have an integer pkey and an integer order value, then you can join it against the main table.

create table my_index_table ( primary_key_value int, order_val int, primary key (primary_key_value));
create index my_index_table_order_val_i on my_index_table (order_val);
select * from main_table a, my_index_table b where b.order_val >= 25 and b.order_val < 50 and a.primary_key_value = b.primary_key_value;

If the data updates a lot then this won't work as well though, as the index table will require frequent updates to a potentially large number of records (although a small number of pages, so it still won't be horrible).

Alex Turner NetEconomist

On 10/26/05, Joshua D. Drake [EMAIL PROTECTED] wrote: We have a GUI that lets users browse through the records page by page at about 25 records at a time. (Don't ask me why but we have to have this GUI). This translates to something like select count(*) from table -- to give feedback about the DB size Do you have an integer field that is an ID that increments? E.g. serial? select * from table order by date limit 25 offset 0 You could use a cursor. Sincerely, Joshua D. Drake Tables seem properly indexed, with vacuum and analyze run regularly. Still these very basic SQLs take up to a minute to run. I read some recent messages saying that select count(*) needs a table scan in Postgres. That's disappointing. But I can accept an approximation if there is some way to do so. But how can I optimize select * from table order by date limit x offset y? One minute response time is not acceptable. Any help would be appreciated. Wy -- The PostgreSQL Company - Command Prompt, Inc.
1.503.667.4564 PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 6: explain analyze is your friend
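The trigger-based sync mentioned above can be sketched in SQL. All table and column names here are illustrative (the original post gives no real schema), and only the insert path is shown; deletes and reordering need similar handling:

```sql
-- Slim "index table" of (pkey, position), kept alongside the master table.
create table my_index_table (
    primary_key_value int primary key,
    order_val         int
);
create index my_index_table_order_val_i on my_index_table (order_val);

create sequence my_index_seq;

-- Keep it in sync on insert; deriving order_val is application-specific
-- (a sequence works if rows only ever arrive in display order).
create or replace function sync_index_table() returns trigger as $$
begin
    insert into my_index_table
        values (new.id, nextval('my_index_seq'));
    return new;
end;
$$ language plpgsql;

create trigger main_table_sync after insert on main_table
    for each row execute procedure sync_index_table();

-- Page 2 (rows 25-49) then becomes a cheap range scan instead of OFFSET 25:
select a.*
  from main_table a
  join my_index_table b on a.id = b.primary_key_value
 where b.order_val >= 25 and b.order_val < 50;
```

The point of the range predicate is that OFFSET must still read and discard all the skipped rows, while the index on order_val jumps straight to the page.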
Re: [PERFORM] What gets cached?
Just to play devil's advocate here for a second: if we have an algorithm that is substantially better than just plain old LRU, which is what I believe the kernel is going to use to cache pages (I'm no kernel hacker), then why don't we apply that and have a significantly larger page cache a la Oracle?

Alex

On 10/21/05, Neil Conway [EMAIL PROTECTED] wrote: On Fri, 2005-21-10 at 07:34 -0500, Martin Nickel wrote: Let's say I do the same thing in Postgres. I'm likely to have my very fastest performance for the first few queries until memory gets filled up. No, you're not: if a query doesn't hit the cache (both the OS cache and the Postgres userspace cache), it will run slower. If the caches are empty when Postgres starts up (which is true for the userspace cache and might be true of the OS cache), the first queries that are run should be slower, not faster. The only time Postgres seems to take advantage of cached data is when I repeat the same (or substantially the same) query. Caching is done on a page-by-page basis -- the source text of the query itself is not relevant. If two different queries happen to hit a similar set of pages, they will probably both benefit from the same set of cached pages. I don't know of any way to view what is actually cached at any point in time, but it seems like most recently used rather than most frequently used. The cache replacement policy in 7.4 and older releases is simple LRU. The policy in 8.0 is ARC (essentially a version of LRU modified to try to retain hot pages more accurately). The policy in 8.1 is a clock-based algorithm. -Neil
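The clock-based replacement Neil mentions for 8.1 can be sketched in a few lines. This is a toy model, not PostgreSQL's actual bufmgr code: each buffer carries a usage count that is bumped on access and decremented as the clock hand sweeps, and a buffer is evicted only once its count hits zero.

```python
class ClockCache:
    """Toy clock-sweep buffer cache (illustrative, not PostgreSQL's bufmgr)."""

    def __init__(self, nbuffers):
        self.pages = [None] * nbuffers   # page id held in each buffer slot
        self.usage = [0] * nbuffers      # usage count per slot
        self.hand = 0                    # clock hand position

    def access(self, page):
        """Touch a page; return True on cache hit, False on miss."""
        if page in self.pages:           # hit: bump the usage count
            self.usage[self.pages.index(page)] += 1
            return True
        # Miss: sweep, decrementing counts, until a zero-usage slot appears.
        while self.usage[self.hand] > 0:
            self.usage[self.hand] -= 1
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page     # evict the cold buffer and load
        self.usage[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.pages)
        return False
```

A frequently touched page keeps a high usage count and survives several sweeps, which is exactly the "retain hot pages" behavior that distinguishes this from plain LRU.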
Re: [PERFORM] Is There Any Way ....
This is possible with Oracle utilizing the keep pool: alter table t_name storage ( buffer_pool keep ); If Postgres were to implement its own caching system, this seems like it would be easy to implement (beyond the initial caching effort).

Alex

On 10/24/05, Craig A. James [EMAIL PROTECTED] wrote: Jim C. Nasby jnasby ( at ) pervasive ( dot ) com wrote: Stefan Weiss wrote: ... IMO it would be useful to have a way to tell PG that some tables were needed frequently, and should be cached if possible. This would allow application developers to consider joins with these tables as cheap, even when querying on columns that are not indexed. Why do you think you'll know better than the database how frequently something is used? At best, your guess will be correct and PostgreSQL (or the kernel) will keep the table in memory. Or, your guess is wrong and you end up wasting memory that could have been used for something else. It would probably be better if you describe why you want to force this table (or tables) into memory, so we can point you at more appropriate solutions. Or perhaps we could explain why we NEED to force these tables into memory, so we can point you at a more appropriate implementation. ;-) Ok, wittiness aside, here's a concrete example. I have an application with one critical index that MUST remain in memory at all times. The index's tablespace is about 2 GB. As long as it's in memory, performance is excellent - a user's query takes a fraction of a second. But if it gets swapped out, the user's query might take up to five minutes as the index is re-read from disk. Now here's the rub. The only performance I care about is response to queries from the web application. Everything else is low priority. But there is other activity going on. Suppose, for example, that I'm updating tables, performing queries, doing administration, etc., etc., for a period of an hour, during which no customer visits the site. Then another customer comes along and performs a query.
At this point, no heuristic in the world could have guessed that I DON'T CARE ABOUT PERFORMANCE for anything except my web application. The performance of all the other stuff, the administration, the updates, etc., is utterly irrelevant compared to the performance of the customer's query. What actually happens is that the other activities have swapped out the critical index, and my customer waits, and waits, and waits... and goes away after a minute or two. To solve this, we've been forced to purchase two computers, and mirror the database on both. All administration and modification happens on the offline database, and the web application only uses the online database. At some point, we swap the two servers, sync the two databases, and carry on. It's a very unsatisfactory solution. There is ONLY one way to convey this sort of information to Postgres, which is to provide the application developer a mechanism to explicitly name the tables that should be locked in memory. Look at tsearchd that Oleg is working on. It's a direct response to this problem. It's been recognized for decades that, as kernel developers (whether a Linux kernel or a database kernel), our ability to predict the behavior of an application is woefully inadequate compared with the application developer's knowledge of the application. Computer Science simply isn't a match for the human brain yet, not even close. To give you perspective, since I posted a question about this problem (regarding tsearch2/GIST indexes), half of the responses I received told me that they had encountered this problem, and their solution was to use an external full-text engine. They all confirmed that Postgres can't deal with this problem yet, primarily for the reasons outlined above.
Craig ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] What gets cached?
Oracle uses an LRU caching algorithm also, not LFU.

Alex

On 10/21/05, Martin Nickel [EMAIL PROTECTED] wrote: I was reading a comment in another posting and it started me thinking about this. Let's say I start up an Oracle server. All my queries are a little bit (sometimes a lot) slow until it gets its normal things in memory, then it's up to speed. The normal things would include some small lookup tables and the indexes for the most frequently used tables. Let's say I do the same thing in Postgres. I'm likely to have my very fastest performance for the first few queries until memory gets filled up. The only time Postgres seems to take advantage of cached data is when I repeat the same (or substantially the same) query. I don't know of any way to view what is actually cached at any point in time, but it seems like most recently used rather than most frequently used. Does this seem true? ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Used Memory
[snip] to the second processor in my dual Xeon eServer) has got me to the point that the perpetually high memory usage doesn't affect my application server. I'm curious - how does the high memory usage affect your application server? Alex
Re: [PERFORM] Performance on SUSE w/ reiserfs
Realise also that unless you are running the 1.5 x86-64 build, Java will not use more than 1 gig, and if the app server requests more than 1 gig, Java will die (I've been there) with an out of memory error, even though there is plenty of free mem available. This can easily be caused by a lazy GC thread if the application is running high on CPU usage. The kernel will not report memory used for caching pages as being unavailable; if a program calls malloc, the kernel will just swap out the oldest disk page and give the memory to the application. Your free -mo shows 3 gig free even with cached disk pages. It looks to me more like either a Java problem or a kernel problem...

Alex Turner NetEconomist

On 10/10/05, Jon Brisbin [EMAIL PROTECTED] wrote: Tom Lane wrote: Are you sure it's not cached data pages, rather than cached inodes? If so, the above behavior is *good*. People often have a mistaken notion that having near-zero free RAM means they have a problem. In point of fact, that is the way it is supposed to be (at least on Unix-like systems). This is just a reflection of the kernel doing what it is supposed to do, which is to use all spare RAM for caching recently accessed disk pages. If you're not swapping then you do not have a problem. Except for the fact that my Java app server crashes when all the available memory is being used by caching and not reclaimed :-) If it wasn't for the app server going down, I probably wouldn't care. -- Jon Brisbin Webmaster NPC International, Inc.
Re: [PERFORM] Performance on SUSE w/ reiserfs
Well - to each his own I guess - we did extensive testing on 1.4, and it refused to allocate much past 1 gig on both Linux x86/x86-64 and Windows.

Alex

On 10/11/05, Alan Stange [EMAIL PROTECTED] wrote: Alex Turner wrote: Perhaps this is true for 1.5 on x86-32 (I've only used it on x86-64) but I was more thinking 1.4 which many folks are still using. The 1.4.x JVMs will also work just fine with much more than 1GB of memory. Perhaps you'd like to try again? -- Alan On 10/11/05, *Alan Stange* [EMAIL PROTECTED] wrote: Alex Turner wrote: Realise also that unless you are running the 1.5 x86-64 build, java will not use more than 1Gig, and if the app server requests more than 1gig, Java will die (I've been there) with an out of memory error, even though there is plenty of free mem available. This can easily be caused by a lazy GC thread if the application is running high on CPU usage. On my side of Planet Earth, the standard non-x64 1.5 JVM will happily use more than 1G of memory (on Linux and Solaris, can't speak for Windows). If you're running larger programs, it's probably a good idea to use the -server compiler in the JVM as well. I regularly run with -Xmx1800m and regularly have 1GB heap sizes. The standard GC will not cause an OOM error if space remains for the requested object. The GC thread blocks all other threads during its activity, whatever else is happening on the machine. The newer/experimental GCs did have some potential race conditions, but I believe those have been resolved in the 1.5 JVMs. Finally, note that the latest _05 release of the 1.5 JVM also now supports large page sizes on Linux and Windows: -XX:+UseLargePages; this can be quite beneficial depending on the memory patterns in your programs. -- Alan
Re: [PERFORM] Indexes on ramdisk
Talk about your IO system a bit. There might be obvious ways to improve. What system/motherboard are you using? What controller cards are you using? What kind of disks do you have (SATA, SCSI, 7.2k 10k 15k)? What denominations (9, 18, 36, 72, 143, 80, 160, 200, 240 gig)? What kind of RAIDs do you have set up (how many drives, what stripe sizes, how many used for what)? What levels of RAID are you using (0, 1, 10, 5, 50)?

With a good setup, a dual PCI-X bus motherboard can hit 2GB/sec and thousands of transactions to disk if you have controllers/disks that can keep up. That is typically enough for most people without resorting to SSD.

Alex Turner NetEconomist

On 10/4/05, Emil Briggs [EMAIL PROTECTED] wrote: I have an application that has a table that is both read and write intensive. Data from iostat indicates that the write speed of the system is the factor that is limiting performance. The table has around 20 columns and most of the columns are indexed. The data and the indices for the table are distributed over several mirrored disk partitions and pg_xlog is on another. I'm looking at ways to improve performance and besides the obvious one of getting an SSD I thought about putting the indices on a ramdisk. That means that after a power failure or shutdown I would have to recreate them but that is acceptable for this application. What I am wondering though is whether or not I would see much performance benefit and if there would be any startup problems after a power down event due to the indices not being present. Any insight would be appreciated. Emil
Re: [PERFORM] RAID Stripe size
I have benched different stripe sizes with different file systems, and the performance differences can be quite dramatic. Theoretically a smaller stripe is better for OLTP, as you can write more small transactions independently to more different disks more often than not, but a large stripe size is good for data warehousing, as you are often doing very large sequential reads and a larger stripe size is going to exploit the on-drive cache as you request larger single chunks from the disk at a time. It also seems that different controllers are partial to different defaults that can affect their performance, so I would suggest that testing this on two different controller cards may be less than optimal. I would also recommend looking at the file system. For us JFS worked significantly faster than reiser for large read loads and large write loads, so we chose JFS over ext3 and reiser. I found that lower stripe sizes impacted performance badly, as did overly large stripe sizes.

Alex Turner NetEconomist

On 16 Sep 2005 04:51:43 -0700, bmmbn [EMAIL PROTECTED] wrote: Hi everyone. The machine is an IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k disks. 2 disks are in RAID1 and hold the OS, swap and pg_xlog. 4 disks are in RAID10 and hold the cluster itself. The DB will have two major tables, one with 10 million rows and one with 100 million rows. All the activity against these tables will be SELECT. Currently the stripe size is 8k. I read in many places this is a poor setting. Am I right? ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] RAID Stripe size
I have found JFS to be just fine. We have been running a medium load on this server for 9 months with no unscheduled downtime. The database is about 30 gig on disk, and we get about 3-4 requests per second that generate result sets in the thousands from about 8am to about 11pm. I have found that JFS barfs if you put a million files in a directory and try to do an 'ls', but then so did reiser; only ext3 handled this test successfully. Fortunately with a database this is an atypical situation, so JFS has been fine for the DB for us so far. We have had severe problems with ext3 when file systems hit 100% usage - they get all kinds of unhappy - but we haven't had the same problem with JFS.

Alex Turner NetEconomist

On 9/20/05, Welty, Richard [EMAIL PROTECTED] wrote: Alex Turner wrote: I would also recommend looking at the file system. For us JFS worked significantly faster than reiser for large read loads and large write loads, so we chose JFS over ext3 and reiser. has jfs been reliable for you? there seems to be a lot of conjecture about instability, but i find jfs a potentially attractive alternative for a number of reasons. richard
Re: [PERFORM] Need for speed 2
I have found that while the OS may flush to the controller fast with fsync=true, the controller does as it pleases (it has a BBU, so I'm not too worried), so you get great performance because your controller is determining the read/write sequence outside of what is being demanded by an fsync.

Alex Turner NetEconomist

On 8/25/05, Kelly Burkhart [EMAIL PROTECTED] wrote: On Thu, 2005-08-25 at 11:16 -0400, Ron wrote: # - Settings - fsync = false # turns forced synchronization on or off #wal_sync_method = fsync # the default varies across platforms: # fsync, fdatasync, open_sync, or I hope you have a battery backed write buffer! A battery backed write buffer will do nothing here, because the OS is taking its sweet time flushing to the controller's battery backed write buffer! Isn't the reason for battery backed controller cache to make fsync()s fast? -K
Re: [PERFORM] Performance considerations for very heavy INSERT traffic
Split your system into multiple partitions of RAID 10s. For max performance: a ten drive RAID 10 for pg_xlog (this will max out a PCI-X bus) on bus A, and multiple 4/6 drive RAID 10s for tablespaces on bus B. For max performance I would recommend using one RAID 10 for raw data tables, one for aggregate tables and one for indexes. More RAM will only help you with queries against your data; if you are pre-aggregating, then you may not need all that much RAM. You can easily get 100 transactions per second with even less hardware with a little data partitioning. Choose your controller carefully as many don't co-operate with Linux well.

Alex Turner NetEconomist

On 9/12/05, Brandon Black [EMAIL PROTECTED] wrote: I'm in the process of developing an application which uses PostgreSQL for data storage. Our database traffic is very atypical, and as a result it has been rather challenging to figure out how to best tune PostgreSQL on what development hardware we have, as well as to figure out exactly what we should be evaluating and eventually buying for production hardware. The vast, overwhelming majority of our database traffic is pretty much a non-stop stream of INSERTs filling up tables. It is akin to data acquisition. Several thousand clients are sending once-per-minute updates full of timestamped numerical data at our central server, which in turn performs INSERTs into several distinct tables as part of the transaction for that client. We're talking on the order of ~100 transactions per second, each containing INSERTs to multiple tables (which contain only integer and floating point columns and a timestamp column - the primary key (and only index) is on a unique integer ID for the client and the timestamp). The transaction load is spread evenly over time by having the clients send their per-minute updates at random times rather than on the exact minute mark.
There will of course be users using a web-based GUI to extract data from these tables and display them in graphs and whatnot, but the SELECT query traffic will always be considerably less frequent and intensive than the incessant INSERTs, and it's not that big a deal if the large queries take a little while to run. This data also expires - rows with timestamps older than X days will be DELETEd periodically (once an hour or faster), such that the tables will reach a relatively stable size (pg_autovacuum is handling vacuuming for now, but considering our case, we're thinking of killing pg_autovacuum in favor of having the periodic DELETE process also do a vacuum of affected tables right after the DELETE, and then have it vacuum the other low traffic tables once a day while it's at it). There is an aggregation layer in place which proxies the inbound data from the clients into a small(er) number of persistent postgresql backend processes. Right now we're doing one aggregator per 128 clients (so instead of 128 separate database connections over the course of a minute for a small transaction each, there is a single database backend that is constantly committing transactions at a rate of ~2/second). At a test load of ~1,000 clients, we would have 8 aggregators running and 8 postgresql backends. Testing has seemed to indicate we should aggregate even harder - the planned production load is ~5,000 clients initially, but will grow to almost double that in the not-too-distant future, and that would mean ~40 backends at 128 clients each initially. Even on 8 CPUs, I'm betting 40 concurrent backends doing 2 tps is much worse off than 10 backends doing 8 tps. Test hardware right now is a dual Opteron with 4G of ram, which we've barely gotten 1,000 clients running against.
Current disk hardware in testing is whatever we could scrape together (4x 3ware PCI hardware RAID controllers, with 8 SATA drives in a RAID10 array off of each - aggregated up in a 4-way stripe with the linux md driver and then formatted as ext3 with an appropriate stride parameter and a data= mount option). Production will hopefully be a 4-8-way Opteron, 16 or more G of RAM, and a fibre channel hardware raid array or two (~1TB available RAID10 storage) with 15krpm disks and battery-backed write cache. I know I haven't provided a whole lot of application-level detail here, but does anyone have any general advice on tweaking postgresql to deal with a very heavy load of concurrent and almost exclusively write-only transactions? Increasing shared_buffers seems to always help, even out to half of the dev box's ram (2G). A 100ms commit_delay seemed to help, but tuning it (and commit_siblings) has been difficult. We're using 8.0 with the default 8k blocksize, but are strongly considering both developing against 8.1 (seems it might handle the heavy concurrency better), and re-compiling with 32k blocksize since our storage arrays will inevitably be using fairly wide stripes. Any advice on any of this (other than drop the project while you're still a little bit sane)? --Brandon
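The knobs Brandon mentions can be collected into a postgresql.conf fragment. These values are a hypothetical starting point for a write-heavy 8.0 box with 4GB RAM, not recommendations - every one of them should be benchmarked against the actual workload:

```
# Hypothetical starting values for a heavy-INSERT workload (PostgreSQL 8.0):
shared_buffers      = 25000     # ~200MB at 8k pages; the poster saw gains
                                # all the way out to half of RAM
wal_buffers         = 64        # larger WAL buffers absorb commit bursts
checkpoint_segments = 64        # spread checkpoint I/O out under a
                                # constant insert stream
commit_delay        = 100000    # microseconds; the 100ms tried above
commit_siblings     = 5         # only delay when other backends are
                                # also about to commit
```

commit_delay and commit_siblings together let several concurrent small transactions share one WAL flush, which is exactly the pattern of many aggregator backends committing ~2/second each.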
Re: [PERFORM] sustained update load of 1-2k/sec
I have managed tx speeds that high from postgresql, going even as high as 2500/sec for small tables, but it does require a good RAID controller card (yes, I'm even running with fsync on). I'm using a 3ware 9500S-8MI with Raptor drives in multiple RAID 10s. The box wasn't too $$$ at just around $7k. I have two independent controllers on two independent PCI buses to give max throughput, one with a 6 drive RAID 10 and the other with two 4 drive RAID 10s.

Alex Turner NetEconomist

On 8/19/05, Mark Cotner [EMAIL PROTECTED] wrote: Hi all, I bet you get tired of the same ole questions over and over. I'm currently working on an application that will poll thousands of cable modems per minute and I would like to use PostgreSQL to maintain state between polls of each device. This requires a very heavy amount of updates in place on a reasonably large table (100k-500k rows, ~7 columns, mostly integers/bigint). Each row will be refreshed every 15 minutes, or at least that's how fast I can poll via SNMP. I hope I can tune the DB to keep up. The app is threaded and will likely have well over 100 concurrent db connections. Temp tables for storage aren't a preferred option since this is designed to be a shared-nothing approach and I will likely have several polling processes. Here are some of my assumptions so far . . . HUGE WAL. Vacuum hourly if not more often. I'm getting 1700 tx/sec from MySQL and I would REALLY prefer to use PG. I don't need to match the number, just get close. Is there a global temp table option? In-memory tables would be very beneficial in this case. I could just flush it to disk occasionally with an insert into blah select from memory table. Any help or creative alternatives would be greatly appreciated. :) 'njoy, Mark -- Writing software requires an intelligent person, creating functional art requires an artist. -- Unknown
Re: [PERFORM] Need for speed
Are you calculating aggregates, and if so, how are you doing it? (I ask the question from experience with a similar application where I found that my aggregating PL/pgSQL triggers were bogging the system down, and changed them to scheduled jobs instead.)

Alex Turner NetEconomist

On 8/16/05, Ulrich Wisser [EMAIL PROTECTED] wrote: Hello, one of our services is click counting for online advertising. We do this by importing Apache log files every five minutes. This results in a lot of insert and delete statements. At the same time our customers shall be able to do online reporting. We have a box with Linux Fedora Core 3, Postgres 7.4.2, Intel(R) Pentium(R) 4 CPU 2.40GHz, 2 SCSI 76GB disks (15,000RPM, 2ms). I did put pg_xlog on another file system on other discs. Still when several users are online the reporting gets very slow. Queries can take more than 2 min. I need some ideas how to improve performance by some orders of magnitude. I already thought of a box with the whole database on a ram disc. So really any idea is welcome. Ulrich -- Ulrich Wisser / System Developer RELEVANT TRAFFIC SWEDEN AB, Riddarg 17A, SE-114 57 Sthlm, Sweden Direct (+46)86789755 || Cell (+46)704467893 || Fax (+46)86789769 http://www.relevanttraffic.com
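The trigger-to-scheduled-job switch described above can be sketched in SQL. All names here are hypothetical (the post gives no schema): instead of a per-row trigger updating a totals table on every insert, a job run from cron every five minutes (matching the log-import interval) recomputes the recent buckets in one pass:

```sql
-- Hypothetical schema: clicks(click_time, ad_id) raw rows,
-- click_totals(bucket, ad_id, clicks) the reporting aggregate.
begin;

-- Throw away and rebuild only the recent window, so old buckets
-- are never touched.
delete from click_totals
 where bucket >= date_trunc('hour', now()) - interval '1 hour';

insert into click_totals (bucket, ad_id, clicks)
select date_trunc('hour', click_time), ad_id, count(*)
  from clicks
 where click_time >= date_trunc('hour', now()) - interval '1 hour'
 group by 1, 2;

commit;
```

One set-based GROUP BY every few minutes does the same work as thousands of per-row trigger firings, without serializing every insert behind an update to the same totals row.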
Re: [PERFORM] Cheap RAM disk?
Also it seems pretty silly to put it on a regular SATA connection, when all that can manage is 150MB/sec. If you made it connect directly to 66MHz/64-bit PCI then it could actually _use_ the speed of the RAM, not to mention PCI-X.

Alex Turner NetEconomist

On 7/26/05, John A Meinel [EMAIL PROTECTED] wrote: I saw a review of a relatively inexpensive RAM disk over at anandtech.com, the Gigabyte i-RAM http://www.anandtech.com/storage/showdoc.aspx?i=2480 Basically, it is a PCI card, which takes standard DDR RAM, and has a SATA port on it, so that to the system, it looks like a normal SATA drive. The card costs about $100-150, and you fill it with your own ram, so for a 4GB (max size) disk, it costs around $500. Looking for solid state storage devices, the cheapest I found was around $5k for 2GB. Gigabyte claims that the battery backup can last up to 16h, which seems decent, if not really long (the $5k solution has a built-in harddrive so that if the power goes out, it uses the battery power to copy the ramdisk onto the harddrive for more permanent storage). Anyway, would something like this be reasonable as a drive for storing pg_xlog? With 4GB you could have as many as 256 checkpoint segments. I'm a little leery as it is definitely a version 1.0 product (it is still using an FPGA as the controller, so they were obviously pushing to get the card into production). But it seems like this might be a decent way to improve insert performance, without setting fsync=false. Probably it should see some serious testing (as in power spikes/pulled plugs, etc). I know the article made some claim that if you actually pull out the card it goes into high consumption mode which is somehow greater than if you leave it in the slot with the power off. Which to me seems like a lot of bull, and really means the 16h is only under best-case circumstances. But even 1-2h is sufficient to handle a simple power outage.
And if you had a UPS with detection of power failure, you could always sync the ramdisk to a local partition before the power goes out. Though you could do that with a normal in-memory ramdisk (tmpfs) without having to buy the card. Though it does give you up to an extra 4GB of ram, for machines which have already maxed out their slots. Anyway, I thought I would mention it to the list, to see if anyone else has heard of it, or has any thoughts on the matter. I'm sure there are some people who are using more expensive ram disks; maybe they have some ideas about what this device is missing. (other than costing about 1/10th the price) John =:-
Re: [PERFORM] Cheap RAM disk?
Please see: http://www.newegg.com/Product/Product.asp?Item=N82E16820145309 and http://www.newegg.com/Product/Product.asp?Item=N82E16820145416 The price of registered ECC is not significantly higher than regular ram at this point. Plus if you go with super fast 2-2-2-6 then it's actually more than good ol' 2.5 registered ECC.

Alex Turner NetEconomist

On 7/26/05, PFC [EMAIL PROTECTED] wrote: I'm a little leery as it is definitely a version 1.0 product (it is still using an FPGA as the controller, so they were obviously pushing to get the card into production). Not necessarily. FPGAs have become a sensible choice now. My RME studio soundcard uses a big FPGA. The performance in the test doesn't look that good, though, but don't forget it was run under Windows. For instance they get 77s to copy the Firefox source tree on their Athlon 64/Raptor; my Duron / 7200rpm IDE drive does it in 30 seconds, but not with Windows of course. However it doesn't use ECC so... That's a pity, because they could have implemented ECC in software inside the chip, and have the benefits of error correction with normal, cheap RAM. Well, wait and see...
Re: [PERFORM] Multiple disks: RAID 5 or PG Cluster
Of course these numbers are not true as soon as you exceed the stripe size for a read operation, which is often only 128k. Typically a stripe of mirrors will not read from separate halves of the mirrors either, so RAID 10 is only N/2 best case in my experience; RAID 0+1 is a mirror of stripes and will read from independent halves, but gives worse redundancy.

Alex Turner NetEconomist

On 6/18/05, Jacques Caron [EMAIL PROTECTED] wrote: Hi, At 18:00 18/06/2005, PFC wrote: I don't know what I'm talking about, but wouldn't mirroring be faster than striping for random reads like you often get on a database? (i.e. the reads can be dispatched to any disk) (of course, not for writes, but if you won't use fsync, random writes should be reduced, no?) Roughly, for random reads, the performance (in terms of operations/s) compared to a single disk setup, with N being the number of drives, is:

RAID 0 (striping): read = N, write = N, capacity = N, redundancy = 0
RAID 1 (mirroring, N=2): read = N, write = 1, capacity = 1, redundancy = 1
RAID 5 (striping + parity, N>=3): read = N-1, write = 1/2, capacity = N-1, redundancy = 1
RAID 10 (mirroring + striping, N=2n, N>=4): read = N, write = N/2, capacity = N/2, redundancy = N/2

So depending on your app, i.e. your read/write ratio, how much data can be cached, whether the data is important or not, how much data you have, etc., one or the other option might be better. Jacques.
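Jacques' rough multipliers can be captured in a small function. This is purely the idealized model from his table (random-I/O throughput relative to a single disk), not measured data, and it ignores the stripe-size caveats above:

```python
def raid_profile(level, n):
    """Idealized random-I/O multipliers vs. one disk, per Jacques' table.

    n is the number of drives; returns read/write/capacity multipliers.
    """
    if level == "raid0":                         # striping
        return {"read": n, "write": n, "capacity": n}
    if level == "raid1":                         # mirroring, n == 2
        return {"read": n, "write": 1, "capacity": 1}
    if level == "raid5":                         # striping + parity, n >= 3
        return {"read": n - 1, "write": 0.5, "capacity": n - 1}
    if level == "raid10":                        # mirrored stripes, even n >= 4
        return {"read": n, "write": n / 2, "capacity": n / 2}
    raise ValueError("unknown RAID level: %s" % level)
```

For a 12-disk array this model gives RAID 10 six drives' worth of random write throughput versus half a drive for RAID 5, which is the gap behind the "RAID 5 bites" advice earlier in the thread.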
Re: [PERFORM] Filesystem
We have been using XFS for about 6 months now and it has even tolerated a controller card crash. So far we have mostly good things to report about XFS. I benchmarked raw throughputs at various stripe sizes, and XFS came out on top for us against reiser and ext3. I also used it because of its supposed good support for large files, which was verified somewhat by the benchmarks. I have noticed a problem though - if you have 80 files in a directory, it seems that XFS chokes on simple operations like 'ls' or 'chmod -R ...' where ext3 doesn't; don't know about reiser, I went straight back to default after that problem (that partition is not on a DB server though). Alex Turner netEconomist On 6/3/05, Martin Fandel [EMAIL PROTECTED] wrote: Hi @ all, I have only a little question. Which filesystem is preferred for postgresql? I plan to use xfs (before I used reiserfs). The reason is the xfs_freeze tool to make filesystem snapshots. Is the performance better than reiserfs, is it reliable? best regards, Martin
Re: [PERFORM] Select performance vs. mssql
Until you start worrying about MVCC - we have had problems with the MSSQL implementation of read consistency because of this 'feature'. Alex Turner NetEconomist On 5/24/05, Bruno Wolff III [EMAIL PROTECTED] wrote: On Tue, May 24, 2005 at 08:36:36 -0700, mark durrant [EMAIL PROTECTED] wrote: --MSSQL's ability to hit the index only and not having to go to the table itself results in a _big_ performance/efficiency gain. If someone who's in development wants to pass this along, it would be a nice addition to PostgreSQL sometime in the future. I'd suspect that as well as making one query faster, it would make everything else faster/more scalable as the server load is so much less. This gets brought up a lot. The problem is that the index doesn't include information about whether the current transaction can see the referenced row. Putting this information in the index would add significant overhead to every update, and the opinion of the developers is that this would be a net loss overall.
Re: [PERFORM] Partitioning / Clustering
Having local sessions is unnecessary, and here is my logic: Generally most people have less than 100Mb of bandwidth to the internet. If you make the assertion that you are transferring equal or less session data between your session server (let's say an RDBMS) and the app server than you are between the app server and the client, an out-of-band 100Mb network for session information is plenty of bandwidth. This also represents OLTP-style traffic, which postgresql is pretty good at. You should easily be able to get over 100 TPS. 100 hits per second is an awful lot of traffic, more than any website I've managed will ever see. Why solve the complicated clustered sessions problem when you don't really need to? Alex Turner netEconomist On 5/11/05, PFC [EMAIL PROTECTED] wrote: However, memcached (and for us, pg_memcached) is an excellent way to improve horizontal scalability by taking disposable data (like session information) out of the database and putting it in protected RAM. So, what is the advantage of such a system versus, say, a sticky sessions system where each session is assigned to ONE application server (not PHP then) which keeps it in RAM as native objects instead of serializing and deserializing it on each request? I'd say the sticky sessions should perform a lot better, and if one machine dies, only the sessions on this one are lost. But of course you can't do it with PHP as you need an app server which can manage sessions. Potentially the savings are huge, though. On Google, their distributed system spans a huge number of PCs and it has redundancy, i.e. individual PC failure is a normal thing and is a part of the system, it is handled gracefully. I read a paper on this matter, it's pretty impressive. The google filesystem has nothing to do with databases though, it's more a massive data store / streaming storage.
Re: [PERFORM] Partitioning / Clustering
Ok - my common sense alarm is going off here... There are only 6.446 billion people worldwide. 100 billion page views would require every person in the world to view about 16 pages of Yahoo every day. Not very likely. http://www.internetworldstats.com/stats.htm suggests that there are around 1 billion people actually on the internet. That means each and every person on the internet has to view 100 pages per day of Yahoo. Pretty unlikely, IMHO. I for one don't even use Yahoo ;) 100 million page views per day suggests that 1 in 100 people on the internet each viewed 10 pages of a site. That's a pretty high percentage if you ask me. If I visit 20 web sites in a day, and see an average of 10 pages per site, that means only about 2000 or so sites generate 100 million page views in a day or better. 100 million pageviews averages to 1157/sec, which we'll double for peak load to 2314. I can easily see a system doing 2314 hash lookups per second. Hell, I wrote a system that could do a thousand times that four years ago on a single 1GHz Athlon. Heck - you can get 2314 lookups/sec on a 486 ;) Given that session information doesn't _have_ to persist to storage and can be kept in RAM, a single server could readily manage session information for even very large sites (of course, over a million concurrent users could really start chewing into RAM, but if you are Yahoo, you can probably afford a box with 100GB of RAM ;). We get over 1000 tps on a dual Opteron with a couple of mid-size RAID arrays on 10k discs with fsync on for small transactions. I'm sure that could easily be bettered with a few more dollars. Maybe my numbers are off, but somehow it doesn't seem like that many people need a highly complex session solution to me. Alex Turner netEconomist On 5/12/05, Alex Stapleton [EMAIL PROTECTED] wrote: On 12 May 2005, at 15:08, Alex Turner wrote: Having local sessions is unnecessary, and here is my logic: Generally most people have less than 100Mb of bandwidth to the internet.
If you make the assertion that you are transferring equal or less session data between your session server (let's say an RDBMS) and the app server than you are between the app server and the client, an out-of-band 100Mb network for session information is plenty of bandwidth. This also represents OLTP-style traffic, which postgresql is pretty good at. You should easily be able to get over 100 TPS. 100 hits per second is an awful lot of traffic, more than any website I've managed will ever see. Why solve the complicated clustered sessions problem when you don't really need to? 100 hits a second = 8,640,000 hits a day. I work on a site which does 100 million dynamic pages a day. In comparison, Yahoo probably does 100,000,000,000 (100 billion) views a day if I am interpreting Alexa's charts correctly. Which is about 1,150,000 a second. Now considering the site I work on is not even in the top 1000 on Alexa, there's a lot of sites out there which need to solve this problem, I would assume. There are also only so many hash table lookups a single machine can do, even if it's a Quad Opteron behemoth. Alex Turner netEconomist On 5/11/05, PFC [EMAIL PROTECTED] wrote: However, memcached (and for us, pg_memcached) is an excellent way to improve horizontal scalability by taking disposable data (like session information) out of the database and putting it in protected RAM. So, what is the advantage of such a system versus, say, a sticky sessions system where each session is assigned to ONE application server (not PHP then) which keeps it in RAM as native objects instead of serializing and deserializing it on each request? I'd say the sticky sessions should perform a lot better, and if one machine dies, only the sessions on this one are lost. But of course you can't do it with PHP as you need an app server which can manage sessions. Potentially the savings are huge, though. On Google, their distributed system spans a huge number of PCs and it has redundancy, i.e.
individual PC failure is a normal thing and is a part of the system, it is handled gracefully. I read a paper on this matter, it's pretty impressive. The google filesystem has nothing to do with databases though, it's more a massive data store / streaming storage.
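The traffic arithmetic traded back and forth in this thread is easy to sanity-check; all of the figures below are taken from the posts above.

```python
# Sanity-check the rates quoted in the thread.
SECONDS_PER_DAY = 24 * 60 * 60                 # 86,400

# "100 hits a second = 8,640,000 hits a day"
print(100 * SECONDS_PER_DAY)                   # 8640000

# "100 million pageviews averages to 1157/sec, which we'll
#  double for peak load to 2314"
avg = 100_000_000 // SECONDS_PER_DAY
print(avg, avg * 2)                            # 1157 2314

# Yahoo at a claimed 100 billion views/day:
print(100_000_000_000 // SECONDS_PER_DAY)      # 1157407 -- "about 1,150,000 a second"
```

So both sides' per-second numbers check out; the disagreement is really about whether 100 billion views/day is a plausible input, not about the division.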
Re: [PERFORM] Index bloat problem?
Is: REINDEX DATABASE blah supposed to rebuild all indices in the database, or must you specify each table individually? (I'm asking because I just tried it and it only did system tables.) Alex Turner netEconomist On 4/21/05, Josh Berkus josh@agliodbs.com wrote: Bill, What about if an out-of-the-ordinary number of rows were deleted (say 75% of rows in the table, as opposed to the normal 5%) followed by a 'VACUUM ANALYZE'? Could things get out of whack because of that situation? Yes. You'd want to run REINDEX after an event like that. As you should now. -- Josh Berkus Aglio Database Solutions San Francisco
Re: [PERFORM] Index bloat problem?
Same thing happens in Oracle: ALTER INDEX blah REBUILD to force a rebuild. It will mark the free blocks as 'free' below the PCTFREE value for the tablespace. Basically, if you build an index with entries, and each entry is 1/4 of a block, the database will write 2500 blocks to the disk. If you delete a random 75% of the index values, you will now have 2500 blocks that have 75% free space. The database will reuse that free space in those blocks as you insert new values, but until then, you still have 2500 blocks worth of data on disk that is only 25% full. Rebuilding the index forces the system to physically re-allocate all that data space, and now you have just 2500 entries that use 625 blocks. I'm not sure that 'blocks' is the correct term in postgres - it's segments in Oracle - but the concept remains the same. Alex Turner netEconomist On 4/21/05, Bill Chandler [EMAIL PROTECTED] wrote: --- Josh Berkus josh@agliodbs.com wrote: Bill, What about if an out-of-the-ordinary number of rows were deleted (say 75% of rows in the table, as opposed to the normal 5%) followed by a 'VACUUM ANALYZE'? Could things get out of whack because of that situation? Yes. You'd want to run REINDEX after an event like that. As you should now. -- Josh Berkus Aglio Database Solutions San Francisco Thank you. Though I must say, that is very discouraging. REINDEX is a costly operation, timewise, and due to the fact that it locks out other processes from proceeding. Updates are constantly coming in and queries are occurring continuously. A REINDEX could potentially bring the whole thing to a halt. Honestly, this seems like an inordinate amount of babysitting for a production application. I'm not sure if the client will be willing to accept it. Admittedly my knowledge of the inner workings of an RDBMS is limited, but could somebody explain to me why this would be so? If you delete a bunch of rows why doesn't the index get updated at the same time?
Is this a common issue among all RDBMSs or is it something that is PostgreSQL-specific? Is there any way around it? thanks, Bill
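The block arithmetic in Alex's Oracle example can be checked directly. This sketch takes the figures as quoted: entries at 1/4 of a block each filling 2500 blocks, with a random 75% of them then deleted.

```python
import math

blocks = 2500                       # blocks the index initially occupies
entries_per_block = 4               # each entry is 1/4 of a block
entries = blocks * entries_per_block

# After deleting a random 75%, the index still occupies all 2500 blocks,
# each now only ~25% full, until a rebuild repacks the survivors.
surviving = entries // 4
rebuilt = math.ceil(surviving / entries_per_block)
print(surviving, rebuilt)           # 2500 entries packed into 625 blocks
```

That 2500-block vs 625-block gap is exactly the bloat that REINDEX (or Oracle's ALTER INDEX ... REBUILD) reclaims.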
Re: [PERFORM] How to improve db performance with $7K?
I wonder if that's something to think about adding to Postgresql? A setting for a multiblock read count like Oracle (although having said that, I believe that Oracle natively caches pages much more aggressively than postgresql, which allows the OS to do the file caching). Alex Turner netEconomist P.S. Oracle changed this with 9i: you can change the database block size on a tablespace by tablespace basis, making it smaller for OLTP tablespaces and larger for warehousing tablespaces (at least I think it's per tablespace, might be on a whole DB). On 4/19/05, Jim C. Nasby [EMAIL PROTECTED] wrote: On Mon, Apr 18, 2005 at 06:41:37PM -, Mohan, Ross wrote: Don't you think optimal stripe width would be a good question to research the binaries for? I'd think that drives the answer, largely. (uh oh, pun alert) EG, oracle issues IO requests (this may have changed _just_ recently) in 64KB chunks, regardless of what you ask for. So when I did my striping (many moons ago, when the Earth was young...) I did it in 128KB widths, and set the oracle multiblock read count accordingly. For oracle, any stripe size under 64KB=stupid, anything much over 128K/256K=wasteful. I am eager to find out how PG handles all this. AFAIK PostgreSQL requests data one database page at a time (normally 8k). Of course the OS might do something different. -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what?
Re: [PERFORM] How to improve db performance with $7K?
Whilst I admire your purist approach, I would say that if it is beneficial to performance for a kernel to understand drive geometry, then it is worth investigating teaching it how to deal with that! I was referring less to the kernel than to the controller. Let's say we invented a new protocol that included the drive telling the controller how it was laid out at initialization time, so that the controller could make better decisions about re-ordering seeks. It would be more cost effective to have that set of electronics just once in the controller than 8 times on each drive in an array, which would yield a better performance to cost ratio. Therefore I would suggest it is something that should be investigated. After all, why implement TCQ on each drive if it can be handled more efficiently at the other end by the controller for less money?! Alex Turner netEconomist On 4/19/05, Dave Held [EMAIL PROTECTED] wrote: -Original Message- From: Alex Turner [mailto:[EMAIL PROTECTED] Sent: Monday, April 18, 2005 5:50 PM To: Bruce Momjian Cc: Kevin Brown; pgsql-performance@postgresql.org Subject: Re: [PERFORM] How to improve db performance with $7K? Does it really matter at which end of the cable the queueing is done (assuming both ends know as much about drive geometry etc.)? [...] The parenthetical is an assumption I'd rather not make. If my performance depends on my kernel knowing how my drive is laid out, I would always be wondering if a new drive is going to break any of the kernel's geometry assumptions. Drive geometry doesn't seem like a kernel's business any more than a kernel should be able to decode the CCD signal of an optical mouse. The kernel should queue requests at a level of abstraction that doesn't depend on intimate knowledge of drive geometry, and the drive should queue requests on the concrete level where geometry matters.
A drive shouldn't guess whether a process is trying to read a file sequentially, and a kernel shouldn't guess whether sector 30 is contiguous with sector 31 or not. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
Re: [PERFORM] How to improve db performance with $7K?
This is fundamentally untrue. A mirror is still a mirror. At most in a RAID 10 you can have two simultaneous seeks. You are always going to be limited by the seek time of your drives. It's a stripe, so you have to read from all members of the stripe to get data, requiring all drives to seek. There is no seek-time advantage to adding more drives. By adding more drives you can increase throughput, but the max throughput of the PCI-X bus isn't that high (I think around 400MB/sec). You can easily get this with a six or seven drive RAID 5, or a ten drive RAID 10. At that point you start having to factor in the cost of a bigger chassis to hold more drives, which can be big bucks. Alex Turner netEconomist On 18 Apr 2005 10:59:05 -0400, Greg Stark [EMAIL PROTECTED] wrote: William Yu [EMAIL PROTECTED] writes: Using the above prices for a fixed budget for RAID-10, you could get: SATA 7200 -- 680MB per $1000 SATA 10K -- 200MB per $1000 SCSI 10K -- 125MB per $1000 What a lot of these analyses miss is that cheaper == faster because cheaper means you can buy more spindles for the same price. I'm assuming you picked equal-sized drives to compare, so that 200MB/$1000 for SATA is almost twice as many spindles as the 125MB/$1000. That means it would have almost double the bandwidth. And the 7200 RPM case would have more than 5x the bandwidth. While 10k RPM drives have lower seek times, and SCSI drives have a natural seek time advantage, under load a RAID array with fewer spindles will start hitting contention sooner, which results in higher latency. If the controller works well, the larger SATA arrays above should be able to maintain their mediocre latency much better under load than the SCSI array with fewer drives would maintain its low latency response time despite its drives' lower average seek time.
-- greg
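Greg's price-per-spindle argument can be restated numerically using the capacity-per-$1000 figures he quotes. Since he assumes equal-sized drives, relative capacity per dollar is proportional to relative spindle count for a fixed budget.

```python
# MB of capacity per $1000, as quoted in the post above.
mb_per_1000 = {"SATA 7200": 680, "SATA 10K": 200, "SCSI 10K": 125}

# Spindles (and hence aggregate bandwidth) relative to SCSI 10K
# for the same fixed budget.
relative = {name: mb / mb_per_1000["SCSI 10K"] for name, mb in mb_per_1000.items()}
print(relative)
# SATA 10K buys 1.6x the spindles of SCSI 10K ("almost twice" in the
# post is a little generous); SATA 7200 buys about 5.4x.
```

Whether those extra spindles translate into real throughput is exactly what Alex's reply above disputes for seek-bound OLTP workloads.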
Re: [PERFORM] How to improve db performance with $7K?
[snip] Adding drives will not let you get lower response times than the average seek time on your drives*. But it will let you reach that response time more often. [snip] I believe your assertion is fundamentally flawed. Adding more drives will not let you reach that response time more often. All drives are required to fill every request in all RAID levels (except possibly 0+1, but that isn't used for enterprise applications). Most requests in OLTP require most of the request time to seek, not to read. Only in single large block data transfers will you get any benefit from adding more drives, which is atypical in most database applications. For most database applications, the only way to increase transactions/sec is to decrease request service time, which is generally achieved with better seek times or a better controller card, or possibly spreading your database across multiple tablespaces on separate partitions. My assertion therefore is that simply adding more drives to an already competent* configuration is about as likely to increase your database effectiveness as swiss cheese is to make your car run faster. Alex Turner netEconomist *Assertion here is that the DBA didn't simply configure all tables and xlog on a single 7200 RPM disk, but has separate physical drives for xlog and tablespace, at least on 10k drives.
Re: [PERFORM] How to improve db performance with $7K?
Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at least I would never recommend 1+0 for anything). RAID 10 and RAID 0+1 are _quite_ different. One gives you very good redundancy, the other is only slightly better than RAID 5, but operates faster in degraded mode (single drive). Alex Turner netEconomist On 4/18/05, John A Meinel [EMAIL PROTECTED] wrote: Alex Turner wrote: [snip] Adding drives will not let you get lower response times than the average seek time on your drives*. But it will let you reach that response time more often. [snip] I believe your assertion is fundamentally flawed. Adding more drives will not let you reach that response time more often. All drives are required to fill every request in all RAID levels (except possibly 0+1, but that isn't used for enterprise applications). Actually 0+1 is the recommended configuration for postgres databases (both for xlog and for the bulk data), because the write speed of RAID 5 is quite poor. Hence your base assumption is not correct, and adding drives *does* help. Most requests in OLTP require most of the request time to seek, not to read. Only in single large block data transfers will you get any benefit from adding more drives, which is atypical in most database applications. For most database applications, the only way to increase transactions/sec is to decrease request service time, which is generally achieved with better seek times or a better controller card, or possibly spreading your database across multiple tablespaces on separate partitions. This is probably true. However, if you are doing lots of concurrent connections, and things are properly spread across multiple spindles (using RAID 0+1, or possibly tablespaces across multiple RAIDs), then each seek occurs on a separate drive, which allows them to occur at the same time rather than sequentially. Having 2 processes competing for seeking on the same drive is going to be worse than having them on separate drives.
John =:-
Re: [PERFORM] How to improve db performance with $7K?
I think the add more disks thing is really from the point of view that one disk isn't enough ever. You should really have at least four drives configured into two RAID 1s. Most DBAs will know this, but most average Joes won't. Alex Turner netEconomist On 4/18/05, Steve Poe [EMAIL PROTECTED] wrote: Alex, In the situation of the animal hospital server I oversee, their application is OLTP. Adding hard drives (6-8) does help performance. Benchmarks like pgbench and OSDB agree with it, but in reality users could not see a noticeable change. However, moving the top 5/10 tables and indexes to their own space made a greater impact. Someone who reads the PostgreSQL 8.0 Performance Checklist is going to see point #1 add more disks is the key. How about adding a subpoint explaining when more disks isn't enough or applicable? I may be generalizing the complexity of tuning an OLTP application, but some clarity could help. Steve Poe
Re: [PERFORM] How to improve db performance with $7K?
Ok - well - I am partially wrong... If your stripe size is 64Kb, and you are reading 256k worth of data, it will be spread across four drives, so you will need to read from four devices to get your 256k of data (RAID 0 or 5 or 10), but if you are only reading 64kb of data, I guess you would only need to read from one disk. So my assertion that adding more drives doesn't help is pretty wrong... particularly with OLTP, because it's always dealing with blocks that are smaller than the stripe size. Alex Turner netEconomist On 4/18/05, Jacques Caron [EMAIL PROTECTED] wrote: Hi, At 18:56 18/04/2005, Alex Turner wrote: All drives are required to fill every request in all RAID levels No, this is definitely wrong. In many cases, most drives don't actually have the data requested, so how could they handle the request? When reading one random sector, only *one* drive out of N is ever used to service any given request, be it RAID 0, 1, 0+1, 1+0 or 5. When writing: - in RAID 0, 1 drive - in RAID 1, RAID 0+1 or 1+0, 2 drives - in RAID 5, you need to read on all drives and write on 2. Otherwise, what would be the point of RAID 0, 0+1 or 1+0? Jacques.
Re: [PERFORM] How to improve db performance with $7K?
So I wonder if one could take this stripe size thing further and say that a larger stripe size is more likely to result in requests getting served parallelized across disks, which would lead to increased performance? Again, thanks to all people on this list, I know that I have learnt a _hell_ of a lot since subscribing. Alex Turner netEconomist On 4/18/05, Alex Turner [EMAIL PROTECTED] wrote: Ok - well - I am partially wrong... If your stripe size is 64Kb, and you are reading 256k worth of data, it will be spread across four drives, so you will need to read from four devices to get your 256k of data (RAID 0 or 5 or 10), but if you are only reading 64kb of data, I guess you would only need to read from one disk. So my assertion that adding more drives doesn't help is pretty wrong... particularly with OLTP, because it's always dealing with blocks that are smaller than the stripe size. Alex Turner netEconomist On 4/18/05, Jacques Caron [EMAIL PROTECTED] wrote: Hi, At 18:56 18/04/2005, Alex Turner wrote: All drives are required to fill every request in all RAID levels No, this is definitely wrong. In many cases, most drives don't actually have the data requested, so how could they handle the request? When reading one random sector, only *one* drive out of N is ever used to service any given request, be it RAID 0, 1, 0+1, 1+0 or 5. When writing: - in RAID 0, 1 drive - in RAID 1, RAID 0+1 or 1+0, 2 drives - in RAID 5, you need to read on all drives and write on 2. Otherwise, what would be the point of RAID 0, 0+1 or 1+0? Jacques.
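The drives-per-request reasoning in these posts can be sketched as a small function. This is a simplification that assumes each request starts on a stripe boundary and ignores RAID 5 parity reads.

```python
import math

def disks_touched(request_kb: int, stripe_kb: int, n_disks: int) -> int:
    """How many member disks a single aligned request spans."""
    return min(math.ceil(request_kb / stripe_kb), n_disks)

print(disks_touched(256, 64, 8))  # 256k over 64k stripes -> 4 disks must seek
print(disks_touched(64, 64, 8))   # a 64k request -> 1 disk
print(disks_touched(8, 64, 8))    # an 8k Postgres page -> 1 disk, leaving 7 free
```

This is why the thread lands where it does: for OLTP's small random reads, a stripe at least as large as the typical request keeps each request on one disk, so the remaining spindles can serve other requests concurrently.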
Re: [PERFORM] How to improve db performance with $7K?
Mistyped... I meant 0+1 in the second instance :( On 4/18/05, Joshua D. Drake [EMAIL PROTECTED] wrote: Alex Turner wrote: Not true - the recommended RAID level is RAID 10, not RAID 0+1 (at least I would never recommend 1+0 for anything). Uhmm, I was under the impression that 1+0 was RAID 10 and that 0+1 is NOT RAID 10. Ref: http://www.acnc.com/raid.html Sincerely, Joshua D. Drake
Re: [PERFORM] How to improve db performance with $7K?
On 4/18/05, Jacques Caron [EMAIL PROTECTED] wrote: Hi, At 20:21 18/04/2005, Alex Turner wrote: So I wonder if one could take this stripe size thing further and say that a larger stripe size is more likely to result in requests getting served parallelized across disks, which would lead to increased performance? Actually, it would be pretty much the opposite. The smaller the stripe size, the more evenly distributed data is, and the more disks can be used to serve requests. If your stripe size is too large, many random accesses within one single file (whose size is smaller than the stripe size/number of disks) may all end up on the same disk, rather than being split across multiple disks (the extreme case being stripe size = total size of all disks, which means concatenation). If all accesses had the same cost (i.e. no seek time, only transfer time), the ideal would be to have a stripe size equal to the number of disks. [snip] Ahh yes - but the critical distinction is this: the smaller the stripe size, the more disks will be used to serve _a_ request - which is bad for OLTP, because you want fewer disks per request so that you can have more requests per second, because the cost is mostly seek. If more than one disk has to seek to serve a single request, you are preventing that disk from serving a second request at the same time. To have more throughput in MB/sec, you want a smaller stripe size so that you have more disks serving a single request, allowing you to multiply by the number of effective drives to get total bandwidth. Because OLTP is made up of small reads and writes to a small number of different files, I would guess that you want those files split up across your RAID, but not so much that a single small read or write operation would traverse more than one disk. That would infer that your optimal stripe size is somewhere on the right side of the bell curve that represents your database read and write block count distribution.
If on average the db writer never flushes less than 1MB to disk at a time, then I guess your best stripe size would be 1MB, but that seems very large to me. So I think therefore that I may be contending the exact opposite of what you are postulating! Alex Turner netEconomist
Re: [PERFORM] How to improve db performance with $7K?
Does it really matter at which end of the cable the queueing is done (assuming both ends know as much about drive geometry etc.)? Alex Turner netEconomist On 4/18/05, Bruce Momjian pgman@candle.pha.pa.us wrote: Kevin Brown wrote: Greg Stark wrote: I think you're being misled by analyzing the write case. Consider the read case. When a user process requests a block and that read makes its way down to the driver level, the driver can't just put it aside and wait until it's convenient. It has to go ahead and issue the read right away. Well, strictly speaking it doesn't *have* to. It could delay for a couple of milliseconds to see if other requests come in, and then issue the read if none do. If there are already other requests being fulfilled, then it'll schedule the request in question just like the rest. The idea with SCSI or any command queuing is that you don't have to wait for another request to come in --- you can send the request as it arrives, then if another shows up, you send that too, and the drive optimizes the grouping at a later time, knowing what the drive is doing, rather than queueing in the kernel. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Re: [PERFORM] Intel SRCS16 SATA raid?
No offense to that review, but it really wasn't that good, and drew bad conclusions from the data. I posted it originally and immediately regretted it. See http://www.tweakers.net/reviews/557/18 Amazingly, the controller with 1Gig cache manages a write throughput of 750MB/sec on a single drive. quote: Floating high above the crowd, the ARC-1120 has a perfect view on the struggles of the other adapters. It's because the adapter has 1Gig of RAM - nothing to do with the RAID architecture, it's clearly caching the entire dataset. The drive can't physically run that fast. These guys really don't know what they are doing. Curiously: http://www.tweakers.net/reviews/557/25 The 3ware does very well as a data drive for MySQL. The size of your cache is going to _directly_ affect RAID 5 performance. Put a gig of memory in a 3ware 9500S and benchmark it against the Areca then. Also - folks don't run data partitions on RAID 5 because the write speed is too low. When you look at the results for RAID 10, the 3ware leads the pack. See also: http://www20.tomshardware.com/storage/20041227/areca-raid6-06.html I trust Tom's Hardware a little more to set up a good review, to be honest. The 3ware trounces the Areca in all the IO/sec tests. Alex Turner netEconomist On 4/15/05, Marinos Yannikos [EMAIL PROTECTED] wrote: Joshua D. Drake wrote: Well, I have never even heard of it. 3ware is the de facto authority of reasonable SATA RAID. no! 3ware was rather early in this business, but there are plenty of (IMHO, and some other people's opinion) better alternatives available. 3ware has good Linux drivers, but the performance of their current controllers isn't that good.
Have a look at this: http://www.tweakers.net/reviews/557/1 especially the sequential writes with RAID-5 on this page: http://www.tweakers.net/reviews/557/19 We have been a long-time user of a 3ware 8506 controller (8 disks, RAID-5) and have purchased 2 Areca ARC-1120 now since we weren't satisfied with the performance and the 2TB per array limit... -mjy
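Alex's "caching the entire dataset" objection is just arithmetic. As a back-of-envelope check (the 60 MB/s platter rate is an assumed figure for a single 2004-era SATA disk, not from the review):

```python
# Back-of-envelope check on the review's headline number: a 2004-era
# SATA disk sustains perhaps 60 MB/s at the platter (assumed figure),
# so a measured 750 MB/s "write throughput" on a single drive can only
# be the controller's 1 GB of RAM absorbing a benchmark working set
# that fits entirely in cache.
platter_mb_s = 60      # assumed sustained media rate for one drive
measured_mb_s = 750    # figure quoted from the tweakers.net review
speedup = measured_mb_s / platter_mb_s
print(f"measured rate is {speedup:.1f}x what the media can physically do")
```

A result more than an order of magnitude above the media rate is measuring the controller's RAM, not the drive — which is the point of the "benchmark it with a gig of memory in the 3ware" challenge.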
Re: [PERFORM] Intel SRCS16 SATA raid?
The original thread was how much can I get for $7k. You can't fit a 15k RPM SCSI solution into $7K ;) Some of us are on a budget! 10k RPM SATA drives give acceptable performance at a good price; that's really the point here. I have never really argued that SATA is going to match SCSI performance on multidrive arrays for IO/sec. But it's all about the benjamins, baby. If I told my boss we need $25k for a database machine, he'd tell me that was impossible and that I have $5k to do it. If I tell him $7k - he will swallow that. We don't _need_ the amazing performance of a 15k RPM drive config. Our biggest hit is reads, so we can buy 3x SATA machines and load balance. It's all about the application, and buying what is appropriate. I don't buy a Corvette if all I need is a Malibu. Alex Turner netEconomist On 4/15/05, Dave Held [EMAIL PROTECTED] wrote: -Original Message- From: Alex Turner [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 6:15 PM To: Dave Held Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Intel SRCS16 SATA raid? Looking at the numbers, the raptor with TCQ enabled was close or beat the Atlas III 10k drive on most benchmarks. And I would be willing to bet that the Atlas 10k is not using the same generation of technology as the Raptors. Naturally a 15k drive is going to be faster in many areas, but it is also much more expensive. It was only 44% better on the server tests than the raptor with TCQ, but it costs nearly 300% more ($538 cdw.com, $180 newegg.com). State that in terms of cars. Would you be willing to pay 300% more for a car that is 44% faster than your competitor's? Of course you would, because we all recognize that the cost of speed/performance does not scale linearly. Naturally, you buy the best speed that you can afford, but when it comes to hard drives, the only major feature whose price tends to scale anywhere close to linearly is capacity.
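The budget argument can be put in numbers using the prices quoted in the thread. One nit: ~3x the price ($538 vs $180) is strictly a ~200% premium, though the thread calls it "nearly 300% more"; the performance figures are the thread's, the normalization is mine.

```python
# The budget argument in numbers, using the prices quoted in the
# thread (Atlas 15k at $538 from cdw.com, Raptor at $180 from
# newegg.com) and the cited 44% server-benchmark advantage. Strictly,
# ~3x the price is a ~200% premium, not 300%.
atlas_price, raptor_price = 538.0, 180.0
atlas_perf, raptor_perf = 1.44, 1.00      # 44% faster, normalized
price_premium = atlas_price / raptor_price - 1.0
perf_per_dollar_ratio = (raptor_perf / raptor_price) / (atlas_perf / atlas_price)
print(f"premium {price_premium:.0%}; Raptor delivers "
      f"{perf_per_dollar_ratio:.2f}x the performance per dollar")
```

On these figures the Raptor delivers roughly twice the performance per dollar — which is the whole "benjamins" point, and also why Dave's reply is right that nobody actually expects speed to scale linearly with price.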
Note also that the 15k drive was the only drive that kept up with the raptor on raw transfer speed, which is going to matter for WAL. So get a Raptor for your WAL partition. ;) [...] The Raptor drives can be had for as little as $180/ea, which is quite a good price point considering they can keep up with their SCSI 10k RPM counterparts on almost all tests with NCQ enabled (Note that 3ware controllers _don't_ support NCQ, although they claim their HBA based queueing is 95% as good as NCQ on the drive). Just keep in mind the points made by the Seagate article. You're buying much more than just performance for that $500+. You're also buying vibrational tolerance, high MTBF, better internal environmental controls, and a pretty significant margin on seek time, which is probably your most important feature for disks storing tables. An interesting test would be to stick several drives in a cabinet and graph how performance is affected at the different price points/ technologies/number of drives. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
Re: [PERFORM] Intel SRCS16 SATA raid?
I stand corrected! Maybe I should re-evaluate our own config! Alex T (The Dell PERC controllers do pretty much suck on Linux) On 4/15/05, Vivek Khera [EMAIL PROTECTED] wrote: On Apr 15, 2005, at 11:01 AM, Alex Turner wrote: You can't fit a 15k RPM SCSI solution into $7K ;) Some of us are on a budget! I just bought a pair of Dual Opteron, 4GB RAM, LSI 320-2X RAID dual channel with 8 36GB 15kRPM Seagate drives. Each one of these boxes set me back just over $7k, including onsite warranty. They totally blow away the Dell Dual XEON with external 14 disk RAID (also 15kRPM drives, manufacturer unknown) which also has 4GB RAM and a Dell PERC 3/DC controller, the whole of which set me back over $15k. Vivek Khera, Ph.D. +1-301-869-4449 x806
Re: [PERFORM] Intel SRCS16 SATA raid?
I have read a large chunk of this, and I would highly recommend it to anyone who has been participating in the drive discussions. It is most informative!! Alex Turner netEconomist On 4/14/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Greg, I posted this link under a different thread (the $7k server thread). It is a very good read on why SCSI is better for servers than ATA. I didn't note bias, though it is from a drive manufacturer. YMMV. There is an interesting, though dated, appendix on different manufacturers' drive characteristics. http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf Enjoy, Rick [EMAIL PROTECTED] wrote on 04/14/2005 09:54:45 AM: Our vendor is trying to sell us on an Intel SRCS16 SATA raid controller instead of the 3ware one. Poking around it seems this does come with Linux drivers and there is a battery backup option. So it doesn't seem to be completely insane. Anyone have any experience with these controllers? I'm also wondering about whether I'm better off with one of these SATA raid controllers or just going with SCSI drives. -- greg
Re: [PERFORM] Intel SRCS16 SATA raid?
I have put together a little head-to-head performance comparison of a 15k SCSI, 10k SCSI, 10K SATA w/TCQ, 10K SATA wo/TCQ and 7.2K SATA drive at storage review http://www.storagereview.com/php/benchmark/compare_rtg_2001.php?typeID=10testbedID=3osID=4raidconfigID=1numDrives=1devID_0=232devID_1=40devID_2=259devID_3=267devID_4=261devID_5=248devCnt=6 It does illustrate some of the weaknesses of SATA drives, but all in all the Raptor drives put on a good show. Alex Turner netEconomist On 4/14/05, Alex Turner [EMAIL PROTECTED] wrote: I have read a large chunk of this, and I would highly recommend it to anyone who has been participating in the drive discussions. It is most informative!! Alex Turner netEconomist
Re: [PERFORM] Intel SRCS16 SATA raid?
Looking at the numbers, the raptor with TCQ enabled was close or beat the Atlas III 10k drive on most benchmarks. Naturally a 15k drive is going to be faster in many areas, but it is also much more expensive. It was only 44% better on the server tests than the raptor with TCQ, but it costs nearly 300% more ($538 cdw.com, $180 newegg.com). Note also that the 15k drive was the only drive that kept up with the raptor on raw transfer speed, which is going to matter for WAL. For those of us on a budget, a quality controller card with lots of RAM is going to be our biggest friend because it can cache writes, and improve performance. The 3ware controllers seem to be universally benchmarked as the best SATA RAID 10 controllers where database performance is concerned. Even the crappy tweakers.net review had the 3ware as the fastest controller for a MySQL data partition in RAID 10. The Raptor drives can be had for as little as $180/ea, which is quite a good price point considering they can keep up with their SCSI 10k RPM counterparts on almost all tests with NCQ enabled (Note that 3ware controllers _don't_ support NCQ, although they claim their HBA based queueing is 95% as good as NCQ on the drive). Alex Turner netEconomist On 4/14/05, Dave Held [EMAIL PROTECTED] wrote: -Original Message- From: Alex Turner [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 12:14 PM To: [EMAIL PROTECTED] Cc: Greg Stark; pgsql-performance@postgresql.org; [EMAIL PROTECTED] Subject: Re: [PERFORM] Intel SRCS16 SATA raid? I have put together a little head-to-head performance comparison of a 15k SCSI, 10k SCSI, 10K SATA w/TCQ, 10K SATA wo/TCQ and 7.2K SATA drive at storage review http://www.storagereview.com/php/benchmark/compare_rtg_2001.php?typeID=10testbedID=3osID=4raidconfigID=1numDrives=1devID_0=232devID_1=40devID_2=259devID_3=267devID_4=261devID_5=248devCnt=6 It does illustrate some of the weaknesses of SATA drives, but all in all the Raptor drives put on a good show. [...]
I think it's a little misleading that your tests show 0ms seek times for some of the write tests. The environmental test also selects a missing data point as the winner. Besides that, it seems to me that seek time is one of the most important features for a DB server, which means that the SCSI drives are the clear winners and the non-WD SATA drives are the embarrassing losers. Transfer rate is important, but perhaps less so because DBs tend to read/write small blocks rather than large files. On the server suite, which seems to me to be the most relevant for DBs, the Atlas 15k spanks the other drives by a fairly large margin (especially the lesser SATA drives). When you ignore the consumer app benchmarks, I wouldn't be so confident in saying that the Raptors put on a good show. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129
Re: [PERFORM] How to improve db performance with $7K?
3ware claim that their 'software' implemented command queueing performs at 95% effectiveness compared to the hardware queueing on a SCSI drive, so I would say that they agree with you. I'm still learning, but as I read it, the bits are split across the platters and there is only 'one' head, but it happens to be reading from multiple platters. The 'further' in linear distance the data is from the current position, the longer it's going to take to get there. This seems to be true based on a document that was circulated. A hard drive takes a considerable amount of time to 'find' a track on the platter compared to the rotational speed, which would agree with the fact that you can read 70MB/sec, but it takes up to 13ms to seek. The ATA protocol is just how the HBA communicates with the drive; there is no reason why the HBA can't reschedule reads and writes just like the SCSI drive would do natively, and this is in fact what 3ware claims. I get the feeling based on my own historical experience that generally drives don't just have a bunch of bad blocks. This all leads me to believe that you can predict with pretty good accuracy how expensive it is to retrieve a given block knowing its linear increment. Alex Turner netEconomist On 4/14/05, Kevin Brown [EMAIL PROTECTED] wrote: Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: I really don't see how this is any different between a system that has tagged queueing to the disks and one that doesn't. The only difference is where the queueing happens. In the case of SCSI, the queueing happens on the disks (or at least on the controller). In the case of SATA, the queueing happens in the kernel. That's basically what it comes down to: SCSI lets the disk drive itself do the low-level I/O scheduling whereas the ATA spec prevents the drive from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's possible for the drive to rearrange reads as well as writes --- which AFAICS is just not possible in ATA.
(Maybe in the newest spec...) The reason this is so much more of a win than it was when ATA was designed is that in modern drives the kernel has very little clue about the physical geometry of the disk. Variable-size tracks, bad-block sparing, and stuff like that make for a very hard-to-predict mapping from linear sector addresses to actual disk locations. Yeah, but it's not clear to me, at least, that this is a first-order consideration. A second-order consideration, sure, I'll grant that. What I mean is that when it comes to scheduling disk activity, knowledge of the specific physical geometry of the disk isn't really important. What's important is whether or not the disk conforms to a certain set of expectations. Namely, that the general organization is such that addressing the blocks in block number order guarantees maximum throughput. Now, bad block remapping destroys that guarantee, but unless you've got a LOT of bad blocks, it shouldn't destroy your performance, right? Combine that with the fact that the drive controller can be much smarter than it was twenty years ago, and you can see that the case for doing I/O scheduling in the kernel and not in the drive is pretty weak. Well, I certainly grant that allowing the controller to do the I/O scheduling is faster than having the kernel do it, as long as it can handle insertion of new requests into the list while it's in the middle of executing a request. The most obvious case is when the head is in motion and the new request can be satisfied by reading from the media between where the head is at the time of the new request and where the head is being moved to. My argument is that a sufficiently smart kernel scheduler *should* yield performance results that are reasonably close to what you can get with that feature. Perhaps not quite as good, but reasonably close. It shouldn't be an orders-of-magnitude type difference. 
-- Kevin Brown [EMAIL PROTECTED]
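The 70 MB/sec vs 13 ms figures from upthread already explain why seek ordering is the whole game. A quick sketch using those numbers (the 8 KB page size is an assumption, matching PostgreSQL's default block size):

```python
# Using the thread's own figures (70 MB/s sequential transfer, up to
# 13 ms per seek) to show why random I/O, not raw bandwidth, is what
# scheduling is fighting over. The 8 KB page size is an assumption
# (PostgreSQL's default block size).
seq_mb_s = 70.0
seek_s = 0.013
page_kb = 8.0
transfer_s = (page_kb / 1024) / seq_mb_s       # ~0.1 ms to move 8 KB
random_reads_s = 1.0 / (seek_s + transfer_s)   # seek utterly dominates
random_mb_s = random_reads_s * page_kb / 1024
print(f"~{random_reads_s:.0f} random page reads/s "
      f"(~{random_mb_s:.2f} MB/s) vs {seq_mb_s:.0f} MB/s sequential")
```

Worst-case random access delivers well under 1% of sequential bandwidth on these figures, which is why either the drive or the kernel reordering requests to shorten seeks matters so much more than bus speed.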
Re: [PERFORM] How to improve db performance with $7K?
Based on the reading I'm doing, and somebody please correct me if I'm wrong, it seems that SCSI drives contain an on-disk controller that has to process the tagged queue. SATA-I doesn't have this. This additional controller is basically an on-board computer that figures out the best order in which to process commands. I believe you are also paying for the increased tolerance that generates a better speed. If you compare an 80Gig 7200RPM IDE drive to a WD Raptor 76G 10k RPM to a Seagate 10k.6 drive to a Seagate Cheetah 15k drive, each one represents a step up in parts and technology, thereby generating a cost increase (at least that's what the manufacturers tell us). I know if you've ever held a 15k drive in your hand, you can notice a considerable weight difference between it and a 7200RPM IDE drive. Alex Turner netEconomist On Apr 7, 2005 11:37 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Another simple question: Why is SCSI more expensive? After the eleventy-millionth controller is made, it seems like SCSI and SATA are using a controller board and a spinning disk. Is somebody still making money by licensing SCSI technology? Rick [EMAIL PROTECTED] wrote on 04/06/2005 11:58:33 PM: You asked for it! ;-) If you want cheap, get SATA. If you want fast under *load* conditions, get SCSI. Everything else at this time is marketing hype, either intentional or learned. Ignoring dollars, expect to see SCSI beat SATA by 40%. * * * What I tell you three times is true * * * Also, compare the warranty you get with any SATA drive with any SCSI drive. Yes, you still have some change leftover to buy more SATA drives when they fail, but... it fundamentally comes down to some actual implementation and not what is printed on the cardboard box. Disk systems are bound by the rules of queueing theory. You can hit the sales rep over the head with your queueing theory book. Ultra320 SCSI is king of the hill for high concurrency databases.
If you're only streaming or serving files, save some money and get a bunch of SATA drives. But if you're reading/writing all over the disk, the simple first-come-first-serve SATA heuristic will hose your performance under load conditions. Next year, they will *try* to bring out some SATA cards that improve on first-come-first-serve, but they ain't here now. There are a lot of rigged performance tests out there... Maybe by the time they fix the queueing problems, Serial Attached SCSI (a/k/a SAS) will be out. Looks like Ultra320 is the end of the line for parallel SCSI, as Ultra640 SCSI (a/k/a SPI-5) is dead in the water. Ultra320 SCSI. Ultra320 SCSI. Ultra320 SCSI. Serial Attached SCSI. Serial Attached SCSI. Serial Attached SCSI. For future trends, see: http://www.incits.org/archive/2003/in031163/in031163.htm douglas p.s. For extra credit, try comparing SATA and SCSI drives when they're 90% full. On Apr 6, 2005, at 8:32 PM, Alex Turner wrote: I guess I'm setting myself up here, and I'm really not being ignorant, but can someone explain exactly how SCSI is supposed to be better than SATA? Both systems use drives with platters. Each drive can physically only read one thing at a time. SATA gives each drive its own channel, but you have to share in SCSI. A SATA controller typically can do 3Gb/sec (384MB/sec) per drive, but SCSI can only do 320MB/sec across the entire array. What am I missing here? Alex Turner netEconomist
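The "hit the sales rep with your queueing theory book" point can be made concrete with the textbook M/M/1 result, mean time in system W = 1/(μ − λ). Disks are not literally M/M/1 queues and the service rate below is an assumption; this only shows the qualitative shape of "fine when idle, hosed under load":

```python
# The "queueing theory" point made concrete with the textbook M/M/1
# result: mean time in system W = 1/(mu - lambda). As offered load
# approaches service capacity, response time explodes -- and anything
# that raises the effective service rate mu (e.g. command reordering)
# pushes that wall out. Disks are not literally M/M/1 queues; this is
# only the qualitative shape, with an assumed service rate.

def mm1_time_in_system(service_rate, arrival_rate):
    assert arrival_rate < service_rate, "queue is unstable"
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # assumed: requests/s one drive can service
for lam in (50.0, 80.0, 90.0, 99.0):
    w_ms = mm1_time_in_system(mu, lam) * 1000
    print(f"load {lam / mu:.0%}: ~{w_ms:.0f} ms per request")
```

Response time grows hyperbolically as utilization approaches 100%, which is why a first-come-first-serve SATA drive that services requests even slightly slower than a reordering SCSI drive falls apart disproportionately under concurrent load.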
Re: RE : RE: [PERFORM] Postgresql vs SQLserver for this application ?
I think everyone was scared off by the 5000 inserts per second number. I've never seen even Oracle do this on a top end Dell system with copious SCSI attached storage. Alex Turner netEconomist On Apr 6, 2005 3:17 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Unfortunately. But we are in the process to choose Postgresql with pgcluster. I'm currently running some tests (performance, stability...) Save the money on the license fees, you get it for your hardware ;-) I still welcome any advice or comments and I'll let you know how the project is going on. Benjamin. Mohan, Ross [EMAIL PROTECTED] 05/04/2005 20:48 Pour : [EMAIL PROTECTED] cc : Objet : RE: [PERFORM] Postgresql vs SQLserver for this application ? You never got answers on this? Apologies, I don't have one, but'd be curious to hear about any you did get thx Ross -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Monday, April 04, 2005 4:02 AM To: pgsql-performance@postgresql.org Subject: [PERFORM] Postgresql vs SQLserver for this application ? hi all. We are designing a quite big application that requires a high-performance database backend. The rates we need to obtain are at least 5000 inserts per second and 15 selects per second for one connection. There should only be 3 or 4 simultaneous connections. I think our main concern is to deal with the constant flow of data coming from the inserts that must be available for selection as fast as possible. (kind of real time access ...) As a consequence, the database should rapidly increase up to more than one hundred gigs. We still have to determine how and when we should back up old data to prevent the application from a performance drop. We intend to develop some kind of real-time partitioning on our main table to keep the flows up.
At first, we were planning to use SQL Server as it has features that in my opinion could help us a lot: - replication - clustering Recently we started to study Postgresql as a solution for our project: - it also has replication - Postgis module can handle geographic datatypes (which would facilitate our developments) - We do have a strong knowledge on Postgresql administration (we use it for production processes) - it is free (!) and we could save money for hardware purchase. Is SQL Server clustering a real asset? How reliable are Postgresql replication tools? Should I trust Postgresql performance for this kind of needs? My question is a bit fuzzy but any advice is most welcome... hardware, tuning or design tips as well :)) Thanks a lot. Benjamin.
Re: RE : RE: [PERFORM] Postgresql vs SQLserver for this application ?
I guess I was thinking more in the range of 5000 transactions/sec, less so 5000 rows on bulk import... Alex On Apr 6, 2005 12:47 PM, Mohan, Ross [EMAIL PROTECTED] wrote: snip good stuff... 31Million tuples were loaded in approx 279 seconds, or approx 112k rows per second. I'd love to see PG get into this range..i am a big fan of PG (just a rank newbie) but I gotta think the underlying code to do this has to be not-too-complex. I'd say we're there. || CLAPPING!! Yes! PG is there, assuredly! So VERY cool! I made a newbie error of conflating COPY with INSERT. I don't know if I could get Oracle to do much more than about 500-1500 rows/sec... PG is quite impressive. Makes one wonder why corporations positively insist on giving Oracle yearly. shrug -Original Message- From: Rod Taylor [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 06, 2005 12:41 PM To: Mohan, Ross Cc: pgsql-performance@postgresql.org Subject: Re: RE : RE: [PERFORM] Postgresql vs SQLserver for this application ? On Wed, 2005-04-06 at 16:12 +, Mohan, Ross wrote: I wish I had a Dell system and run case to show you Alex, but I don't... however... using Oracle's direct path feature, it's pretty straightforward. We've done 110,000 rows per second into index-less tables on a big system (IBM Power5 chips, Hitachi SAN). ( Yes, I am sure: over 100K a second. Sustained for almost 9 minutes. ) Just for kicks I did a local test on a desktop machine (single CPU, single IDE drive) using COPY from STDIN for a set of integers in via a single transaction, no indexes. 1572864 tuples were loaded in 13715.613ms, which is approx 115k rows per second. Okay, no checkpoints and I didn't cross an index boundary, but I also haven't tuned the config file beyond bumping up the buffers. Let's try again with more data this time. 31Million tuples were loaded in approx 279 seconds, or approx 112k rows per second.
I'd love to see PG get into this range..i am a big fan of PG (just a rank newbie) but I gotta think the underlying code to do this has to be not-too-complex. I'd say we're there. -Original Message- From: Alex Turner [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 06, 2005 11:38 AM To: [EMAIL PROTECTED] Cc: pgsql-performance@postgresql.org; Mohan, Ross Subject: Re: RE : RE: [PERFORM] Postgresql vs SQLserver for this application ? I think everyone was scared off by the 5000 inserts per second number. I've never seen even Oracle do this on a top end Dell system with copious SCSI attached storage. Alex Turner netEconomist
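The COPY-vs-INSERT confusion upthread comes down to per-statement round trips: a synchronous single-row INSERT pays a full network/commit round trip per row, while COPY amortizes that over the whole batch. A hedged sketch of the arithmetic — both constants below are assumed for illustration, not measurements from the thread:

```python
# Why single-row INSERTs top out near 1000/s while COPY reaches
# 100k+ rows/s in the numbers upthread: every synchronous statement
# pays a full round trip (network latency + commit). Both constants
# below are assumptions for illustration, not measurements.
round_trip_s = 0.001   # assumed per-statement round-trip/commit cost
per_row_s = 0.000005   # assumed marginal cost of shipping one row

def rows_per_sec(batch_size):
    # One round trip amortized over batch_size rows.
    return batch_size / (round_trip_s + batch_size * per_row_s)

for n in (1, 100, 10000):
    print(f"batch of {n:>5}: ~{rows_per_sec(n):,.0f} rows/s")
```

With a 1 ms round trip you cannot exceed ~1000 single-row transactions/sec no matter how fast the server is, which matches Rod's observed per-process limit; large batches land in the same ballpark as his ~112k rows/sec COPY result.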
Re: [PERFORM] Réf
I think his point was that 900 * 4 != 2400 Alex Turner netEconomist On Apr 6, 2005 2:23 PM, Rod Taylor [EMAIL PROTECTED] wrote: On Wed, 2005-04-06 at 19:42 +0200, Steinar H. Gunderson wrote: On Wed, Apr 06, 2005 at 01:18:29PM -0400, Rod Taylor wrote: Yeah, I think that can be done provided there is more than one worker. My limit seems to be about 1000 transactions per second each with a single insert for a single process (round trip time down the Fibre Channel is large) but running 4 simultaneously only drops throughput to about 900 per process (total of 2400 transactions per second) and the machine still seemed to have lots of oomph to spare. Erm, have I missed something here? 900 * 4 = 2400? Nope. You've not missed anything. If I ran 10 processes, the requirement would be met. --
Re: [PERFORM] How to improve db performance with $7K?
Well - unfortunately software RAID isn't appropriate for everyone, and some of us need a hardware RAID controller. The LSI Megaraid 320-2 card is almost exactly the same price as the 3ware 9500S-12 card (although I will concede that a 320-2 card can handle at most 2x14 devices compared with the 12 on the 9500S). If someone can come up with a test, I will be happy to run it and see how it goes. I would be _very_ interested in the results having just spent $7k on a new DB server!! I have also seen really bad performance out of SATA. It was with either an on-board controller, or a cheap RAID controller from HighPoint. As soon as I put in a decent controller, things went much better. I think it's unfair to base your opinion of SATA on a test that had a poor controller. I know I'm not the only one here running SATA RAID and being very satisfied with the results. Thanks, Alex Turner netEconomist On Apr 6, 2005 4:01 PM, William Yu [EMAIL PROTECTED] wrote: It's the same money if you factor in the 3ware controller. Even without a caching controller, SCSI works good in multi-threaded IO (notwithstanding crappy shit from Dell or Compaq). You can get such cards from LSI for $75. And of course, many server MBs come with LSI controllers built-in. Our older 32-bit production servers all use Linux software RAID w/ SCSI and there's no issues when multiple users/processes hit the DB. *Maybe* a 3ware controller w/ onboard cache + battery backup might do much better for multi-threaded IO than just plain-jane SATA. Unfortunately, I have not been able to find anything online that can confirm or deny this. Hence, the choice is spend $$$ on the 3ware controller and hope it meets your needs -- or spend $$$ on SCSI drives and be sure. Now if you want to run such tests, we'd all be delighted to see the results so we have another option for building servers. Alex Turner wrote: It's hardly the same money, the drives are twice as much.
It's all about the controller, baby, with any kind of drive. A bad SCSI controller will give sucky performance too, believe me. We had a Compaq Smart Array 5304, and its performance was _very_ sub par. If someone has a simple benchmark test database to run, I would be happy to run it on our hardware here. Alex Turner On Apr 6, 2005 3:30 AM, William Yu [EMAIL PROTECTED] wrote: Alex Turner wrote: I'm no drive expert, but it seems to me that our write performance is excellent. I think what most are concerned about is OLTP where you are doing heavy write _and_ heavy read performance at the same time. Our system is mostly read during the day, but we do a full system update every night that is all writes, and it's very fast compared to the smaller SCSI system we moved off of. Nearly a 6x speed improvement, as fast as 900 rows/sec with a 48 byte record, one row per transaction. I've started with SATA in a multi-read/multi-write environment. While it ran pretty good with 1 thread writing, the addition of a 2nd thread (whether reading or writing) would cause exponential slowdowns. I suffered through this for a week and then switched to SCSI. Single threaded performance was pretty similar but with the advanced command queueing SCSI has, I was able to do multiple reads/writes simultaneously with only a small performance hit for each thread. Perhaps having a SATA caching raid controller might help this situation. I don't know. It's pretty hard justifying buying a $$$ 3ware controller just to test it when you could spend the same money on SCSI and have a guarantee it'll work good under multi-IO scenarios.
Re: [PERFORM] How to improve db performance with $7K?
I guess I'm setting myself up here, and I'm really not being ignorant, but can someone explain exactly how SCSI is supposed to be better than SATA? Both systems use drives with platters. Each drive can physically only read one thing at a time. SATA gives each drive its own channel, but you have to share in SCSI. A SATA controller typically can do 3Gb/sec (384MB/sec) per drive, but SCSI can only do 320MB/sec across the entire array. What am I missing here? Alex Turner netEconomist On Apr 6, 2005 5:41 PM, Jim C. Nasby [EMAIL PROTECTED] wrote: Sorry if I'm pointing out the obvious here, but it seems worth mentioning. AFAIK all 3ware controllers are set up so that each SATA drive gets its own SATA bus. My understanding is that by and large, SATA still suffers from a general inability to have multiple outstanding commands on the bus at once, unlike SCSI. Therefore, to get good performance out of SATA you need to have a separate bus for each drive. Theoretically, it shouldn't really matter that it's SATA over ATA, other than I certainly wouldn't want to try and cram 8 ATA cables into a machine... Incidentally, when we were investigating storage options at a previous job we talked to someone who deals with RS/6000 storage. He had a bunch of info about their serial controller protocol (which I can't think of the name of) vs SCSI. SCSI had a lot more overhead, so you could end up saturating even a 160MB SCSI bus with only 2 or 3 drives. People are finally realizing how important bandwidth has become in modern machines. Memory bandwidth is why RS/6000 was (and maybe still is) cleaning Sun's clock, and it's why the Opteron blows Itaniums out of the water. Likewise it's why SCSI is so much better than IDE (unless you just give each drive its own dedicated bandwidth). -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what? ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
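Alex's bandwidth arithmetic can be put in a quick sketch. The figures below are the ones quoted in the thread (the 3Gb/sec raw line rate ignores 8b/10b encoding overhead, so usable SATA throughput is actually lower, and the 70MB/sec per-drive number is the single-U320-drive figure cited later in this thread); the point is simply that dedicated links scale with drive count while a shared bus does not:

```python
# Rough model of the thread's claim: each SATA drive gets a dedicated link,
# while U320 SCSI drives share one 320 MB/sec bus. Figures are illustrative.

def sata_aggregate_mb_s(drives, link_gbit_s=3.0):
    """Dedicated link per drive: aggregate scales with drive count."""
    per_drive_mb_s = link_gbit_s * 1000 / 8  # raw line rate, ignoring encoding
    return drives * per_drive_mb_s

def scsi_aggregate_mb_s(drives, bus_mb_s=320.0, per_drive_mb_s=70.0):
    """Shared bus: aggregate is capped by the bus no matter how many drives."""
    return min(bus_mb_s, drives * per_drive_mb_s)

for n in (2, 4, 8):
    print(n, sata_aggregate_mb_s(n), scsi_aggregate_mb_s(n))
```

As Jim's reply argues, though, link bandwidth isn't the real differentiator (per-platter transfer rates are well below either bus figure); command queuing is.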
Re: [PERFORM] How to improve db performance with $7K?
Ok - so I found this fairly good online review of various SATA cards out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10. http://www.tweakers.net/reviews/557/ Very interesting stuff. Alex Turner netEconomist On Apr 6, 2005 7:32 PM, Alex Turner [EMAIL PROTECTED] wrote: [snip - same message quoted in full above]
Re: [PERFORM] How to improve db performance with $7K?
Ok - I take it back - I'm reading through this now, and realising that the reviews are pretty clueless in several places... On Apr 6, 2005 8:12 PM, Alex Turner [EMAIL PROTECTED] wrote: Ok - so I found this fairly good online review of various SATA cards out there, with 3ware not doing too hot on RAID 5, but ok on RAID 10. http://www.tweakers.net/reviews/557/ Very interesting stuff. Alex Turner netEconomist On Apr 6, 2005 7:32 PM, Alex Turner [EMAIL PROTECTED] wrote: [snip - same messages quoted in full above]
Re: [PERFORM] How to improve db performance with $7K?
Yeah - the more reading I'm doing, the more I'm finding out. Allegedly the Western Digital Raptor drives implement a version of ATA-4 Tagged Queuing, which allows reordering of commands. Some controllers support this. The 3ware docs say that the controller supports reordering both on the controller and down at the drive. *shrug* This of course is all supposed to go away with SATA II, which has NCQ, Native Command Queuing. Of course the 3ware controllers don't support SATA II, but a few others do, and I'm sure 3ware will come out with a controller that does. Alex Turner netEconomist On 06 Apr 2005 23:00:54 -0400, Greg Stark [EMAIL PROTECTED] wrote: Alex Turner [EMAIL PROTECTED] writes: SATA gives each drive its own channel, but you have to share in SCSI. A SATA controller can typically do 3Gb/sec (384MB/sec) per drive, but SCSI can only do 320MB/sec across the entire array. SCSI controllers often have separate channels for each device too. In any case the issue with the IDE protocol is that fundamentally you can only have a single command pending. SCSI can have many commands pending. This is especially important for a database like Postgres that may be busy committing one transaction while another is trying to read. Having several commands queued on the drive gives it a chance to execute any that are on the way to the committing transaction. However, I'm under the impression that 3ware has largely solved this problem. Also, if you save a few dollars and can afford one additional drive, that additional drive may improve your array speed enough to overcome that inefficiency. -- greg
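The benefit of the command queuing Greg describes can be illustrated with a toy seek model (the track numbers are made up, and a real drive also reorders for rotational position, not just head travel):

```python
# FIFO vs. reordered servicing of a queue of outstanding requests.
# A drive that can only hold one command services requests in arrival
# order; a queuing drive can sweep across the platter elevator-style.

def total_seek_distance(requests, start=0):
    """Sum of head movements when servicing requests in the given order."""
    pos, travel = start, 0
    for track in requests:
        travel += abs(track - pos)
        pos = track
    return travel

pending = [980, 10, 990, 20, 1000, 30]                # hypothetical tracks
fifo_travel = total_seek_distance(pending)            # one command at a time
queued_travel = total_seek_distance(sorted(pending))  # reordered sweep
print(fifo_travel, queued_travel)                     # 5850 vs 1000
```

The gap grows with queue depth, which is why queuing matters most for a busy database mixing WAL commits with random reads.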
Re: [PERFORM] How to improve db performance with $7K?
To be honest, I've yet to run across a SCSI configuration that can touch the 3ware SATA controllers. I have yet to see one top 80MB/sec, let alone 180MB/sec read or write, which is why we moved _away_ from SCSI. I've seen Compaq, Dell and LSI controllers all do pathetically badly on RAID 1, RAID 5 and RAID 10. 35MB/sec for a three-drive RAID 0 is not bad, it's appalling. The hardware manufacturer should be publicly embarrassed for this kind of speed. A single U320 10k drive can do close to 70MB/sec sustained. If someone can offer benchmarks to the contrary (particularly on Linux), I would be greatly interested. Alex Turner netEconomist On Mar 29, 2005 8:17 AM, Dave Cramer [EMAIL PROTECTED] wrote: Yeah, 35MB per sec is slow for a RAID controller; the 3ware mirrored is about 50MB/sec, and striped is about 100. Dave PFC wrote: With hardware tuning, I am sure we can do better than 35MB per sec. Also WTF? My laptop does 19 MB/s (reading 10 KB files, reiser4)! A recent desktop 7200rpm IDE drive:
# hdparm -t /dev/hdc1
/dev/hdc1: Timing buffered disk reads: 148 MB in 3.02 seconds = 49.01 MB/sec
# ll "DragonBall 001.avi"
-r--r--r-- 1 peufeu users 218M mar 9 20:07 DragonBall 001.avi
# time cat "DragonBall 001.avi" > /dev/null
real 0m4.162s
user 0m0.020s
sys 0m0.510s
(the file was not in the cache) = about 52 MB/s (reiser3.6) So, you have a problem with your hardware... -- Dave Cramer http://www.postgresintl.com 519 939 0336 ICQ#14675561
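PFC's `time cat` measurement translates directly into a small script. This is only a sketch, and like the original test it measures the page cache unless the file is cold:

```python
# Time a sequential read of a file and report MB/sec, roughly what
# "time cat bigfile > /dev/null" measures.
import os
import time

def sequential_read_mb_s(path, chunk_size=1 << 20):
    """Read the file front to back in 1MB chunks and compute MB/sec."""
    size_mb = os.path.getsize(path) / (1 << 20)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass
    return size_mb / (time.perf_counter() - start)

# e.g. print(sequential_read_mb_s("/tmp/bigfile"))
```

For a real baseline, use a file larger than RAM or drop the cache first; otherwise, as PFC notes, you're benchmarking memory, not the disk.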
Re: [PERFORM] How to improve db performance with $7K?
I'm no drive expert, but it seems to me that our write performance is excellent. I think what most are concerned about is OLTP, where you are doing heavy write _and_ heavy read activity at the same time. Our system is mostly read during the day, but we do a full system update every night that is all writes, and it's very fast compared to the smaller SCSI system we moved off of. Nearly a 6x speed improvement, as fast as 900 rows/sec with a 48 byte record, one row per transaction. I don't know enough about how SATA works to really comment on its performance as a protocol compared with SCSI. If anyone has a useful link on that, it would be greatly appreciated. More drives will give more throughput/sec, but not necessarily more transactions/sec. For that you will need more RAM on the controller, and definitely a BBU to keep your data safe. Alex Turner netEconomist On Apr 4, 2005 10:39 AM, Steve Poe [EMAIL PROTECTED] wrote: Alex Turner wrote: To be honest, I've yet to run across a SCSI configuration that can touch the 3ware SATA controllers. I have yet to see one top 80MB/sec, let alone 180MB/sec read or write, which is why we moved _away_ from SCSI. I've seen Compaq, Dell and LSI controllers all do pathetically badly on RAID 1, RAID 5 and RAID 10. Alex, How does the 3ware controller do on heavy writes back to the database? It may have been Josh, but someone said that SATA does well with reads but not writes. Would not an equal number of SCSI drives outperform SATA? I don't want to start a 'who's better' war, I am just trying to learn here. It would seem the more drives you could place in a RAID configuration, the more performance would increase. Steve Poe
Re: [PERFORM] How to improve db performance with $7K?
I'm doing some research on SATA vs SCSI right now, but to be honest I'm not turning up much at the protocol level. A lot of stupid benchmarks comparing 10k Raptor drives against top-of-the-line 15k drives, where unsurprisingly the SCSI drives win but of course cost 4 times as much. Although even in some, SATA wins, or draws. I'm trying to find something more apples-to-apples: 10k to 10k. Alex Turner netEconomist On Apr 4, 2005 3:23 PM, Vivek Khera [EMAIL PROTECTED] wrote: On Apr 4, 2005, at 3:12 PM, Alex Turner wrote: Our system is mostly read during the day, but we do a full system update every night that is all writes, and it's very fast compared to the smaller SCSI system we moved off of. Nearly a 6x speed improvement, as fast as 900 rows/sec with a 48 byte record, one row per transaction. Well, if you're not heavily multitasking, the advantage of SCSI is lost on you. Vivek Khera, Ph.D. +1-301-869-4449 x806
Re: [PERFORM] Sustained inserts per sec ... ?
Oh - this is with a separate transaction per command. fsync is on. Alex Turner netEconomist On Apr 1, 2005 4:17 PM, Alex Turner [EMAIL PROTECTED] wrote: 1250/sec with record size averaging 26 bytes. 800/sec with record size averaging 48 bytes. 250/sec with record size averaging 618 bytes. Data from pg_stats and our own job monitoring. The system has four partitions: two RAID 1s, a four-disk RAID 10 and a six-disk RAID 10. pg_xlog is on the four-disk RAID 10, the database is on the six-disk RAID 10. Data is very spread out because database turnover time is very high, so our performance is about double this with a fresh DB (the data half-life is probably measurable in days or weeks). Alex Turner netEconomist On Apr 1, 2005 1:06 PM, Marc G. Fournier [EMAIL PROTECTED] wrote: Just curious, but does anyone have an idea of what we are capable of? I realize that size of record would affect things, as well as hardware, but if anyone has some ideas on max, with 'record size', that would be appreciated ... Thanks ... Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: [EMAIL PROTECTED] Yahoo!: yscrappy ICQ: 7615664
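The one-row-per-transaction rates above are in the range you'd expect from commit latency rather than bandwidth. A hedged rule of thumb (my framing, not Alex's): with fsync on and no write-back cache, each commit waits for a WAL write to reach the platter, so a single stream commits at most about once per disk revolution:

```python
# Upper bound on single-stream commits/sec when every commit must wait
# for a physical WAL write (no battery-backed cache, no group commit).

def max_commits_per_sec(rpm, commits_per_revolution=1):
    """One synchronous WAL flush per commit, at most one per revolution."""
    return rpm / 60.0 * commits_per_revolution

print(max_commits_per_sec(7200))   # 120.0
print(max_commits_per_sec(10000))  # ~166.7
print(max_commits_per_sec(15000))  # 250.0
```

Alex's 800-1250/sec figures sit well above any single-spindle bound, consistent with the dedicated pg_xlog RAID 10 plus controller cache absorbing the flushes.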
Re: [PERFORM] Postgres on RAID5
A 14-drive stripe will max out the PCI bus long before anything else; the only reason for a stripe this size is to get the total accessible size up. A 6-drive RAID 10 on a good controller can get up to 400MB/sec, which is pushing the limit of the PCI bus (taken from official 3ware 9500S-8MI benchmarks). 14 drives is not going to beat 6 drives because you've run out of bandwidth on the PCI bus. The debate on RAID 5 rages onward. The benchmarks I've seen suggest that RAID 5 is consistently slower than RAID 10 with the same number of drives, but others suggest that RAID 5 can be much faster than RAID 10 (see arstechnica.com). (Theoretical performance of RAID 5 is in line with a RAID 0 stripe of N-1 drives; RAID 10 has only N/2 drives in a stripe, so RAID 5 performance should be nearly double - in theory, of course.) 35 trans/sec is pretty slow, particularly if they are only one row at a time. I typically get 200-400/sec on our DB server on a bad day, up to 1100 on a fresh database. I suggested running a bonnie benchmark, or some other IO perf test, to determine if it's the array itself performing badly, or if there is something wrong with PostgreSQL. If the array isn't kicking out at least 50MB/sec read/write performance, something is wrong. Until you've isolated the problem to either Postgres or the array, everything else is simply speculation. In a perfect world, you would have two 6-drive RAID 10s on two PCI busses, with system tables on a third partition, and archive logging on a fourth. Unsurprisingly this looks a lot like the Oracle recommended minimum config. Also of note is that this is _software_ RAID... Alex Turner netEconomist On 13 Mar 2005 23:36:13 -0500, Greg Stark [EMAIL PROTECTED] wrote: Arshavir Grigorian [EMAIL PROTECTED] writes: Hi, I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has an Ext3 filesystem which is used by Postgres. People are going to suggest moving to RAID1+0. 
I'm unconvinced that RAID5 across 14 drives shouldn't be able to keep up with RAID1 across 7 drives, though. It would be interesting to see empirical data. One thing that does scare me is the Postgres transaction log and the ext3 journal both sharing these disks with the data. Ideally both of these things should get (mirrored) disks of their own, separate from the data files. But 2-3s pauses seem disturbing. I wonder whether ext3 is issuing a cache flush on every fsync to get the journal pushed out. This is a new Linux feature that's necessary with IDE but shouldn't be necessary with SCSI. It would be interesting to know whether Postgres performs differently with fsync=off. This would even be a reasonable mode to run under for initial database loads. It shouldn't make much of a difference with hardware like this, though. And you should be aware that running in this mode in production would put your data at risk. -- greg
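Alex's parenthetical theory (RAID 5 streams like a stripe of N-1 drives, RAID 10 like a stripe of N/2) is easy to put in numbers. The 70MB/sec per-drive figure is the single-drive number quoted earlier in this thread and is purely illustrative; real RAID 5 write throughput usually lands far below this because of the parity read-modify-write penalty, which this model ignores:

```python
# Theoretical large-sequential throughput for N drives, ignoring bus
# limits, parity computation, and RAID 5 write penalties.

def raid5_stream_mb_s(n_drives, per_drive_mb_s=70.0):
    """RAID 5: data striped across N-1 drives (one drive's worth is parity)."""
    return (n_drives - 1) * per_drive_mb_s

def raid10_stream_mb_s(n_drives, per_drive_mb_s=70.0):
    """RAID 10: mirrored pairs, so only N/2 drives' worth of stripe width."""
    return (n_drives // 2) * per_drive_mb_s

print(raid5_stream_mb_s(14), raid10_stream_mb_s(14))  # 910.0 490.0
```

Both 14-drive figures blow past the ~400MB/sec bus ceiling Alex cites, which is his point: past roughly 6 drives, the PCI bus, not the spindles, is the bottleneck.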
Re: [PERFORM] Postgres on RAID5
He doesn't have a RAID controller, it's software RAID... Alex Turner netEconomist On Mon, 14 Mar 2005 16:18:00 -0500, Merlin Moncure [EMAIL PROTECTED] wrote: Alex Turner wrote: 35 trans/sec is pretty slow, particularly if they are only one row at a time. I typically get 200-400/sec on our DB server on a bad day, up to 1100 on a fresh database. Well, don't rule out that his RAID controller is not caching his writes. His WAL sync method may be overriding his RAID cache policy and flushing his writes to disk, always. Win32 has the same problem, and before Magnus's O_DIRECT patch, there was no way to easily work around it without turning fsync off. I'd suggest playing with different WAL sync methods before trying anything else. Merlin
Re: [PERFORM] Postgres on RAID5
I would recommend running a bonnie++ benchmark on your array to see whether it's the array/controller/RAID being crap, or whether it's Postgres. I have had some very surprising results from arrays that theoretically should be fast, but turned out to be very slow. I would also seriously have to recommend against a 14-drive RAID 5! This is statistically as likely to fail as a 7-drive RAID 0 (not counting the spare, but rebuilding onto a spare is very hard on the existing drives). Alex Turner netEconomist On Fri, 11 Mar 2005 16:13:05 -0500, Arshavir Grigorian [EMAIL PROTECTED] wrote: Hi, I have a RAID5 array (mdadm) with 14 disks + 1 spare. This partition has an Ext3 filesystem which is used by Postgres. Currently we are loading a 50G database on this server from a Postgres dump (copy, not insert) and are experiencing very slow write performance (35 records per second). Top shows that the Postgres process (postmaster) is being constantly put into D state for extended periods of time (2-3 seconds), which I assume is because it's waiting for disk IO. I have just started gathering system statistics and here is what sar -b shows (this is while the db is being loaded - pg_restore):

             tps     rtps    wtps    bread/s  bwrtn/s
01:35:01 PM  275.77  76.12   199.66  709.59   2315.23
01:45:01 PM  287.25  75.56   211.69  706.52   2413.06
01:55:01 PM  281.73  76.35   205.37  711.84   2389.86

[snip]
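The reliability warning can be sketched with a simple independent-failure model. The 3% annual failure rate below is a made-up illustrative figure, and the model deliberately ignores the rebuild-stress and correlated-failure effects Alex alludes to, which are exactly what makes wide RAID 5 sets risky in practice:

```python
# P(at least one of n drives fails within a year), assuming independent
# failures with a per-drive annual failure rate (AFR). Illustrative only.

def p_any_failure(n_drives, afr=0.03):
    return 1 - (1 - afr) ** n_drives

# A 14-drive set sees a failure event nearly twice as often as a 7-drive set.
# For RAID 0 that first failure is data loss outright; for RAID 5 it starts
# a rebuild during which the array runs with no redundancy, under the heavy
# read load that rebuilds put on every surviving drive.
print(p_any_failure(7))   # ~0.19
print(p_any_failure(14))  # ~0.35
```
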
Re: [PERFORM] multi billion row tables: possible or insane?
Not true - with fsync on I get nearly 500 tx/s; with it off I'm as high as 1600/sec, with dual Opterons, 14 SATA drives and 4GB RAM on a 3ware Escalade. The database has 3 million rows. As long as queries use indexes, multi-billion rows shouldn't be too bad. Full table scans will suck, though. Alex Turner netEconomist On Tue, 1 Mar 2005 16:40:29 +0100, Vig, Sandor (G/FI-2) [EMAIL PROTECTED] wrote: 385 transactions/sec? fsync = false - risky but fast. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of John Arbash Meinel Sent: Tuesday, March 01, 2005 4:19 PM To: Ramon Bastiaans Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] multi billion row tables: possible or insane? Ramon Bastiaans wrote: Hi all, I am doing research for a project of mine where I need to store several billion values for a monitoring and historical tracking system for a big computer system. My current estimate is that I have to store (somehow) around 1 billion values each month (possibly more). If you have that 1 billion perfectly distributed over all hours of the day, then you need 1e9/30/24/3600 = 385 transactions per second. Which I'm pretty sure is possible with postgres, you just need pretty beefy hardware. And like Jeff said, lots of disks for lots of IO. Like a quad Opteron with 16GB of RAM and around 14-20 very fast disks. RAID 10, not RAID 5, etc. To improve query performance, you can do some load balancing by adding replica machines using Slony. Or if you can do batch processing, you could split up the work into a few update machines, which then do bulk updates on the master database. This lets you get more machines into the job, since you can't share a database across multiple machines. I was wondering if anyone has had any experience with these kind of big numbers of data in a postgres sql database and how this affects database design and optimization. 
Well, one of the biggest things is whether you can get bulk updates, or whether clients can handle data being slightly out of date, so you can use caching. Can you segregate your data into separate tables as much as possible? Are your clients okay if aggregate information takes a little while to update? One trick is to use semi-lazy materialized views to get your updates to be fast. What would be important issues when setting up a database this big, and is it at all doable? Or would it be insane to think about storing up to 5-10 billion rows in a postgres database. I think if you can design the db properly, it is doable. But if you have clients saying I need up-to-the-second information on 1 billion rows, you're never going to get it. The database's performance is important. There would be no use in storing the data if a query will take ages. Queries should be quite fast if possible. Again, it depends on the queries being done. There are some nice tricks you can use, like doing month-by-month partitioning (if you are getting 1G inserts a month, you might want week-by-week partitioning), and then with a date column index and a UNION ALL view you should be able to get pretty good insert speed and still keep fast *recent* queries. Going through 1 billion rows is always going to be expensive. I would really like to hear people's thoughts/suggestions or go see a shrink, you must be mad statements ;) Kind regards, Ramon Bastiaans I think it would be possible, but there are a lot of design issues with a system like this. You can't go into it thinking that you can design a multi-billion-row database the same way you would design a million-row db. John =:-
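John's sizing arithmetic checks out (he truncates rather than rounds):

```python
# 1 billion rows/month, spread perfectly evenly over a 30-day month.
rows_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600  # 2,592,000
tx_per_sec = rows_per_month / seconds_per_month
print(tx_per_sec)  # ~385.8, the "385 transactions per second" in the thread
```

As the thread notes, that is an average; real monitoring loads are bursty, so the peak rate the hardware must sustain will be higher.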