[HACKERS] Patch : seq scan readahead (WIP)

2009-08-08 Thread Pierre Frédéric Caillaud


This is a spinoff of the current work on compression...
I've discovered that Linux doesn't apply readahead to sparse files.
So I added a little readahead in seq scans.

Then I realized this might also be beneficial for the standard Postgres.
On my RAID1 it shows some pretty drastic effects.

The PC :

- RAID1 of 2xSATA disks, reads at about 60 MB/s
- RAID5 of 3xSATA disks, reads at about 210 MB/s

Both RAIDs are Linux Software RAID.

Test data :

A 9.3GB table whose rows are not too small, so count(*) doesn't use much CPU.

The problem :

- On the RAID5 there is no problem, count(*) maxes out the disk.
- On the RAID1, count(*) also maxes out one disk, but there are 2 disks:
one works while the other sits idle.
Linux Software RAID cannot use 2 disks on sequential reads, at least on my
kernel version. What do your boxes do in such a situation?

For standard postgres, iostat says :

Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda         3,00         0,00        40,00          0         40
sdb       727,00    116600,00        40,00     116600         40

Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda       124,00     23408,00         0,00      23408          0
sdb       628,00    101640,00         0,00     101640          0

Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda       744,00    124536,00         0,00     124536          0
sdb         0,00         0,00         0,00          0          0

Basically it is reading the disks in turn, but not at the same time.
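
As a sanity check on these figures, the Blk_read/s column can be converted to MB/s. This is a minimal sketch, assuming iostat counts 512-byte blocks (which is what its man page states for 2.4 and later kernels) and 1 MB = 10^6 bytes. The function name is made up for illustration.

```c
#include <assert.h>

/* Assumption: iostat "Blk" units are 512-byte sectors (Linux 2.4 and later). */
#define IOSTAT_BLOCK_BYTES 512.0

/* Convert an iostat Blk_read/s figure to MB/s, with 1 MB = 10^6 bytes. */
double blocks_per_sec_to_mb(double blk_per_sec)
{
    assert(blk_per_sec >= 0.0);
    return blk_per_sec * IOSTAT_BLOCK_BYTES / 1e6;
}
```

Under that assumption, 116600 blocks/s works out to about 59.7 MB/s, which matches the roughly 60 MB/s a single disk of the RAID1 delivers.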

The solution :

Somehow coerce Linux Software RAID to stripe reads across the 2 mirrors to  
get more throughput.


After a bit of fiddling, this seems to do it :

- for each page read in a seq scan :

Strategy 0 : do nothing (this is the current strategy)
Strategy 1 : issue a Prefetch call 4096 pages ahead (32MB) of current position
Strategy 2 : if (the current page & 4096) != 0, issue a Prefetch call 4096 pages ahead (32MB) of current position
Strategy 3 : issue a prefetch at 32MB * ((the current page & 4096) ? 1 : 2) ahead of current position
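
The four strategies can be sketched as one standalone function that returns the block to prefetch for a given page. This is an illustration and not the patch itself. It assumes the test on the current page is a bitwise AND with 4096 (as the condition in the attached patch suggests), and it assumes the target wraps around modulo the number of blocks in the table. `prefetch_target`, `READAHEAD_PAGES` and `NO_PREFETCH` are names invented for this sketch.

```c
#include <assert.h>

#define READAHEAD_PAGES 4096L   /* 4096 pages * 8 kB = 32 MB */
#define NO_PREFETCH     (-1L)

/*
 * Return the block to prefetch when the scan reaches `page`,
 * or NO_PREFETCH if the strategy issues no hint for this page.
 * Wrapping the target modulo nblocks is an assumption of this sketch.
 */
long prefetch_target(int strategy, long page, long nblocks)
{
    long distance;

    assert(nblocks > 0 && page >= 0 && page < nblocks);

    switch (strategy)
    {
        case 1:     /* always prefetch 32 MB ahead */
            distance = READAHEAD_PAGES;
            break;
        case 2:     /* prefetch only in every other 32 MB chunk */
            if ((page & READAHEAD_PAGES) == 0)
                return NO_PREFETCH;
            distance = READAHEAD_PAGES;
            break;
        case 3:     /* alternate between 32 MB and 64 MB ahead */
            distance = READAHEAD_PAGES * ((page & READAHEAD_PAGES) ? 1 : 2);
            break;
        default:    /* strategy 0: no prefetch */
            return NO_PREFETCH;
    }
    return (page + distance) % nblocks;
}
```

With strategy 2, pages 0 to 4095 issue no hint, pages 4096 to 8191 each hint the page 32MB ahead, and so on in alternating 32MB chunks.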


Results to seq scan 9.3GB of data on the RAID5 :

Strategy 0 : 46.4 s
It maxes out the disks anyway, so I didn't try the others.
However, RAID1 is a better fit for databases that are not mostly read-only...

Results to seq scan 9.3GB of data on the RAID1 :

Strategy 0 : 162.8 s
Strategy 1 : 152.9 s
Strategy 2 : 105.2 s
Strategy 3 : 152.3 s

Strategy 2 cuts the seq scan duration by 35%, i.e. disk bandwidth gets a
+54% boost.
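
The two percentages describe the same measurement from two angles: the time saved relative to the old duration, and the extra throughput for the same amount of data. A small sketch of the arithmetic, with helper names chosen for illustration:

```c
#include <assert.h>

/* Fraction by which the scan time shrinks, e.g. 0.35 for "35% shorter". */
double time_reduction(double t_before, double t_after)
{
    assert(t_before > 0.0 && t_after > 0.0);
    return 1.0 - t_after / t_before;
}

/* Relative throughput gain when the same amount of data is read faster. */
double bandwidth_gain(double t_before, double t_after)
{
    assert(t_before > 0.0 && t_after > 0.0);
    return t_before / t_after - 1.0;
}
```

Called with 162.8 and 105.2 seconds, these give about 0.354 and 0.548 respectively.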


For strategy 2, iostat says :

Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda       625,00    105288,00         0,00     105288          0
sdb       820,00    105968,00         0,00     105968          0

Both RAID1 disks are used at the same time.

I guess it would need some experimenting with the values, and a  
per-tablespace setting, but since lots of people use Linux Software RAID1  
on servers, this might be interesting...
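
For experimenting with the values outside PostgreSQL, something like the following standalone sketch might be a starting point. It is not the patch: it reads a plain file in 8 kB pages and applies the strategy 2 idea using posix_fadvise() with POSIX_FADV_WILLNEED, assuming the page test is a bitwise AND with the readahead distance. `scan_with_readahead` and its `ra_pages` parameter are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_BYTES 8192L    /* PostgreSQL's default block size */

/*
 * Read `path` sequentially in 8 kB pages. While the current page number
 * has bit `ra_pages` set, hint the kernel about the page ra_pages ahead.
 * ra_pages must be 0 (no hints) or a power of two; the patch uses 4096.
 * Returns the number of reads that returned data (one per page for a
 * regular file), or -1 on error.
 */
long scan_with_readahead(const char *path, long ra_pages)
{
    static char buf[PAGE_BYTES];
    long        page = 0;
    ssize_t     n;
    int         fd;

    assert(ra_pages >= 0 && (ra_pages & (ra_pages - 1)) == 0);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    while ((n = read(fd, buf, sizeof buf)) > 0)
    {
        if (ra_pages > 0 && (page & ra_pages))
        {
            off_t ahead = (off_t) (page + ra_pages) * PAGE_BYTES;

            /* Advisory only: the kernel may ignore it, errors are not fatal. */
            (void) posix_fadvise(fd, ahead, PAGE_BYTES, POSIX_FADV_WILLNEED);
        }
        page++;
    }
    close(fd);
    return (n < 0) ? -1 : page;
}
```

Timing a call such as scan_with_readahead(path, 4096) against scan_with_readahead(path, 0) on a large file, with the page cache dropped in between, would be one way to compare distances.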


You guys want to try it ?

Patch attached.










diff -rupN postgresql-8.4.0-orig/src/backend/access/heap/heapam.c postgresql-8.4.0-ra/src/backend/access/heap/heapam.c
--- postgresql-8.4.0-orig/src/backend/access/heap/heapam.c  2009-06-11 16:48:53.0 +0200
+++ postgresql-8.4.0-ra/src/backend/access/heap/heapam.c  2009-08-08 10:41:15.0 +0200
@@ -135,6 +135,8 @@ initscan(HeapScanDesc scan, ScanKey key,
     {
         if (scan->rs_strategy == NULL)
             scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
+
+        scan->rs_readahead_pages = 4096;    /* TODO: GUC ? or maybe put it in AccessStrategy ? */
     }
     else
     {
@@ -766,6 +768,12 @@ heapgettup_pagemode(HeapScanDesc scan,
             if (page == 0)
                 page = scan->rs_nblocks;
             page--;
+
+            /*
+             * do some extra readahead (really needed for compressed files)
+             */
+            if( scan->rs_readahead_pages && !finished )
+                PrefetchBuffer( scan->rs_rd, MAIN_FORKNUM, page - scan->rs_readahead_pages + ((page >= scan->rs_readahead_pages) ? 0 : scan->rs_nblocks));
         }
         else
         {
@@ -788,6 +796,13 @@ heapgettup_pagemode(HeapScanDesc scan,
              */
             if (scan->rs_syncscan)
                 ss_report_location(scan->rs_rd, page);
+
+            /*
+             * do some extra readahead (really needed for compressed files)
+             */
+
+            if( scan->rs_readahead_pages && !finished && (page & 4096))
+                PrefetchBuffer( scan->rs_rd, MAIN_FORKNUM, (page + scan->rs_readahead_pages) % 

Re: [HACKERS] Patch : seq scan readahead (WIP)

2009-08-08 Thread Albert Cervera i Areny
On Saturday, 8 August 2009, Pierre Frédéric Caillaud wrote:
> I guess it would need some experimenting with the values, and a
> per-tablespace setting, but since lots of people use Linux Software RAID1
> on servers, this might be interesting...
>
> You guys want to try it ?

Your tests involve only one user. What about having two (or more) users
reading different tables? You're using both disks for one user for only a
35% performance gain...


> Patch attached.


-- 
Albert Cervera i Areny
http://www.NaN-tic.com
Mòbil: +34 669 40 40 18

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers