Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Greg Smith

On Tue, 12 Aug 2008, Ron Mayer wrote:


Really old software (notably 2.4 Linux kernels) didn't send
cache synchronizing commands for either SCSI or ATA; but
it seems well thought through in the 2.6 kernels, as described
in the Linux kernel documentation.
http://www.mjmwired.net/kernel/Documentation/block/barrier.txt


If you've drunk the kool-aid you might believe that.  When I see people
asking about this in early 2008 at
http://thread.gmane.org/gmane.linux.kernel/646040 and serious disk driver
hacker Jeff Garzik says "It's completely ridiculous that we default to an
unsafe fsync.", I don't know about you but that barrier documentation
doesn't make me feel warm and safe anymore.



If you do have a disk where you need to disable write caches,
I'd love to know the name of the disk and see the output of
hdparm -I /dev/sd*** to see if it claims to support such
cache flushes.


The below disk writes impossibly fast when I issue a sequence of fsync
writes to it under the CentOS 5 Linux I was running on it.  It should only
be possible to do at most 120/second since it's a 7200 RPM drive, and if I
poke it with hdparm -W0 first it behaves.  The drive is a known piece of
junk from circa 2004, and it's worth noting that it's an ext3 filesystem in
an md0 RAID-1 array (aren't there issues with md and the barriers?)


# hdparm -I /dev/hde

/dev/hde:

ATA device, with non-removable media
Model Number:   Maxtor 6Y250P0
Serial Number:  Y62K95PE
Firmware Revision:  YAR41BW0
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 0
Supported: 7 6 5 4
Configuration:
Logical max current
cylinders   16383   65535
heads   16  1
sectors/track   63  63
--
CHS current addressable sectors:    4128705
LBA    user addressable sectors:  268435455
LBA48  user addressable sectors:  490234752
device size with M = 1024*1024:  239372 MBytes
device size with M = 1000*1000:  251000 MBytes (251 GB)
Capabilities:
LBA, IORDY(can be disabled)
Standby timer values: spec'd by Standard, no device specific 
minimum

R/W multiple sector transfer: Max = 16  Current = 16
Advanced power management level: unknown setting (0x)
Recommended acoustic management value: 192, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_VERIFY command
   *WRITE_BUFFER command
   *READ_BUFFER command
   *NOP cmd
   *DOWNLOAD_MICROCODE
Advanced Power Management feature set
SET_MAX security extension
   *Automatic Acoustic Management feature set
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] long transaction

2008-08-13 Thread Sabin Coanda

 have you considered importing to a temporary 'holding' table with
 copy, then doing 'big' sql statements on it to check constraints, etc?


Yes, I considered it, but the problem is that the data is very tightly
related between different tables, and it is important to keep the import
order of each entity into the database. In other words, the entity import
serialization is mandatory. In fact the import script doesn't contain just
inserts but also deletes and updates for different entities, so COPY is not
enough. Also, using 'big' SQL statements cannot guarantee the import order.

Sabin 



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Scott Marlowe
On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer
[EMAIL PROTECTED] wrote:
 Scott Marlowe wrote:

 I can attest to the 2.4 kernel not being able to guarantee fsync on
 IDE drives.

 Sure.  But note that it won't for SCSI either; since AFAICT the write
 barrier support was implemented at the same time for both.

Tested both by pulling the power plug.  The SCSI was pulled 10 times
while running 600 or so concurrent pgbench threads, and so was the
IDE.  The SCSI came up clean every single time, the IDE came up
corrupted every single time.

I find it hard to believe there was no difference in write barrier
behaviour with those two setups.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Matthew Wakeling

On Tue, 12 Aug 2008, Ron Mayer wrote:

Really old software (notably 2.4 Linux kernels) didn't send
cache synchronizing commands for either SCSI or ATA;


Surely not true. Write cache flushing has been a known problem in the
computer science world for several tens of years. The difference is that
in the past we only had a "flush everything" command, whereas now we have
a "flush everything before the barrier, before anything after the barrier"
command.


Matthew

--
To err is human; to really louse things up requires root
privileges. -- Alexander Pope, slightly paraphrased

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Ron Mayer

Scott Marlowe wrote:

On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer ...wrote:

Scott Marlowe wrote:

I can attest to the 2.4 kernel ...

...SCSI...AFAICT the write barrier support...


Tested both by pulling the power plug.  The SCSI was pulled 10 times
while running 600 or so concurrent pgbench threads, and so was the
IDE.  The SCSI came up clean every single time, the IDE came up
corrupted every single time.


Interesting.  With a pre-write-barrier 2.4 kernel I'd
expect corruption in both.
Perhaps all caches were disabled in the SCSI drives?


I find it hard to believe there was no difference in write barrier
behaviour with those two setups.


Skimming lkml it seems write barriers for SCSI were
behind (in terms of implementation) those for ATA:
http://lkml.org/lkml/2005/1/27/94
Jan 2005 ... "scsi/sata write barrier support" ...
 "For the longest time, only the old PATA drivers
 supported barrier writes with journalled file systems.
 This patch adds support for the same type of cache
 flushing barriers that PATA uses for SCSI."

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Ron Mayer

Greg Smith wrote:
The below disk writes impossibly fast when I issue a sequence of fsync 


'k.  I've got some homework.  I'll be trying the same thing
with md raid, old IDE drives, etc. to see if I can reproduce them.
I assume test_fsync in the postgres source distribution is
a decent way to see?

driver hacker Jeff Garzik says "It's completely ridiculous that we
default to an unsafe fsync."


Yipes indeed.  Still makes me want to understand why people
claim IDE suffers more than SCSI, tho.  Ext3 bugs seem likely
to affect both to me.


writes to it under the CentOS 5 Linux I was running on it. ...
junk from circa 2004, and it's worth noting that it's an ext3 filesystem 
in an md0 RAID-1 array (aren't there issues with md and the barriers?)


Apparently various distros vary a lot in how they're set
up (SuSE defaults to mounting ext3 with the barrier=1
option; other distros seemed not to, etc.).

I'll do a number of experiments with md, a few different drives,
etc. today and see if I can find issues with any of the
drives (and/or filesystems) around here.

But I still am looking for any evidence that there were any
widely shipped SATA (or even IDE drives) that were at fault,
as opposed to filesystem bugs and poor settings of defaults.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Scott Marlowe
On Wed, Aug 13, 2008 at 8:41 AM, Ron Mayer
[EMAIL PROTECTED] wrote:
 Greg Smith wrote:

 But I still am looking for any evidence that there were any
 widely shipped SATA (or even IDE drives) that were at fault,
 as opposed to filesystem bugs and poor settings of defaults.

Well, if they're getting more than 150/166.6/250 transactions per
second without a battery backed cache, then they're likely lying about
fsync.  And most SATA and IDE drives will give you way over that for a
small data set.
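
For reference, a rough back-of-the-envelope sketch of where such ceilings
come from (illustrative only; the figures assume the common 7200, 10k and
15k RPM spindle speeds and ignore controller overhead): a commit that
genuinely waits for the platter can complete at most about once per
revolution, i.e. roughly RPM/60 per second.

-- Rotational ceiling on honestly fsync'd commits per second,
-- assuming at most one commit per platter revolution (sketch only).
SELECT rpm, round(rpm / 60.0, 1) AS max_commits_per_sec
FROM (VALUES (7200), (10000), (15000)) AS spindle(rpm);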

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] query plan, index scan cost

2008-08-13 Thread Decibel!

On Jul 18, 2008, at 5:28 AM, Stefan Zweig wrote:

CREATE TABLE nw_tla_2008_4_deu
(
"ID" bigint NOT NULL,
"NET2CLASS" smallint,
"FOW" smallint,
CONSTRAINT nw_tla_2008_4_deu_pkey PRIMARY KEY ("ID")
)
WITHOUT OIDS;


You might want to give up on the double-quotes... you'll have to use  
them everywhere. It'd drive me nuts... :)
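
A tiny sketch of what that means in practice (hypothetical table names,
purely to illustrate how quoted identifiers behave):

-- A double-quoted identifier keeps its exact case, so every later
-- reference has to quote it the same way:
CREATE TABLE quoting_demo ("ID" bigint PRIMARY KEY);
SELECT "ID" FROM quoting_demo;    -- works
-- SELECT ID FROM quoting_demo;   -- fails: unquoted ID folds to "id"

-- An unquoted identifier folds to lower case and can be written any way:
CREATE TABLE quoting_demo2 (id bigint PRIMARY KEY);
SELECT ID FROM quoting_demo2;     -- same column as id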



EXPLAIN
ANALYZE

SELECT
nw."ID" AS id

FROM
nw_tla_2008_4_deu AS nw

WHERE
expand(st_pointfromtext('POINT(13.7328934 51.049476)',4326),
0.2476961598054) && nw.the_geom

AND nw."FOW" IN (1,2,3,4,10,17)
AND nw."NET2CLASS" IN (0,1,2,3)

snip

Total runtime: *13.332* ms


Running the next query, which is only slightly different and has one
instead of two AND conditions, leads to the following result:


EXPLAIN
ANALYZE

SELECT
nw."ID" AS id

FROM
nw_tla_2008_4_deu AS nw

WHERE
expand(st_pointfromtext('POINT(13.7328934 51.049476)',4326),
0.2476961598054) && nw.the_geom

AND nw."FOW" IN (1,2,3,4,10,17)

snip

Total runtime: *109*ms


So in both queries there are AND conditions: two AND conditions in the
first query and one AND condition in the second query. Unfortunately I am
not an expert in reading the Postgres query plan. Basically I am wondering
why in the first query a second index scan is done whereas in the second
query the second index scan is not done. The second query runs a hundred
times faster than the first one, which is surprising to me.


The second index scan wasn't done in the second query because you  
don't have the second IN clause. And it's actually the 1st query that  
was faster, because it returned fewer rows (15k instead of 45k).

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Decibel!

On Aug 11, 2008, at 9:01 AM, Jeff wrote:

On Aug 11, 2008, at 5:17 AM, Henrik wrote:

OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
handling 112 MB/sec. So it is almost twice as fast as the RAID 10
with the same disks. Any ideas why?


Are the iozone tests faulty?



does IOzone disable the os caches?
If not you need to use a size of 2xRAM for true results.

regardless - the test only took 10 seconds of wall time - which  
isn't very long at all. You'd probably want to run it longer anyway.



Additionally, you need to be careful of what size writes you're  
using. If you're doing random writes that perfectly align with the  
raid stripe size, you'll see virtually no RAID5 overhead, and you'll  
get the performance of N-1 drives, as opposed to RAID10 giving you N/2.
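
As a rough illustration of that N-1 versus N/2 point (sketch arithmetic
only; real arrays also depend on the controller, cache and stripe width):

-- Effective data spindles available for writes, assuming perfectly
-- full-stripe-aligned RAID5 writes (no read-modify-write) vs. RAID10.
SELECT n_disks,
       n_disks - 1 AS raid5_full_stripe_spindles,
       n_disks / 2 AS raid10_spindles
FROM (VALUES (4), (8), (14)) AS arr(n_disks);

With 4 disks that's 3 effective spindles versus 2 -- the same general
direction as the "almost twice as fast" result reported above, though
other factors clearly contribute.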

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Decibel!

On Aug 12, 2008, at 4:59 PM, Chris Kratz wrote:
Ran into a recurring performance problem with some report queries
again today.  In a nutshell, we have filters on either multiple  
joined tables, or multiple columns on a single table that are  
highly correlated.  So, the estimates come out grossly incorrect  
(the planner has no way to know they are correlated).  2000:1 for  
one I'm looking at right now.  Generally this doesn't matter,  
except in complex reporting queries like these when this is the  
first join of 40 other joins.  Because the estimate is wrong at the  
lowest level, it snowballs up through the rest of the joins causing  
the query to run very, very slowly.   In many of these cases,  
forcing nested loops off for the duration of the query fixes the  
problem.  But I have a couple that still are painfully slow and  
shouldn't be.


I've been reading through the archives with others having similar  
problems (including myself a year ago).  Am I right in assuming  
that at this point there is still little we can do in postgres to  
speed up this kind of query?  Right now the planner has no way to  
know the correlation between different columns in the same table,  
let alone columns in different tables.  So, it just assumes no  
correlation and returns incorrectly low estimates in cases like these.


The only solution I've come up with so far is to materialize  
portions of the larger query into subqueries with these correlated  
filters which are indexed and analyzed before joining into the  
larger query.  This would keep the incorrect estimates from  
snowballing up through the chain of joins.
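
A sketch of that kind of workaround (hypothetical table and column names,
just to show the shape of it):

-- Materialize the correlated filters first so the planner gets fresh,
-- accurate statistics for that piece of the query.
CREATE TEMP TABLE filtered_orders AS
SELECT o.*
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.region = 'NW'          -- this filter is correlated with...
  AND c.region = 'NW';         -- ...this one on the joined table

CREATE INDEX filtered_orders_cust_idx ON filtered_orders (customer_id);
ANALYZE filtered_orders;

-- The big reporting query then joins against filtered_orders, so the
-- bad 1-row estimate never snowballs through the rest of the joins.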


Are there any other solutions to this problem?



Well... you could try and convince certain members of the community  
that we actually do need some kind of a query hint mechanism... ;)


I did make a suggestion a few months ago that involved sorting a  
table on different columns and recording the correlation of other  
columns. The scheme isn't perfect, but it would help detect cases  
like a field populated by a sequence and another field that's insert  
timestamp; those two fields would correlate highly, and you should  
even be able to correlate the two histograms; that would allow you to  
infer that most of the insert times for _id's between 100 and 200  
will be between 2008-01-01 00:10 and 2008-01-01 00:20, for example.

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Greg Smith

On Wed, 13 Aug 2008, Ron Mayer wrote:


I assume test_fsync in the postgres source distribution is
a decent way to see?


Not really.  It takes too long (runs too many tests you don't care about) 
and doesn't spit out the results the way you want them--TPS, not average 
time.


You can do it with pgbench (scale here really doesn't matter):

$ cat insert.sql
\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN;
INSERT INTO history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, 
:aid, :delta, CURRENT_TIMESTAMP);

END;
$ createdb pgbench
$ pgbench -i -s 20 pgbench
$ pgbench -f insert.sql -s 20 -c 1 -t 1 pgbench

Don't really need to ever rebuild that just to run more tests if all you 
care about is the fsync speed (no indexes in the history table to bloat or 
anything).


Or you can measure with sysbench; 
http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/ 
goes over that but they don't have the syntax exactly right.  Here's an
example that works:


:~/sysbench-0.4.8/bin/bin$ ./sysbench run --test=fileio 
--file-fsync-freq=1 --file-num=1 --file-total-size=16384 
--file-test-mode=rndwr


But I still am looking for any evidence that there were any widely 
shipped SATA (or even IDE drives) that were at fault, as opposed to 
filesystem bugs and poor settings of defaults.


Alan Cox claims that until circa 2001, the ATA standard didn't require 
implementing the cache flush call at all.  See 
http://www.kerneltraffic.org/kernel-traffic/kt20011015_137.html Since 
firmware is expensive to write and manufacturers are generally lazy here, 
I'd bet a lot of disks from that era were missing support for the call. 
Next time I'm digging through my disk graveyard I'll try and find such a
disk.  If he's correct that the standard changed around then, you wouldn't
expect any recent drive to not support the call.


I feel it's largely irrelevant that most drives handle things just fine 
nowadays if you send them the correct flush commands, because there are so 
many other things that can make that system as a whole not work right.
Even if the flush call works most of the time, disk firmware is turning
increasingly into buggy software, and attempts to reduce how much of that
firmware you're actually using can be viewed as helpful.


This is why I usually suggest just turning the individual drive caches 
off; the caveats for when they might work fine in this context are just 
too numerous.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Chris Kratz
On Wed, Aug 13, 2008 at 10:59 AM, Decibel! [EMAIL PROTECTED] wrote:

 On Aug 12, 2008, at 4:59 PM, Chris Kratz wrote:

 Ran into a re-occuring performance problem with some report queries again
 today.  In a nutshell, we have filters on either multiple joined tables, or
 multiple columns on a single table that are highly correlated.  So, the
 estimates come out grossly incorrect (the planner has no way to know they
 are correlated).  2000:1 for one I'm looking at right now.  Generally this
 doesn't matter, except in complex reporting queries like these when this is
 the first join of 40 other joins.  Because the estimate is wrong at the
 lowest level, it snowballs up through the rest of the joins causing the
 query to run very, very slowly.   In many of these cases, forcing nested
 loops off for the duration of the query fixes the problem.  But I have a
 couple that still are painfully slow and shouldn't be.

 I've been reading through the archives with others having similar problems
 (including myself a year ago).  Am I right in assuming that at this point
 there is still little we can do in postgres to speed up this kind of query?
  Right now the planner has no way to know the correlation between different
 columns in the same table, let alone columns in different tables.  So, it
 just assumes no correlation and returns incorrectly low estimates in cases
 like these.

 The only solution I've come up with so far is to materialize portions of
 the larger query into subqueries with these correlated filters which are
 indexed and analyzed before joining into the larger query.  This would keep
 the incorrect estimates from snowballing up through the chain of joins.

 Are there any other solutions to this problem?



 Well... you could try and convince certain members of the community that we
 actually do need some kind of a query hint mechanism... ;)

 I did make a suggestion a few months ago that involved sorting a table on
 different columns and recording the correlation of other columns. The scheme
 isn't perfect, but it would help detect cases like a field populated by a
 sequence and another field that's insert timestamp; those two fields would
 correlate highly, and you should even be able to correlate the two
 histograms; that would allow you to infer that most of the insert times for
 _id's between 100 and 200 will be between 2008-01-01 00:10 and 2008-01-01
 00:20, for example.
 --
 Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
 Give your computer some brain candy! www.distributed.net Team #1828


 Thanks for the reply,

Yes, I know hints are frowned upon around here.  Though, I'd love to have
them or something equivalent on this particular query just so the customer
can run their important reports.  As it is, it's unrunnable.

Unfortunately, I don't think the sorting idea would help in the one case
I'm looking at, which involves filters on two tables that are joined
together.  The filters happen to be correlated such that about 95% of the
rows from each filtered table are actually returned after the join.
Unfortunately, the planner thinks we will get 1 row back.

I do have to find a way to make these queries runnable.  I'll keep looking.

Thanks,

-Chris


Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Henrik


On 13 Aug 2008, at 17:13, Decibel! wrote:


On Aug 11, 2008, at 9:01 AM, Jeff wrote:

On Aug 11, 2008, at 5:17 AM, Henrik wrote:

OK, changed the SAS RAID 10 to RAID 5 and now my random writes are
handling 112 MB/sec. So it is almost twice as fast as the RAID 10
with the same disks. Any ideas why?


Are the iozone tests faulty?



does IOzone disable the os caches?
If not you need to use a size of 2xRAM for true results.

regardless - the test only took 10 seconds of wall time - which  
isn't very long at all. You'd probably want to run it longer anyway.



Additionally, you need to be careful of what size writes you're  
using. If you're doing random writes that perfectly align with the  
raid stripe size, you'll see virtually no RAID5 overhead, and you'll  
get the performance of N-1 drives, as opposed to RAID10 giving you N/ 
2.

But it still needs to do 2 reads and 2 writes for every write, correct?

I did some bonnie++ tests just to give some new more reasonable numbers.
This is with RAID10 on 4 SAS 15k drives with write-back cache.

Version 1.03b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
safecube04   32136M 73245  95 213092  16 89456  11 64923  81 219341  16 839.9   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6178  99 +++++ +++ +++++ +++  6452 100 +++++ +++ 20633  99
safecube04,32136M,73245,95,213092,16,89456,11,64923,81,219341,16,839.9,1,16,6178,99,+++++,+++,+++++,+++,6452,100,+++++,+++,20633,99









--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Alvaro Herrera
Chris Kratz wrote:

 Unfortunately, I don't think the sorting idea would help in the one case
 I'm looking at, which involves filters on two tables that are joined
 together.  The filters happen to be correlated such that about 95% of the
 rows from each filtered table are actually returned after the join.
 Unfortunately, the planner thinks we will get 1 row back.

Maybe you can wrap that part of the query in a SQL function and set its
estimated cost to the real values with ALTER FUNCTION ... ROWS.
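
For example, something along these lines (hypothetical names; the function
body would hold the part of the query with the correlated filters):

-- Sketch only: wrap the badly-estimated piece in a SQL function...
CREATE FUNCTION correlated_part()
RETURNS SETOF orders
LANGUAGE sql STABLE AS $$
    SELECT o.*
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.region = 'NW' AND c.region = 'NW';
$$;

-- ...then replace the default set-returning-function row estimate
-- (1000 rows) with something close to the real count:
ALTER FUNCTION correlated_part() ROWS 45000;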

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] autovacuum: use case for independent TOAST table autovac settings

2008-08-13 Thread Alvaro Herrera
Hackers and PG users,

Does anyone see a need for having TOAST tables be individually
configurable for autovacuum?  I've finally come around to looking at
being able to use ALTER TABLE for autovacuum settings, and I'm wondering
if we need to support that case.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Filesystem benchmarking for pg 8.3.3 server

2008-08-13 Thread Ron Mayer

Scott Marlowe wrote:

IDE came up corrupted every single time.

Greg Smith wrote:
you've drunk the kool-aid ... completely
ridiculous ...unsafe fsync ... md0 RAID-1 
array (aren't there issues with md and the barriers?)


Alright - I'll eat my words.  Or mostly.

I still haven't found IDE drives that lie; but
from the testing I've done today, I'm starting to
think that:

  1a) ext3 fsync() seems to lie badly.
  1b) but ext3 can be tricked not to lie (but not
  in the way you might think).
  2a) md raid1 fsync() sometimes doesn't actually
  sync
  2b) I can't trick it not to.
  3a) some IDE drives don't even pretend to support
  letting you know when their cache is flushed
  3b) but the kernel will happily tell you about
  any such devices; as well as including md
  raid ones.

In more detail.  I tested on a number of systems
and disks including new (this year) and old (1997)
IDE drives; and EXT3 with and without the barrier=1
mount option.


First off - some IDE drives don't even support the
relatively recent ATA command that apparently lets
the software know when a cache flush is complete.
Apparently on those you will get messages in your
system logs:
  %dmesg | grep 'disabling barriers'
  JBD: barrier-based sync failed on md1 - disabling barriers
  JBD: barrier-based sync failed on hda3 - disabling barriers
and
  %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
will not show you anything on those devices.
IMHO that's cool; and doesn't count as a lying IDE drive
since it didn't claim to support this.

Second of all - ext3 fsync() appears to me to
be *extremely* stupid.   It only seems to do the correct
flushing (and waiting) for a drive's cache to be flushed
when a file's inode has changed.
For example, in the test program below, it will happily
do a real fsync (i.e. the program takes a couple seconds
to run) so long as I have the fchmod() statements in
there.   It will *NOT* wait on my system if I comment those
fchmod()'s out. Sadly, I get the same behavior with and
without the ext3 barrier=1 mount option. :(
==
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[]) {
  if (argc<2) {
    printf("usage: fs filename\n");
    exit(1);
  }
  int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
  int i;
  for (i=0;i<100;i++) {
    char byte;
    pwrite (fd, &byte, 1, 0);
    fchmod (fd, 0644); fchmod (fd, 0664);
    fsync (fd);
  }
}
==
Since it does indeed wait when the inode's touched, I think
it suggests that it's not the hard drive that's lying, but
rather ext3.

So I take back what I said about Linux and write barriers
being sane.   They're not.

But AFAICT, all the (6 different) IDE drives I've seen work
as advertised, and the kernel happily spews boot
messages when it finds one that doesn't support knowing
when a cache flush finished.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] autovacuum: use case for independent TOAST table autovac settings

2008-08-13 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Does anyone see a need for having TOAST tables be individually
 configurable for autovacuum?  I've finally come around to looking at
 being able to use ALTER TABLE for autovacuum settings, and I'm wondering
 if we need to support that case.

It seems like we'll want to do it somehow.  Perhaps the cleanest way is
to incorporate toast-table settings in the reloptions of the parent
table.  Otherwise dump/reload is gonna be a mess.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [HACKERS] [PERFORM] autovacuum: use case for independent TOAST table autovac settings

2008-08-13 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Does anyone see a need for having TOAST tables be individually
  configurable for autovacuum?  I've finally come around to looking at
  being able to use ALTER TABLE for autovacuum settings, and I'm wondering
  if we need to support that case.
 
 It seems like we'll want to do it somehow.  Perhaps the cleanest way is
 to incorporate toast-table settings in the reloptions of the parent
 table.  Otherwise dump/reload is gonna be a mess.

Yeah, Magnus was suggesting this syntax:

ALTER TABLE foo SET toast_autovacuum_enable = false;
and the like.

My question is whether there is interest in actually having support for
this, or should we just inherit the settings from the main table.  My
gut feeling is that this may be needed in some cases, but perhaps I'm
overengineering the thing.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [HACKERS] [PERFORM] autovacuum: use case for independent TOAST table autovac settings

2008-08-13 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 It seems like we'll want to do it somehow.  Perhaps the cleanest way is
 to incorporate toast-table settings in the reloptions of the parent
 table.  Otherwise dump/reload is gonna be a mess.

 My question is whether there is interest in actually having support for
 this, or should we just inherit the settings from the main table.  My
 gut feeling is that this may be needed in some cases, but perhaps I'm
 overengineering the thing.

It seems reasonable to inherit the parent's settings by default, in any
case.  So you could do that now and then extend the feature later if
there's real demand.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Incorrect estimates on correlated filters

2008-08-13 Thread Craig Ringer
Decibel! wrote:

 Well... you could try and convince certain members of the community that
 we actually do need some kind of a query hint mechanism... ;)

It strikes me that there are really two types of query hint possible here.

One tells the planner, e.g., "prefer a merge join here".

The other gives the planner more information that it might not otherwise
have to work with, so it can improve its decisions: "The values used in
this join condition are highly correlated."

Is there anything wrong with the second approach? It shouldn't tend to
suppress planner bug reports etc. Well, not unless people use it to lie
to the planner, and I expect results from that would be iffy at best. It
just provides information to supplement Pg's existing stats system to
handle cases where it's not able to reasonably collect the required
information.

--
Craig Ringer

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance