Re: [PERFORM] perf problem with huge table

2010-02-11 Thread jesper
 Hi all,

 i am trying to move my app from M$sql to PGsql, but i need a bit of help
 :)

Apart from all the other good advice about partitioning the dataset and
such, there is another aspect to keep in mind. When you have a large
dataset and your queries become I/O-bound, tuple density is going to
hit you in 2 ways. Your dataset seems to have a natural clustering around
the time, which is also what you would use for the partitioning. That also
means that if the data is more or less clustered on disk, the tuples you
need to satisfy a query will sit on the same page or on pages close to it.
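
(If the physical ordering has drifted, a one-off re-clustering can be
forced; a minimal sketch with hypothetical table/index names, 8.3+
syntax -- note CLUSTER takes an exclusive lock and the ordering is not
maintained by later writes:

CLUSTER big_table USING big_table_time_idx;
)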

The cost of checking visibility for a tuple is to some degree a function
of the tuple size, so if you can do anything to increase the tuple
density, that will most likely benefit speed in two ways:

* You increase the likelihood that the next tuple is in the same page,
  so fetching it doesn't cost a random I/O seek.
* You increase the total number of tuples sitting in your system
  cache in the same amount of pages (memory), so they don't result in a
  random I/O seek either.

So .. if you are carrying around columns you don't really need, then
throw them away (that could be columns that can trivially be computed
from other columns), but you need to do your own testing here. To
stress the first point, there's a sample run on a fairly old desktop with
one SATA drive.

testtable has an integer id and a data column which is 486 bytes of text.
testtable2 has an integer id and an integer data column.

Both were filled with 10M tuples; PG was restarted and the OS caches
dropped (e.g. echo 3 > /proc/sys/vm/drop_caches) beforehand to
simulate a totally disk-bound system.
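
(For reference, a minimal sketch of how such a setup could be built --
the layout matches the description above, but this exact DDL is a
reconstruction, not the original script:

CREATE TABLE testtable  (id integer, data text);
CREATE TABLE testtable2 (id integer, data integer);
INSERT INTO testtable  SELECT g, repeat('x', 486)
  FROM generate_series(1, 10000000) g;
INSERT INTO testtable2 SELECT g, g
  FROM generate_series(1, 10000000) g;
)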

testdb=# select count(id) from testtable where id > 800 and id < 850;
 count
-------
    49
(1 row)

Time: 7909.464 ms
testdb=# select count(id) from testtable2 where id > 800 and id < 850;
 count
-------
    49
(1 row)

Time: 2149.509 ms

In this sample.. about 4 times faster, even though the query never
touches the data column. (On better hardware you'll most likely see
better results.)

If the columns are needed, you can push the less frequently used ones
out to a 1:1 relation. That gives you some administrative overhead, but
it lets you decide at query time whether you want to pay the extra
random seeks to access that data.
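
(A minimal sketch of such a 1:1 split, with hypothetical table and
column names, not the original poster's schema:

CREATE TABLE item      (id integer PRIMARY KEY,
                        ts timestamp, value integer);
CREATE TABLE item_wide (id integer PRIMARY KEY REFERENCES item(id),
                        description text);
-- narrow queries stay cheap; only queries that need the wide
-- column pay for the extra join/seeks:
SELECT i.ts, w.description
  FROM item i JOIN item_wide w USING (id)
 WHERE i.id = 42;
)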

You have the same picture the other way around: if your queries are
accessing data sitting in TOAST, you'll be paying double the random I/O
cost to get the tuple. So it is definitely a tradeoff that should be
made with care.
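
(Stock PG does expose a per-column storage knob, e.g.

ALTER TABLE testtable ALTER COLUMN data SET STORAGE EXTERNAL;

which allows out-of-line, uncompressed storage for that column, but the
size threshold at which tuples get toasted at all is a compile-time
constant, which is the sort of criterion the patch below plays with.)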

I've monkeypatched my own PG using this patch to toy around with criteria
to send the less frequently used data to a TOAST table.
http://article.gmane.org/gmane.comp.db.postgresql.devel.general/135158/match=

Google "vertical partitioning" for more; that is basically what this is.

(I believe this could benefit my own application, so I'm also
trying to push some interest into the area.)

-- 
Jesper


Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!

2010-02-11 Thread Bruce Momjian
Kevin Grittner wrote:
 Jesper Krogh jes...@krogh.cc wrote:
  
  Sorry if it is obvious.. but what filesystem/OS are you using and
  do you have BBU-writeback on the main data catalog also?
  
 Sorry for not providing more context.
  
 ATHENA:/var/pgsql/data # uname -a
 Linux ATHENA 2.6.16.60-0.39.3-smp #1 SMP Mon May 11 11:46:34 UTC
 2009 x86_64 x86_64 x86_64 GNU/Linux
 ATHENA:/var/pgsql/data # cat /etc/SuSE-release
 SUSE Linux Enterprise Server 10 (x86_64)
 VERSION = 10
 PATCHLEVEL = 2
  
 File system is xfs noatime,nobarrier for all data; OS is on ext3.  I
 *think* the pg_xlog mirrored pair is hanging off the same
 BBU-writeback controller as the big RAID, but I'd have to track down
 the hardware tech to confirm, and he's out today.  System has 16
 Xeon CPUs and 64 GB RAM.

I would be surprised if the RAID controller had a BBU-writeback cache. 
I don't think having xlog share a BBU-writeback makes things slower, and
if it does, I would love for someone to explain why.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



[PERFORM] Dell PERC H700/H800

2010-02-11 Thread Matthew Wakeling


Just a heads up - apparently the more recent Dell RAID controllers will no 
longer recognise hard discs that weren't sold through Dell.


http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/

As one of the comments points out, that kind of makes them no longer SATA 
or SAS compatible, and they shouldn't be allowed to use those acronyms any 
more.


Matthew

--
An optimist sees the glass as half full, a pessimist as half empty,
and an engineer as having redundant storage capacity.



Re: [PERFORM] Re: 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-11 Thread Robert Haas
On Wed, Feb 10, 2010 at 8:52 PM, Bryce Nesbitt bry...@obviously.com wrote:
 If you guys succeed in making this class of query perform, you'll have beat
 out the professional consulting firm we hired, which was all but useless!
 The query is usually slow, but particular combinations of words seem to make
 it obscenely slow.

Heh heh heh... "professional consulting firm".

 production=# EXPLAIN ANALYZE SELECT context_key FROM article_words
 WHERE word_key = 3675;
-----------------------------------------------------------------------------
  Index Scan using article_words_wc on article_words  (cost=0.00..21433.53
 rows=11309 width=4) (actual time=0.025..7.579 rows=4003 loops=1)
    Index Cond: (word_key = 3675)
  Total runtime: 11.704 ms

That's surprisingly inaccurate.  Since this table is large:

 production=# explain analyze select count(*) from article_words;
 Aggregate  (cost=263831.63..263831.64 rows=1 width=0) (actual
 time=35851.654..35851.655 rows=1 loops=1)
    ->  Seq Scan on words  (cost=0.00..229311.30 rows=13808130 width=0)
 (actual time=0.043..21281.124 rows=13808184 loops=1)
  Total runtime: 35851.723 ms

...you may need to crank up the statistics target.  I would probably
try cranking it all the way up to the max, though there is a risk that
might backfire, in which case you'll need to decrease it again.

ALTER TABLE article_words ALTER COLUMN word_key SET STATISTICS 1000;
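
(Note the new target only takes effect once the statistics are rebuilt,
so follow it with:

ANALYZE article_words;

Also, 1000 is the maximum statistics target on 8.3.)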

That's probably not going to fix your whole problem, but it should be
interesting to see whether it makes things better or worse and by how
much.

...Robert



Re: [PERFORM] perf problem with huge table

2010-02-11 Thread Leo Mannhart
Dave Crooke wrote:
 On Wed, Feb 10, 2010 at 5:30 PM, Jon Lewison jlewis...@gmail.com
 mailto:jlewis...@gmail.com wrote:
 
 
 
 Just a nit, but Oracle implements MVCC.  90% of the databases out
 there do.
 
 
 Sorry, I spoke imprecisely. What I meant was the difference in how the
 rows are stored internally: in Oracle, the main tablespace contains
 only the newest version of a row, which is (where possible) updated in
 place - queries in a transaction that can still see an older version
 have to pull it from the UNDO tablespace (rollback segments in Oracle 8
 and older).
  
 In Postgres, all versions of all rows are in the main table, and have
 validity ranges associated with them (this version of this row existed
 between transaction ids x and y). Once a version goes out of scope, it
 has to be garbage collected by the vacuuming process so the space can be
 re-used.
 
 In general, this means Oracle is faster *if* you're only doing lots of
 small transactions (consider how these different models handle an update
 to a single field in a single row), but it is more sensitive to the
 scale of transactions: doing a really big transaction against a database
 with an OLTP workload can upset Oracle's digestion as it causes a lot of
 UNDO lookups. PG's performance is a lot more predictable in this regard.
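
(A concrete illustration of the PG side, using a hypothetical table:
every UPDATE writes a complete new row version, and the old version
occupies space until vacuum reclaims it --

UPDATE accounts SET balance = balance + 1 WHERE id = 1;
-- the previous row version is now dead, but still on disk
VACUUM accounts;

under Oracle's model the same statement would instead rewrite the field
in place and log the old value to UNDO.)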
 
 Both models have benefits and drawbacks ... when designing a schema for
 performance it's important to understand these differences.
 
 
 I find partitioning pretty useful in this scenario if the data
 allows it.  Aging out data just means dropping a partition rather
 than a delete statement.
 
 
 Forgot to say this - yes, absolutely agree: dropping a table is a
 lot cheaper than a transactional delete.
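
(In PG of this era that means inheritance-based partitioning; a sketch
with hypothetical names, where aging out a month of data is a plain
DROP:

CREATE TABLE events (id integer, ts timestamp);
CREATE TABLE events_2010_01
  (CHECK (ts >= '2010-01-01' AND ts < '2010-02-01'))
  INHERITS (events);
-- later, to age the month out:
DROP TABLE events_2010_01;
)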
 
 In general, I think partitioning is more important / beneficial with
 PG's style of MVCC than with Oracle or SQL-Server (which I think is
 closer to Oracle than PG).

I would like to disagree here a little bit...

Where Oracle's table partitioning comes in very handy is, for example,
when you have to replace the data of a big (read-only) table on a
regular basis (typically data replicated from another system).
In this case, you just create a partitioned table with exactly the same
columns/indexes as the data table.

To load, you then load the data into the partitioned table, i.e.
- truncate the partitioned table, disable constraints, drop indexes
- load the data into the partitioned table
- rebuild all indexes etc. on the partitioned table

During all this time (even if it takes hours) the application can still
access the data in the data table without the bulk load interfering.

Once you have prepared the data in the partitioned table, you
- exchange the partition with the data table
which is a dictionary operation; that means the application is (if at
all) only blocked during this operation, which is in the sub-second
range.
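
(A sketch of the exchange step in Oracle syntax, with hypothetical
table/partition names -- the swap itself is metadata-only:

ALTER TABLE staging_partitioned
  EXCHANGE PARTITION p_data WITH TABLE live_table
  INCLUDING INDEXES WITHOUT VALIDATION;
)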

If you had to do this with conventional updates or deletes/inserts
instead, it might not even be possible in the given timeframe.

just as an example
Leo

p.s. just to make it a little bit clearer about the famous ORA-01555:
Oracle is not forgetting the data, as the Oracle RDBMS is of course
also ACID-compliant. The ORA-01555 can happen

- when the rollback tablespace is really too small to hold all the data
changed in the transaction (which I consider a configuration error)

- when a long-running (read) transaction needs a record which has
already been updated AND COMMITTED by another transaction. The key
here is that a second transaction has changed a record which is also
needed by the first transaction, and the second transaction committed
its work. Committing the change means the data in the rollback segment
is no longer needed, as it can be read directly from the data block
(after all, it is committed, and that means valid and visible to other
transactions). If the first transaction now tries to read the data from
the rollback segment to see the unchanged state, it will still succeed
(it is still there; nothing has happened to the rollback segment so
far). The problem of the ORA-01555 shows up only if a third transaction
now needs space in the rollback segment. As the entry from the
first/second transaction is marked committed (and therefore no longer
needed), it is perfectly valid for transaction #3 to grab this rollback
segment and store its old value there. If THEN (and only then)
transaction #1 comes back, asking for the old, unchanged value as of
when the transaction started, THEN the famous ORA-01555 is raised, as
this value has now been overwritten by transaction #3.
That's why in newer versions you have to set the retention time of the
rollback blocks/segments to a value bigger than your expected longest
transaction. This will decrease the likelihood of the ORA-01555
drastically (but it is still not zero, as you could easily construct an
example where it still fails with ORA-01555, since a transaction can
still run longer than you ...

Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!

2010-02-11 Thread Scott Marlowe
On Thu, Feb 11, 2010 at 4:29 AM, Bruce Momjian br...@momjian.us wrote:
 Kevin Grittner wrote:
 Jesper Krogh jes...@krogh.cc wrote:

  Sorry if it is obvious.. but what filesystem/OS are you using and
  do you have BBU-writeback on the main data catalog also?

 Sorry for not providing more context.

 ATHENA:/var/pgsql/data # uname -a
 Linux ATHENA 2.6.16.60-0.39.3-smp #1 SMP Mon May 11 11:46:34 UTC
 2009 x86_64 x86_64 x86_64 GNU/Linux
 ATHENA:/var/pgsql/data # cat /etc/SuSE-release
 SUSE Linux Enterprise Server 10 (x86_64)
 VERSION = 10
 PATCHLEVEL = 2

 File system is xfs noatime,nobarrier for all data; OS is on ext3.  I
 *think* the pg_xlog mirrored pair is hanging off the same
 BBU-writeback controller as the big RAID, but I'd have to track down
 the hardware tech to confirm, and he's out today.  System has 16
 Xeon CPUs and 64 GB RAM.

 I would be surprised if the RAID controller had a BBU-writeback cache.
 I don't think having xlog share a BBU-writeback makes things slower, and
 if it does, I would love for someone to explain why.

I believe in the past when this discussion came up, the biggest
difference was mainly attributed to WAL and data being on the same file
system (rather than to pg_xlog being on separate spindles as such).  I
recall there being a noticeable performance gain even from having two
file systems on the same logical RAID device.



Re: [PERFORM] Dell PERC H700/H800

2010-02-11 Thread Joshua D. Drake
On Thu, 2010-02-11 at 12:39 +, Matthew Wakeling wrote:
 Just a heads up - apparently the more recent Dell RAID controllers will no 
 longer recognise hard discs that weren't sold through Dell.
 
 http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/
 
 As one of the comments points out, that kind of makes them no longer SATA 
 or SAS compatible, and they shouldn't be allowed to use those acronyms any 
 more.

That's interesting. I know that IBM, at least on some of their models,
has done the same. Glad I use HP :)

Joshua D. Drake

 
 Matthew
 
 -- 
  An optimist sees the glass as half full, a pessimist as half empty,
  and an engineer as having redundant storage capacity.
 


-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use of
Mr. or Sir.




Re: [PERFORM] Dell PERC H700/H800

2010-02-11 Thread Scott Marlowe
On Thu, Feb 11, 2010 at 5:39 AM, Matthew Wakeling matt...@flymine.org wrote:

 Just a heads up - apparently the more recent Dell RAID controllers will no
 longer recognise hard discs that weren't sold through Dell.

 http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/

 As one of the comments points out, that kind of makes them no longer SATA or
 SAS compatible, and they shouldn't be allowed to use those acronyms any
 more.

Yet one more reason I'm glad I no longer source servers from Dell.

I just ask my guy at Aberdeen if he thinks drive X is a good choice,
we discuss it like adults and I make my decision.  And I generally
listen to him because he's usually right.  But I'd spit nails if my
RAID controller refused to work with whatever drives I decided to plug
into it.



Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!

2010-02-11 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Jesper Krogh jes...@krogh.cc wrote:
  
 Sorry if it is obvious.. but what filesystem/OS are you using and
 do you have BBU-writeback on the main data catalog also?
 
 File system is xfs noatime,nobarrier for all data; OS is on ext3. 
 I *think* the pg_xlog mirrored pair is hanging off the same
 BBU-writeback controller as the big RAID, but I'd have to track
 down the hardware tech to confirm
 
Another example of why I shouldn't trust my memory.  Per the
hardware tech:
 
 
OS:  /dev/sda   is RAID1  -  2  x  2.5" 15k SAS disk
pg_xlog: /dev/sdb   is RAID1  -  2  x  2.5" 15k SAS disk
 
These reside on a ServeRAID-MR10k controller with 256MB BB cache.
 
 
data:/dev/sdc   is RAID5  -  30 x 3.5" 15k SAS disk
 
These reside on the DS3200 disk subsystem with 512MB BB cache per
controller and redundant drive loops.
 
 
At least I had the file systems and options right.  ;-)
 
-Kevin



Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!

2010-02-11 Thread Alvaro Herrera
Kevin Grittner wrote:

 Another example of why I shouldn't trust my memory.  Per the
 hardware tech:
  
  
 OS:  /dev/sda   is RAID1  -  2  x  2.5" 15k SAS disk
 pg_xlog: /dev/sdb   is RAID1  -  2  x  2.5" 15k SAS disk
  
 These reside on a ServeRAID-MR10k controller with 256MB BB cache.
  
  
 data:/dev/sdc   is RAID5  -  30 x 3.5" 15k SAS disk
  
 These reside on the DS3200 disk subsystem with 512MB BB cache per
 controller and redundant drive loops.

Hmm, so maybe the performance benefit is not from it being on a separate
array, but from it being RAID1 instead of RAID5?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!

2010-02-11 Thread Aidan Van Dyk
* Alvaro Herrera alvhe...@commandprompt.com [100211 12:58]:
 Hmm, so maybe the performance benefit is not from it being on a separate
 array, but from it being RAID1 instead of RAID5?

Or the cumulative effects of:
1) Dedicated spindles/RAID1
2) More BBU cache available (I can't imagine the OS pair writing much)
3) Not being queued behind data writes before getting to the controller
4) Not waiting for BBU cache to be available (which is shared with all
   data writes), which requires RAID5 writes to complete...

Really, there are *lots* of variables here.  The basics being that WAL
on the same FS as the data, on a RAID5, even with BBU, is worse than
WAL on a dedicated set of RAID1 spindles with its own BBU.

Wow!

;-)


-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!

2010-02-11 Thread Kevin Grittner
Aidan Van Dyk ai...@highrise.ca wrote:
 Alvaro Herrera alvhe...@commandprompt.com wrote:
 Hmm, so maybe the performance benefit is not from it being on a
 separate array, but from it being RAID1 instead of RAID5?
 
 Or the cumulative effects of:
 1) Dedicated spindles/RAID1
 2) More BBU cache available (I can't imagine the OS pair writing
much)
 3) Not being queued behind data writes before getting to the
controller
 4) Not waiting for BBU cache to be available (which is shared with
all data writes), which requires RAID5 writes to complete...
 
 Really, there are *lots* of variables here.  The basics being that
 WAL on the same FS as data, on a RAID5, even with BBU is worse
 than WAL on a dedicated set of RAID1 spindles with its own BBU.
 
 Wow!
 
Sure, OK, but what surprised me was that a set of 15 read-only
queries (with pretty random reads) took almost twice as long when
the WAL files were on the same file system.  That's with OS writes
being only about 10% of reads, and *that's* with 128 GB of RAM which
keeps a lot of the reads from having to go to the disk.  I would not
have expected that a read-mostly environment like this would be that
sensitive to the WAL file placement.  (OK, I *did* request the
separate file system for them anyway, but I thought it was going to
be a marginal benefit, not something this big.)
 
-Kevin



Re: [PERFORM] Dell PERC H700/H800

2010-02-11 Thread Łukasz Jagiełło
2010/2/11 James Mansion ja...@mansionfamily.plus.com:
 Matthew Wakeling wrote:

 Just a heads up - apparently the more recent Dell RAID controllers will no
 longer recognise hard discs that weren't sold through Dell.


 http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/

 As one of the comments points out, that kind of makes them no longer SATA
 or SAS compatible, and they shouldn't be allowed to use those acronyms any
 more.

 I think that's potentially FUD.  It's all about 'Dell qualified drives'.  I
 can't see anything that suggests that Dell will OEM drives and somehow tag
 them so that the drive must have come from them.  Of course they are big
 enough that they could have special BIOS I guess, but I read it that the
 drive types (and presumably revisions thereof) had to be recognised by the
 controller from a list, which presumably can be reflashed, which is not
 quite saying that if some WD enterprise drive model is 'qualified' then you
 have to buy it from Dell..

 Do you have any further detail?

For example: the SAMSUNG MCCOE50G, a 50GB SSD which you can buy only
from Dell. It's unknown on Samsung's pages. I think they can easily
order their own models.

-- 
Łukasz Jagiełło
System Administrator
G-Forces Web Management Polska sp. z o.o. (www.gforces.pl)

Ul. Kruczkowskiego 12, 80-288 Gdańsk
Company registered in the KRS under no. 246596 by decision of the District Court Gdańsk-Północ



Re: [PERFORM] Dell PERC H700/H800

2010-02-11 Thread Scott Marlowe
On Thu, Feb 11, 2010 at 1:11 PM, James Mansion
ja...@mansionfamily.plus.com wrote:
 Matthew Wakeling wrote:

 Just a heads up - apparently the more recent Dell RAID controllers will no
 longer recognise hard discs that weren't sold through Dell.


 http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/

 As one of the comments points out, that kind of makes them no longer SATA
 or SAS compatible, and they shouldn't be allowed to use those acronyms any
 more.

 Matthew

 I think that's potentially FUD.  It's all about 'Dell qualified drives'.  I
 can't see anything that suggests that Dell will OEM drives and somehow tag
 them so that the drive must have come from them.  Of course they are big
 enough that they could have special BIOS I guess, but I read it that the
 drive types (and presumably revisions thereof) had to be recognised by the
 controller from a list, which presumably can be reflashed, which is not
 quite saying that if some WD enterprise drive model is 'qualified' then you
 have to buy it from Dell..

 Do you have any further detail?

In the post to the Dell mailing list (
http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041335.html
) it was pointed out that the user had installed Seagate ES.2 drives,
which are enterprise-class drives that have been around a while, are
kind of the standard enterprise-class SATA drives, and are listed as
such by Seagate:

http://www.seagate.com/www/en-us/products/servers/barracuda_es/barracuda_es.2

These drives were marked as BLOCKED and unusable by the system.

The PDF linked to in the Dell forum specifically states that the hard
drives are loaded with a Dell-specific firmware.  The PDF seems
otherwise free of useful information, and is mostly a marketing tool
as near as I can tell.



Re: [PERFORM] Dell PERC H700/H800

2010-02-11 Thread James Mansion

Matthew Wakeling wrote:


Just a heads up - apparently the more recent Dell RAID controllers 
will no longer recognise hard discs that weren't sold through Dell.


http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/ 



As one of the comments points out, that kind of makes them no longer 
SATA or SAS compatible, and they shouldn't be allowed to use those 
acronyms any more.


Matthew

I think that's potentially FUD.  It's all about 'Dell qualified drives'.
I can't see anything that suggests that Dell will OEM drives and somehow
tag them so that the drive must have come from them.  Of course they are
big enough that they could have a special BIOS, I guess, but I read it
as saying that the drive types (and presumably revisions thereof) have
to be recognised by the controller from a list, which presumably can be
reflashed -- which is not quite saying that if some WD enterprise drive
model is 'qualified' then you have to buy it from Dell.


Do you have any further detail?

