Re: [PERFORM] Practical upper limits of pgbench read/write tps with 8.3

2008-07-07 Thread Jeffrey Baker
On Mon, Jul 7, 2008 at 3:22 PM, Greg Smith <[EMAIL PROTECTED]> wrote:
> On Mon, 7 Jul 2008, Jeffrey Baker wrote:
>
>> On the single 2.2GHz Athlon, the maximum tps seems to be 1450...what's the
>> bottleneck?  Is PG lock-bound?
>
> It can become lock-bound if you don't make the database scale significantly
> larger than the number of clients, but that's probably not your problem.
>  The pgbench client driver program itself is pretty CPU intensive and can
> suffer badly from kernel issues.  I am unsurprised you can only hit 1450
> with a single CPU.  On systems with multiple CPUs where the single CPU
> running the pgbench client is much faster than your 2.2GHz Athlon, you'd
> probably be able to get a few thousand TPS, but eventually the context
> switching of the client itself can become a bottleneck.

On a 2GHz Core 2 Duo the best tps achieved is 2300, with -c 8.
pgbench itself gets around 10% of the machine's CPU: user + sys for
pgbench is 7s, and 35s of wall clock time across both cores is 70
available CPU-seconds, thus 10%.
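
For anyone who wants to repeat that arithmetic, a minimal sketch in
Python (the 7s / 35s / 2-core figures are just this run's numbers):

    # Back-of-the-envelope share of the machine's CPU consumed by the pgbench
    # client: user+sys divided by the CPU-seconds the box could have delivered.
    def client_cpu_share(user_plus_sys_s, wall_s, cores):
        available = wall_s * cores            # total CPU-seconds in the window
        return user_plus_sys_s / available

    print(client_cpu_share(7.0, 35.0, 2))     # 0.1 -> pgbench used ~10% of the box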

I suppose you could still blame it on ctxsw between pgbench and pg
itself, but the results are not better with pgbench on another machine
cross-connected with gigabit ethernet.

-jwb



Re: [PERFORM] Practical upper limits of pgbench read/write tps with 8.3

2008-07-07 Thread Greg Smith

On Mon, 7 Jul 2008, Jeffrey Baker wrote:

On the single 2.2GHz Athlon, the maximum tps seems to be 1450...what's 
the bottleneck?  Is PG lock-bound?


It can become lock-bound if you don't make the database scale 
significantly larger than the number of clients, but that's probably not 
your problem.  The pgbench client driver program itself is pretty CPU 
intensive and can suffer badly from kernel issues.  I am unsurprised you 
can only hit 1450 with a single CPU.  On systems with multiple CPUs where 
the single CPU running the pgbench client is much faster than your 2.2GHz 
Athlon, you'd probably be able to get a few thousand TPS, but eventually 
the context switching of the client itself can become a bottleneck.


Running pgbench against a RAM disk is a good way to find out where the
system bottlenecks without any disk I/O involved; you might try that
test on your larger server when you get a chance.  One interesting thing
to watch, which you may not have tried yet, is running top and seeing how
close pgbench itself is to saturating a single CPU.  If you've got 4 CPUs
and the pgbench client program shows 25% utilization, it is now the
bottleneck rather than whatever you thought you were measuring.  I thought
this might be the case in the last test results you reported on Friday but
didn't have a chance to comment on it until now.
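
If watching top is inconvenient, here is a rough sketch (Python,
Linux-only, reads /proc; pass the pid of the running pgbench process)
that reports how much of a single core the client is using:

    # Sample a process's utime+stime from /proc/<pid>/stat and report the
    # fraction of ONE core it used over the interval.  pgbench is a single
    # process, so a value near 1.0 means the client is the bottleneck.
    import os, sys, time

    def cpu_ticks(pid):
        with open("/proc/%d/stat" % pid) as f:
            fields = f.read().split()
        return int(fields[13]) + int(fields[14])      # utime + stime

    def one_core_share(pid, interval=5.0):
        hz = os.sysconf("SC_CLK_TCK")
        before = cpu_ticks(pid)
        time.sleep(interval)
        return (cpu_ticks(pid) - before) / float(hz) / interval

    if __name__ == "__main__":
        print("%.0f%% of one core" % (100 * one_core_share(int(sys.argv[1]))))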


One thing you can try here is running pgbench itself on a different server
from the one hosting the database, but that seems to top out at a few
thousand TPS as well; it may get higher than you've been seeing, though.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



[PERFORM] Practical upper limits of pgbench read/write tps with 8.3

2008-07-07 Thread Jeffrey Baker
I'm spending a third day testing with the ioDrive, and it occurred to
me that I should normalize my tests by mounting the database on a
ramdisk.  The results were surprisingly low.  On the single 2.2GHz
Athlon, the maximum tps seems to be 1450.  This is achieved with a
single connection.  I/O rates to and from the ramdisk never exceed
50MB/s on a one-minute average.

With the flash device on the same benchmark, the tps rate is 1350,
meaning that as far as PostgreSQL is concerned, on this machine, the
flash device achieves about 93% of the best possible performance.

Question being: what's the bottleneck?  Is PG lock-bound?
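
One quick way to check the lock question during a run is to poll pg_locks
for ungranted locks.  A hedged sketch, assuming the psycopg2 driver and a
placeholder connection string:

    # Count backends waiting on ungranted locks once a second while pgbench runs.
    import time
    import psycopg2

    conn = psycopg2.connect("dbname=pgbench")   # placeholder DSN, adjust to taste
    conn.set_isolation_level(0)                 # autocommit; don't hold a snapshot open
    cur = conn.cursor()

    for _ in range(30):                         # sample for ~30 seconds
        cur.execute("SELECT count(*) FROM pg_locks WHERE NOT granted")
        print("backends waiting on locks: %d" % cur.fetchone()[0])
        time.sleep(1)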

-jwb



Re: [PERFORM] Fusion-io ioDrive

2008-07-07 Thread PFC


PFC, I have to say these kinds of posts make me a fan of yours.  I've
read many of your storage-related replies and have found them all very
educational.  I just want to let you know I found your assessment of the  
impact of Flash storage perfectly-worded and unbelievably insightful.  
Thanks a million for sharing your knowledge with the list. -Dan


Hehe, thanks.

	There was a time when you had to be a big company full of cash to build a  
computer, and then suddenly people did it in garages, like Wozniak and
Jobs, out of off-the-shelf parts.


	I feel the ioDrive guys are the same kind of hackers, except today's  
hackers have much more powerful tools. Perhaps, and I hope it's true,  
storage is about to undergo a revolution like the personal computer had  
20-30 years ago, when the IBMs of the time were eaten from the roots up.


	IMHO the key is that you can build an ioDrive from off-the-shelf parts,
but you can't do that with a disk drive.
	Flash manufacturers are smelling blood: they already profit from USB keys
and digicams, but imagine the market for solid state drives!
	And in this case the hardware is simple: flash, RAM, an FPGA, a few
support chips, nothing out of the ordinary. It is the brain juice in the
software (which includes the FPGA programming) that will sort the
high-performance, high-reliability winners from the rest.


	Lowering the barrier of entry is good for innovation. I believe Linux  
will benefit, too, since the target is (for now) high-performance servers,  
and as shown by the ioDrive, innovative hackers prefer to write Linux
drivers rather than Vista (argh) drivers.




Re: [PERFORM] Fusion-io ioDrive

2008-07-07 Thread Jeffrey Baker
On Mon, Jul 7, 2008 at 6:08 AM, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <[EMAIL PROTECTED]> wrote:
>>>                             Service Time Percentile, millis
>>>          R/W TPS   R-O TPS     50th     80th     90th     95th
>>> RAID         182       673       18       32       42       64
>>> Fusion       971      4792        8        9       10       11
>>
>> Someone asked for bonnie++ output:
>>
>> Block output: 495MB/s, 81% CPU
>> Block input: 676MB/s, 93% CPU
>> Block rewrite: 262MB/s, 59% CPU
>>
>> Pretty respectable.  In the same ballpark as an HP MSA70 + P800 with
>> 25 spindles.
>
> You left off the 'seeks' portion of the bonnie++ results -- this is
> actually the most important portion of the test.  Based on your tps
> #s, I'm expecting seeks equivalent to about 10 10k drives configured in
> a raid 10, or around 1000-1500.  They didn't publish any prices so
> it's hard to say if this is 'cost competitive'.

I left it out because bonnie++ reports it as "+++++", i.e. the seek rate
was too fast for bonnie++ to measure reliably.

-jwb



Re: [PERFORM] Fusion-io ioDrive

2008-07-07 Thread Jonah H. Harris
On Mon, Jul 7, 2008 at 9:23 AM, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> I have a lot of problems with your statements.  First of all, we are
> not really talking about 'RAM' storage...I think your comments would
> be more on point if we were talking about mounting database storage
> directly from the server memory, for example.  Server memory and cpu are
> involved only to the extent that the o/s uses them for caching,
> filesystem bookkeeping, and the device driver.

I'm not sure how those cards work, but my guess is that the CPU will
go 100% busy (with a near-zero I/O wait) on any sizable workload.  In
this case, the pgbench configuration being used is quite small and
probably won't exhibit this behavior.

> Also, your comments seem to indicate that having a slower device leads
> to higher concurrency because it allows the process to yield and do
> other things.  This is IMO simply false.

Argue all you want, but this is a fairly well known (20+ year-old) behavior.

> With faster storage, cpu load will increase, but only because overall
> system throughput increases and, with it, the amount of cpu/memory work
> being done.

Again, I said that response times (throughput) would improve.  I'd
like to hear your explanation of how you can handle more CPU-only
operations when 0% of the CPU is free for use.

> Presumably, as storage approaches the speed of main system memory,
> the algorithms for dealing with it will become simpler (no more
> acrobatics to try to make everything sequential) and thus faster.

We'll have to see.

> I also find the remarks about software 'optimizing' for strict hardware
> (L1+L2 cache) assumptions a little suspicious.  In some old programs I
> remember keeping a giant C 'union' of critical structures that was
> exactly 8k to fit in the 486 cpu cache.  In modern terms I think that
> type of programming (sans some specialized environments) is usually
> counter-productive...I think PostgreSQL's approach of deferring as
> much work as possible to the o/s is a great approach.

All of the major database vendors still see an immense value in
optimizing their algorithms and memory structures for specific
platforms and CPU caches.  Hence, if they're *paying* money for
very-specialized industry professionals to optimize in this way, I
would hesitate to say there isn't any value in it.  That said, Postgres
doesn't have those low-level engineering resources, so for the most part
I have to agree that it has to rely on the OS.

-Jonah



Re: [PERFORM] Fusion-io ioDrive

2008-07-07 Thread PFC




*) is the flash random write problem going to be solved in hardware or
specialized solid state write caching techniques.   At least
currently, it seems like software is filling the role.


	Those flash chips are page-based, not unlike a harddisk, i.e. you cannot
erase and write a byte, you must erase and write a full page. The size of
said page depends on the chip implementation. I don't know which chips
they used, so I cannot comment there, but you can easily imagine that
smaller pages yield faster random write throughput. For reads, you must
first select a page and then access it. Thus, it is not like RAM at all.
It is much more similar to a harddisk with an almost zero seek time (on
reads) and a very small, but significant, seek time (on writes), because
a page must be erased before being written.


	Big flash chips include ECC inside to improve reliability. Basically the  
chips include a small static RAM buffer. When you want to read a page it  
is first copied to SRAM and ECC checked. When you want to write a page you  
first write it to SRAM and then order the chip to write it to flash.


	Usually you can't erase a single page; you must erase a whole block
containing many pages (this is probably why most flash SSDs suck at
random writes).
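
To make that cost asymmetry concrete, a toy model of what a naive
rewrite-in-place costs once the whole-block erase is accounted for (all
the latency numbers below are invented, purely illustrative):

    # Toy NAND cost model: reads hit a page directly, but rewriting one page
    # in place means saving the block's still-valid pages, erasing the whole
    # block, and reprogramming every page.
    READ_PAGE_US    = 25        # assumed
    PROGRAM_PAGE_US = 200       # assumed
    ERASE_BLOCK_US  = 1500      # assumed; one erase block = many pages

    def naive_rewrite_cost_us(pages_per_block, pages_rewritten=1):
        valid = pages_per_block - pages_rewritten
        return (valid * READ_PAGE_US                  # read back pages to keep
                + ERASE_BLOCK_US                      # erase the whole block
                + pages_per_block * PROGRAM_PAGE_US)  # reprogram everything

    print(naive_rewrite_cost_us(64))   # ~16 ms just to rewrite a single page naively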


	NAND flash will never replace SDRAM because of these restrictions (NOR  
flash acts like RAM but it is slow and has less capacity).

However NAND flash is well suited to replace harddisks.

	When writing a page you write it to the small static RAM buffer on the  
chip (fast) and tell the chip to write it to flash (slow). When the chip  
is busy erasing or writing you can not do anything with it, but you can  
still talk to the other chips. Since the ioDrive has many chips I'd bet  
they use this feature.


	I don't know about the ioDrive implementation, but you can see that the
paging and erasing requirements mean some tricks have to be applied, and
the thing will probably need some smart buffering in RAM in order to be
fast. Since the data in flash doesn't need to be laid out sequentially
(read seek time being close to zero), it is possible they use a scheme
which makes all writes sequential (for instance), which would suit the
block-erasing requirements very well, with the block-mapping information
kept in RAM; or perhaps they use some form of copy-on-write. It would be
interesting to dissect this algorithm, especially the part which stores
the block mappings permanently, since they cannot live in a fixed, known
sector, which would wear out pretty quickly.
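
A minimal sketch of the "make all writes sequential" idea, just to
illustrate the remapping; this is a generic log-structured toy, not the
ioDrive's actual (unknown) algorithm, and garbage collection and wear
leveling are left out entirely:

    # Append-only writes over erase blocks, with an in-RAM map from logical
    # page to physical location; overwrites never touch the old copy.
    class TinyFTL:
        def __init__(self, n_blocks, pages_per_block):
            self.pages_per_block = pages_per_block
            self.flash = [[None] * pages_per_block for _ in range(n_blocks)]
            self.mapping = {}          # logical page -> (block, page)
            self.head = (0, 0)         # next free physical slot

        def write(self, logical_page, data):
            blk, pg = self.head
            self.flash[blk][pg] = data                # always a sequential program
            self.mapping[logical_page] = (blk, pg)    # old copy becomes garbage
            pg += 1
            if pg == self.pages_per_block:
                blk, pg = blk + 1, 0                  # real device: pick an erased block
            self.head = (blk, pg)

        def read(self, logical_page):
            blk, pg = self.mapping[logical_page]
            return self.flash[blk][pg]

    ftl = TinyFTL(n_blocks=4, pages_per_block=8)
    ftl.write(42, "hello")
    ftl.write(42, "hello again")       # the overwrite lands in a fresh page
    print(ftl.read(42))                # -> hello again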


	Ergo, in order to benchmark this thing and get relevant results, I would
tend to think that you'd need to fill it to, say, 80% of capacity and
bombard it with small random writes, with the total amount of data written
being many times the capacity of the drive, in order to stress the
remapping algorithms, which are the weak point of such a device.
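
A rough sketch of such a torture workload (paths, device size and write
size are placeholders; a real test would also pre-write the fill data and
use O_DIRECT or similar to keep the OS page cache out of the way):

    # Fill a file to ~80% of the device, then issue block-aligned random
    # overwrites totalling several times the device capacity, fsyncing so
    # the remapping layer has to absorb every single write.
    import os, random

    PATH        = "/mnt/iodrive/fill.dat"   # placeholder mount point
    DEVICE_SIZE = 80 * 1024**3              # assumed 80GB device
    FILL        = int(DEVICE_SIZE * 0.8)
    IO_SIZE     = 8192                      # PostgreSQL-sized blocks
    TOTAL       = 3 * DEVICE_SIZE           # several times the capacity

    fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, FILL)                  # sparse here; pre-write it for real runs
    block = os.urandom(IO_SIZE)
    written = 0
    while written < TOTAL:
        offset = random.randrange(0, FILL - IO_SIZE)
        offset -= offset % IO_SIZE          # keep writes block-aligned
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, block)
        os.fsync(fd)
        written += IO_SIZE
    os.close(fd)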



*) do the software solutions really work (unproven)
*) when are the major hardware vendors going to get involved.  they
make a lot of money selling disks and supporting hardware (san, etc).


	Looking at the pictures of the "drive", I see a bunch of flash chips,
which probably make up the bulk of the cost, a switching power supply, a
small BGA chip which is probably DDR memory for buffering, and the mystery
ASIC, which is probably an FPGA; I would guess a Virtex-4 from the shape
of the package seen from the side in one of the pictures.


	A team of talented engineers can design and produce such a board, and  
assembly would only use standard PCB processes. This is unlike harddisks,  
which need a huge investment and a specialized factory because of the  
complex mechanical parts and very tight tolerances. In the case of the  
ioDrive, most of the value is in the intellectual property: software on
the PC CPU (driver), embedded software, and programming the FPGA.


	All this points to a very different economic model for storage. I could  
design and build a scaled down version of the ioDrive in my "garage", for  
instance (well, the PCI Express licensing fees are hefty, so I'd use PCI,  
but you get the idea).


	This means I think we are about to see a flood of these devices coming  
from many small companies. This is very good for the end user, because  
there will be competition, natural selection, and fast evolution.


Interesting times ahead!


I'm not particularly enamored of having a storage device be stuck
directly in a pci slot -- although I understand it's probably
necessary in the short term as flash changes all the rules and you
can't expect it to run well using mainstream hardware raid
controllers.  By using their own device they have complete control of
the i/o stack up to the o/s driver level.


	Well, SATA is great for harddisks: small cables, less clutter, less
failure-prone than 80-conductor cables, faster, cheaper, etc...


	Basically serial LVDS (low vol

Re: [PERFORM] Fusion-io ioDrive

2008-07-07 Thread Merlin Moncure
On Wed, Jul 2, 2008 at 7:41 AM, Jonah H. Harris <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <[EMAIL PROTECTED]> wrote:
>> Basically the ioDrive is smoking the RAID.  The only real problem with
>> this benchmark is that the machine became CPU-limited rather quickly.
>
> That's traditionally the problem with everything being in memory.
> Unless the database algorithms are designed to exploit L1/L2 cache and
> RAM, which is not the case for a disk-based DBMS, you generally lose
> some concurrency due to the additional CPU overhead of playing only
> with memory.  This is generally acceptable if you're going to trade
> off higher concurrency for faster service times.  And, it isn't only
> evidenced in single systems where a disk-based DBMS is 100% cached,
> but also in most shared-memory clustering architectures.
>
> In most cases, when you're waiting on disk I/O, you can generally
> support higher concurrency because the OS can utilize the CPU's free
> cycles (during the wait) to handle other users.  In short, sometimes,
> disk I/O is a good thing; it just depends on what you need.

I have a lot of problems with your statements.  First of all, we are
not really talking about 'RAM' storage...I think your comments would
be more on point if we were talking about mounting database storage
directly from the server memory, for example.  Server memory and cpu are
involved only to the extent that the o/s uses them for caching,
filesystem bookkeeping, and the device driver.

Also, your comments seem to indicate that having a slower device leads
to higher concurrency because it allows the process to yield and do
other things.  This is IMO simply false.  With faster storage, cpu
load will increase, but only because overall system throughput
increases and, with it, the amount of cpu/memory work being done.
Presumably, as storage approaches the speed of main system memory,
the algorithms for dealing with it will become simpler (no more
acrobatics to try to make everything sequential) and thus faster.

I also find the remarks about software 'optimizing' for strict hardware
(L1+L2 cache) assumptions a little suspicious.  In some old programs I
remember keeping a giant C 'union' of critical structures that was
exactly 8k to fit in the 486 cpu cache.  In modern terms I think that
type of programming (sans some specialized environments) is usually
counter-productive...I think PostgreSQL's approach of deferring as
much work as possible to the o/s is a great approach.

merlin



Re: [PERFORM] Fusion-io ioDrive

2008-07-07 Thread Merlin Moncure
On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <[EMAIL PROTECTED]> wrote:
>>                             Service Time Percentile, millis
>>          R/W TPS   R-O TPS     50th     80th     90th     95th
>> RAID         182       673       18       32       42       64
>> Fusion       971      4792        8        9       10       11
>
> Someone asked for bonnie++ output:
>
> Block output: 495MB/s, 81% CPU
> Block input: 676MB/s, 93% CPU
> Block rewrite: 262MB/s, 59% CPU
>
> Pretty respectable.  In the same ballpark as an HP MSA70 + P800 with
> 25 spindles.

You left off the 'seeks' portion of the bonnie++ results -- this is
actually the most important portion of the test.  Based on your tps
#s, I'm expecting seeks equivalent to about 10 10k drives configured in
a raid 10, or around 1000-1500.  They didn't publish any prices so
it's hard to say if this is 'cost competitive'.
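
For the curious, the arithmetic behind that spindle comparison, as a rough
sketch (the per-drive seek rate is an assumed ballpark figure, not a
measurement):

    # Estimate random IOPS for N 10k-rpm drives in RAID 10, 50/50 read/write mix.
    SEEKS_PER_10K_DRIVE = 130              # assumed ~110-140 for a 10k drive

    def raid10_random_iops(n_drives, read_fraction=0.5):
        reads  = n_drives * SEEKS_PER_10K_DRIVE         # every spindle serves reads
        writes = (n_drives // 2) * SEEKS_PER_10K_DRIVE  # each write hits both mirrors
        return read_fraction * reads + (1 - read_fraction) * writes

    print(raid10_random_iops(10))          # ~975, i.e. the 1000-1500 ballpark above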

These numbers are indeed fantastic, disruptive even.  If I were testing
the device for consideration in high-duty server environments, I would
be doing durability testing right now...I would be slamming the database
with transactions (fsync on, etc.) and then powering off the device.  I
would do this several times, making sure the software layer isn't doing
some mojo that is technically cheating.
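
A hedged sketch of that pull-the-plug test (assumes the psycopg2 driver, a
throwaway database with a table t(n bigint) created beforehand, and an
acked.log file kept on a separate, trusted disk):

    # Commit an incrementing counter with fsync on, log each value the database
    # acknowledged, and after the power cycle verify every logged value is still
    # in t.  Any logged-but-missing value means the stack lied about durability.
    import os
    import psycopg2

    conn = psycopg2.connect("dbname=crashtest")   # placeholder DSN
    cur = conn.cursor()

    log = open("acked.log", "a")
    n = 0
    while True:                                   # run until you cut the power
        n += 1
        cur.execute("INSERT INTO t VALUES (%s)", (n,))
        conn.commit()                             # returns only after the WAL fsync
        log.write("%d\n" % n)
        log.flush()
        os.fsync(log.fileno())                    # the log has to survive too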

I'm not particularly enamored of having a storage device be stuck
directly in a pci slot -- although I understand it's probably
necessary in the short term as flash changes all the rules and you
can't expect it to run well using mainstream hardware raid
controllers.  By using their own device they have complete control of
the i/o stack up to the o/s driver level.

I've been thinking for a while now that flash is getting ready to
explode into use in server environments.  The outstanding questions I
see are:
*) is the write endurance problem truly solved (giving at least a 5-10
year lifetime)
*) what are the true odds of catastrophic device failure (industry
claims less, we'll see)
*) is the flash random write problem going to be solved in hardware or
specialized solid state write caching techniques.   At least
currently, it seems like software is filling the role.
*) do the software solutions really work (unproven)
*) when are the major hardware vendors going to get involved.  they
make a lot of money selling disks and supporting hardware (san, etc).

merlin



Re: [PERFORM] How much work_mem to configure...

2008-07-07 Thread Bill Moran
In response to "Scott Marlowe" <[EMAIL PROTECTED]>:

> On Sat, Jul 5, 2008 at 5:24 AM, Jessica Richard <[EMAIL PROTECTED]> wrote:
> > How can I tell if my work_mem configuration is enough to support all
> > Postgres user activities on the server I am managing?
> >
> > Where do I find an indication that the number is lower than needed?
> 
> You kinda have to do some math with fudge factors involved.  As
> work_mem gets smaller, sorts spill over to disk and get slower, and
> hash aggregates and hash joins get avoided because they need to fit in
> memory.
> 
> As you increase work_mem, sorts can start happening in memory (or with
> less disk writing), and larger and larger sets can be handled by hash
> aggregates and hash joins because they fit in memory.
> 
> But there's a dark side to work_mem being too large, and that is that
> you can run your machine out of free memory with too many large sorts
> happening, and then the machine will slow to a crawl as it swaps out
> the very thing you're trying to do in memory.
> 
> So, I tend to plan for about 1/4 of memory used for shared_buffers,
> and up to 1/4 used for sorts, so there's plenty of head room left for the
> OS to cache files, which is also important for performance.  If you plan
> on having 20 users accessing the database at once, then you figure
> each one might on average run a query with 2 sorts, and that you'll be
> using a maximum of 20*2*work_mem for those sorts etc...
> 
> If it's set to 8M, then you'd get approximately 320 Meg max used by
> all the sorts flying at the same time.  You can see why high work_mem
> and high max_connections settings together can be dangerous, and why
> pooling connections to limit the possibility of such a thing is useful
> too.
> 
> Generally it's a good idea to keep it in the 4 to 16 meg range on most
> machines to prevent serious issues, but if you're going to allow 100s
> of connections at once, then you need to look at limiting it based on
> how much memory your server has.

I do have one thing to add: if you're using 8.3, there's a log_temp_files
config variable that you can use to monitor when your sorts spill over
onto disk.  It doesn't change anything that Scott said, it simply gives
you another way to monitor what's happening and thus have better
information to tune by.
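
To put rough numbers on Scott's budgeting, a sketch with assumed values
(swap in your own RAM size, connection count and work_mem):

    # Worst-case sort memory = users * sorts-per-query * work_mem, compared
    # against the ~1/4-of-RAM headroom suggested for sorts.
    MB, GB = 1024**2, 1024**3

    total_ram        = 8 * GB          # assumed server RAM
    sort_budget      = total_ram // 4
    concurrent_users = 20
    sorts_per_query  = 2
    work_mem         = 8 * MB

    worst_case = concurrent_users * sorts_per_query * work_mem
    print("worst case sort memory: %d MB" % (worst_case // MB))             # 320
    print("work_mem ceiling for this budget: %d MB"
          % (sort_budget // (concurrent_users * sorts_per_query) // MB))    # ~51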

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023
