Re: [PERFORM] Read/Write block sizes

2005-08-25 Thread Ron

At 04:49 PM 8/25/2005, Chris Browne wrote:

[EMAIL PROTECTED] (Ron) writes:
> At 03:45 PM 8/25/2005, Josh Berkus wrote:
>> > Ask me sometime about my replacement for GNU sort.  It uses the
>> > same sorting algorithm, but it's an order of magnitude faster due
>> > to better I/O strategy.  Someday, in my infinite spare time, I
>> > hope to demonstrate that kind of improvement with a patch to pg.
>>
>>Since we desperately need some improvements in sort performance, I
>>do hope you follow up on this.
>
> I'll generalize that.  IMO we desperately need any and all
> improvements in IO performance.  Even more so than we need
> improvements in sorting or sorting IO performance.

That's frankly a step backwards.  Feel free to "specialise" that instead.


We can agree to disagree, I'm cool with that.

I'm well aware that a Systems Approach to SW 
Architecture is not always popular in the Open 
Source world.  Nonetheless, my POV is that if we 
want to be taken seriously and beat "the big 
boys", we have to do everything smarter and 
faster, as well as cheaper, than they do.  You 
are not likely to be able to do that consistently 
without using some of the "icky" stuff one is 
required to study as part of formal training in 
the Comp Sci and SW Engineering fields.



A patch that improves some specific aspect of 
performance is a thousand times better than any 
sort of "desperate desire for any and
all improvements in I/O performance."


Minor twisting of my words: substituting "desire" 
for "need".  The need is provable.  Just put "the 
big 5" (SQL Server, Oracle, DB2, MySQL, and 
PostgreSQL) into some realistic benchmarks to see that.


Major twisting of my words: the apparent 
implication by you that I don't appreciate 
improvements in the IO behavior of specific 
things like sorting as much as I'd appreciate 
more "general" IO performance 
improvements.  Performance optimization is best 
done as an iterative improvement process that 
starts with measuring where the need is greatest, 
then improving that greatest need by the most you 
can, then repeating the whole cycle.  _Every_ 
improvement in such a process is a specific 
improvement, even if the improvement is a 
decision to re-architect the entire product to 
solve the current biggest issue.  Improving 
sorting IO is cool.  OTOH, if pg's biggest IO 
problems are elsewhere, then the amount of 
overall benefit we will get from improving 
sorting IO is going to be minimized until we 
improve the bigger problem(s).  Amdahl's Law.
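
(For reference, Amdahl's Law here: if a fraction 
f of total runtime is in the part you speed up 
by a factor s, the overall speedup is 
1 / ((1 - f) + f/s).  Purely as an illustration, 
not a measurement: were sorting IO 10% of a 
workload, even an infinite sort speedup caps the 
overall gain at about 1.11x, while merely halving 
something that is 60% of the runtime yields 
about 1.43x.)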



The "specialized patch" is also pointedly better 
in that a *confidently submitted* patch is 
likely to be way better than any sort of 
"desperate clutching at whatever may come to hand."


Another distortion of my statement and POV.  I 
never suggested nor implied any sort of 
"desperate clutching...".  We have _measurable_ 
IO issues that need to be addressed in order for 
pg to be a better competitor in the 
marketplace.  Just as we do with sorting performance.



Far too often, I see people trying to address 
performance problems via the "desperate 
clutching at whatever seems near to hand," and that
generally turns out very badly as a particular 
result of the whole "desperate clutching" part.


If you can get a sort improvement submitted, that's a concrete improvement...


As I said, I'm all in favor of concrete, 
measurable improvement.  I do not think I ever 
stated I was in favor of anything else.


You evidently are mildly ranting because you've 
seen some examples of poor SW Engineering 
Discipline/Practice by people with perhaps 
inadequate skills for the issues they were trying 
to address.  We all have.  "90% of everything is 
dreck (i.e., of too low a quality)."


OTOH, I do not think I've given you any reason to 
think I lack such Clue, nor do I think my post was advocating such thrashing.


My post was intended to say that we need an 
Overall Systems Approach to pg optimization 
rather than just applying what compiler writers 
call "peephole optimizations" to pg.  No more, no less.


I apologize if I somehow misled you,
Ron Peacetree



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Read/Write block sizes

2005-08-25 Thread Chris Browne
[EMAIL PROTECTED] (Ron) writes:
> At 03:45 PM 8/25/2005, Josh Berkus wrote:
>> > Ask me sometime about my replacement for GNU sort.  It uses the
>> > same sorting algorithm, but it's an order of magnitude faster due
>> > to better I/O strategy.  Someday, in my infinite spare time, I
>> > hope to demonstrate that kind of improvement with a patch to pg.
>>
>>Since we desperately need some improvements in sort performance, I
>>do hope you follow up on this.
>
> I'll generalize that.  IMO we desperately need any and all
> improvements in IO performance.  Even more so than we need
> improvements in sorting or sorting IO performance.

That's frankly a step backwards.

Feel free to "specialise" that instead.  

A patch that improves some specific aspect of performance is a
thousand times better than any sort of "desperate desire for any and
all improvements in I/O performance."

The latter is unlikely to provide any usable result.

The "specialized patch" is also pointedly better in that a
*confidently submitted* patch is likely to be way better than any sort
of "desperate clutching at whatever may come to hand."

Far too often, I see people trying to address performance problems via
the "desperate clutching at whatever seems near to hand," and that
generally turns out very badly as a particular result of the whole
"desperate clutching" part.

If you can get a sort improvement submitted, that's a concrete
improvement...
-- 
select 'cbbrowne' || '@' || 'ntlug.org';
http://www3.sympatico.ca/cbbrowne/lisp.html
Appendium to  the Rules  of the  Evil Overlord #1:  "I will  not build
excessively integrated  security-and-HVAC systems. They  may be Really
Cool, but are far too vulnerable to breakdowns."

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Read/Write block sizes

2005-08-25 Thread Ron

At 03:45 PM 8/25/2005, Josh Berkus wrote:

Jeff,

> Ask me sometime about my replacement for GNU sort.  It uses the same
> sorting algorithm, but it's an order of magnitude faster due to better
> I/O strategy.  Someday, in my infinite spare time, I hope to demonstrate
> that kind of improvement with a patch to pg.

Since we desperately need some improvements in sort performance, I do hope
you follow up on this.

--
--Josh


I'll generalize that.  IMO we desperately need 
any and all improvements in IO performance.  Even 
more so than we need improvements in sorting or sorting IO performance.


Ron



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Read/Write block sizes

2005-08-25 Thread Josh Berkus
Jeff,

> Ask me sometime about my replacement for GNU sort.  It uses the same
> sorting algorithm, but it's an order of magnitude faster due to better
> I/O strategy.  Someday, in my infinite spare time, I hope to demonstrate
> that kind of improvement with a patch to pg.

Since we desperately need some improvements in sort performance, I do hope 
you follow up on this.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread Lance Obermeyer
Since Bruce referred to the "corporate software world" I'll chime in...

It has been a while since adding knobs and dials has been considered a good 
idea.  Customers are almost always bad at tuning their systems, which decreases 
customer satisfaction.  While many people assume the corporate types don't 
care, that is actually far from the truth.  Well run commercial software 
companies regularly commission (expensive) customer satisfaction surveys.  
These numbers are the second most important numbers in all of the enterprise, 
trailing only revenue in importance.  Results are sliced and diced in every way 
imaginable.

The commercial world is trying to auto-tune their systems just as much.  
Examples are the work that many of the big boys are doing towards "autonomic" 
computing.  While it is driven by naked self interest of wanting to sell 
version upgrades, those efforts increase customer satisfaction and decrease 
support costs.  Works well for everyone...



-Original Message-
From: Bruce Momjian [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 8:52 AM
To: Jignesh K. Shah
Cc: Jim Nasby; Chris Browne; pgsql-performance@postgresql.org
Subject: Re: Read/Write block sizes



This thread covers several performance ideas.  First is the idea that
more parameters should be configurable.   While this seems like a noble
goal, we try to make parameters auto-tuning, or if users have to
configure it, the parameter should be useful for a significant number of
users.

In the commercial software world, if you can convince your boss that a
feature/knob is useful, it usually gets into the product. 
Unfortunately, this leads to the golden doorknob on a shack, where some
features are out of sync with the rest of the product in terms of
usefulness and utility.  With open source, if a feature can not be
auto-tuned, or has significant overhead, the feature has to be
implemented and then proven to be a benefit.

In terms of adding async I/O, threading, and other things, it might make
sense to explore how these could be implemented in a way that fits the
above criteria.

---

Jignesh K. Shah wrote:
> Hi Jim,
> 
> | How many of these things are currently easy to change with a recompile?
> | I should be able to start testing some of these ideas in the near
> | future, if they only require minor code or configure changes.
> 
> 
> The following
> * Data File Size   1GB
> * WAL File Size of 16MB
> * Block Size  of 8K
> 
> Are very easy to change with a recompile.. A Tunable will be greatly 
> preferred as it will allow one binary for different tunings
> 
> * MultiBlock read/write
> 
> Is not available but will greatly help in reducing the number of system 
> calls which will only increase as the size of the database increases if 
> something is not done about it.
> 
> * Pregrown files... maybe not important at this point since TABLESPACE 
> can currently work around it a bit (Just need to create a different file 
> system for each tablespace)
> 
> But if you really think hardware & OS  is the answer for all small 
> things.. I think we should now start to look on how to make Postgres 
> Multi-threaded or multi-processed for each connection. With the influx 
> of  "Dual-Core" or "Multi-Core" being the fad Postgres can have the 
> cutting edge if somehow exploiting cores is designed.
> 
> Somebody mentioned that adding CPU to Postgres workload halved the 
> average CPU  usage...
> YEAH... PostgreSQL  uses only 1 CPU per connection (assuming 100% 
> usage)  so if you add another CPU it is idle anyway and the system will 
> report only 50%  :-) BUT the important thing to measure is.. whether the query 
> time was cut down or not? ( No flames I am sure you were talking about 
> multi-connection multi-user environment :-) ) But my point is then whether this 
> approach is worth the ROI and the time and effort spent to solve this 
> problem.
> 
> I actually vote for a multi-threaded solution for each connection while 
> still maintaining separate processes for each connection... This way the 
> fundamental architecture of Postgres doesn't change, however a 
> multi-threaded connection can then start to exploit different cores.. 
> (Maybe have tunables for number of threads to read data files who 
> knows.. If somebody is interested in actually working on a design .. 
> contact me and I will be interested in assisting this work.)
> 
> Regards,
> Jignesh
> 
> 
> Jim C. Nasby wrote:
> 
> >On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
> >  
> >
> >>[EMAIL PROTECTED] (Jignesh Shah) writes:
> >>
> >>
> Does that include increasing the size of read/write blocks? I've
> noticed that with a large enough table it takes a while to do a
> sequential scan, even if it's cached; I wonder if the fact that it
> takes a million read(2) calls to get through an 8G table is part of
> that.
> 
> 
> >>>Actually some of that read

Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread Jim C. Nasby
On Wed, Aug 24, 2005 at 12:12:22PM -0400, Chris Browne wrote:
> Everyone involved in development seems to me to have a reasonably keen
> understanding as to what the potential benefits of threading are; the
> value is that there fall out plenty of opportunities to parallelize
> the evaluation of portions of queries.  Alas, it wouldn't be until
> *after* all the effort goes in that we would get any idea as to what
> kinds of speedups this would provide.

My understanding is that the original suggestion was to use threads
within individual backends to allow for parallel query execution, not
switching to a completely thread-based model.

In any case, there are other ways to enable parallelism without using
threads, such as handing actual query execution off to a set of
processes.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software    http://pervasive.com    512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread Jignesh Shah
Agreed!!!

But the knowledge to Auto-tune your application comes from years of 
understanding of how users are using the so-called "knobs".. But if the "knobs" 
are not there in the first place.. how do you know what people are using?

The "so-called" big boys are also using their knowledge base of what works for 
the customer in their autonomic self healers and its based on the experience of 
all the settings possible and based on service requests on what had failed that 
they get the knowledge about avoiding what fails and tuning what works. 

Remember "recompiling" is a risk with upteem number of variables which not 
every production release engineer is happy about.

It's easy to change the knob back to the previous value rather than trying to 
figure out how to get my old binaries back.


-Jignesh


- Original Message -
From: Lance Obermeyer <[EMAIL PROTECTED]>
Date: Wednesday, August 24, 2005 4:10 pm
Subject: RE: Read/Write block sizes

> Since Bruce referred to the "corporate software world" I'll chime 
> in...
> It has been a while since adding knobs and dials has been 
> considered a good idea.  Customers are almost always bad at tuning 
> their systems, which decreases customer satisfaction.  While many 
> people assume the corporate types don't care, that is actually far 
> from the truth.  Well run commercial software companies regularly 
> commission (expensive) customer satisfaction surveys.  These 
> numbers are the second most important numbers in all of the 
> enterprise, trailing only revenue in importance.  Results are 
> sliced and diced in every way imaginable.
> 
> The commercial world is trying to auto-tune their systems just as 
> much.  Examples are the work that many of the big boys are doing 
> towards "autonomic" computing.  While it is driven by naked self 
> interest of wanting to sell version upgrades, those efforts 
> increase customer satisfaction and decrease support costs.  Works 
> well for everyone...
> 
> 
> 
> -Original Message-
> From: Bruce Momjian [EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 8:52 AM
> To: Jignesh K. Shah
> Cc: Jim Nasby; Chris Browne; pgsql-performance@postgresql.org
> Subject: Re: Read/Write block sizes
> 
> 
> 
> This thread covers several performance ideas.  First is the idea that
> more parameters should be configurable.   While this seems like a 
> noble goal, we try to make parameters auto-tuning, or if users have to
> configure it, the parameter should be useful for a significant 
> number of
> users.
> 
> In the commercial software world, if you can convince your boss 
> that a
> feature/knob is useful, it usually gets into the product. 
> Unfortunately, this leads to the golden doorknob on a shack, where 
> some features are out of sync with the rest of the product in terms of
> usefulness and utility.  With open source, if a feature can not be
> auto-tuned, or has significant overhead, the feature has to be
> implemented and then proven to be a benefit.
> 
> In terms of adding async I/O, threading, and other things, it might 
> make sense to explore how these could be implemented in a way that 
> fits the
> above criteria.
> 
> 
> ---
> 
> Jignesh K. Shah wrote:
> > Hi Jim,
> > 
> > | How many of these things are currently easy to change with a recompile?
> > | I should be able to start testing some of these ideas in the near
> > | future, if they only require minor code or configure changes.
> > 
> > 
> > The following
> > * Data File Size   1GB
> > * WAL File Size of 16MB
> > * Block Size  of 8K
> > 
> > Are very easy to change with a recompile.. A Tunable will be 
> greatly 
> > preferred as it will allow one binary for different tunings
> > 
> > * MultiBlock read/write
> > 
> > Is not available but will greatly help in reducing the number of 
> system 
> > calls which will only increase as the size of the database 
> increases if 
> > something is not done about it.
> > 
> > * Pregrown files... maybe not important at this point since 
> TABLESPACE 
> > can currently work around it a bit (Just need to create a 
> different file 
> > system for each tablespace)
> > 
> > But if you really think hardware & OS  is the answer for all 
> small 
> > things.. I think we should now start to look on how to make 
> Postgres 
> > Multi-threaded or multi-processed for each connection. With the 
> influx 
> > of  "Dual-Core" or "Multi-Core" being the fad Postgres can 
> have the 
> > cutting edge if somehow exploiting cores is designed.
> > 
> > Somebody mentioned that adding CPU to Postgres workload halved 
> the 
> > average CPU  usage...
> > YEAH... PostgreSQL  uses only 1 CPU per connection (assuming 100% 
> > usage)  so if you add another CPU it is idle anyway and the 
> system will 
> > report only 50%  :-) BUT the importing to measure is.. whether 
> the query 
> > time was cut down or not? ( No flames I am sure you were talking 
> about 
> >

Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread Josh Berkus
Tom, Gavin,


> > To get decent I/O you need 1MB fundamental units all the way down the
> > stack.
>
> It would also be a good idea to have an application that isn't likely
> to change a single bit in a 1MB range and then expect you to record
> that change.  This pretty much lets Postgres out of the picture.

We're looking at this pretty much just for data warehousing, where you 
constantly have gigabytes of data which don't change from month to month or 
even year to year.   I agree that it would *not* be an optimization for OLTP 
systems.  Which is why a build-time option would be fine.

> Ummm... I don't see anything here which will be a win for Postgres. The
> transactional semantics we're interested in are fairly complex:
>
> 1) Modifications to multiple objects can become visible to the system
> atomically
> 2) On error, a series of modifications which had been grouped together
> within a transaction can be rolled back
> 3) Using object version information, determine which version of which
> object is visible to a given session
> 4) Using version information and locking, detect and resolve read/write
> and write/write conflicts

I wasn't thinking of database transactions.  I was thinking specifically of 
using Reiser4 transactions (and other transactional filesystems) to do things 
like eliminate the need for full page writes in the WAL.  Filesystems are 
low-level things which should take care of low-level needs, like making sure 
an 8K page got written to disk even in the event of a system failure.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread Chris Browne
[EMAIL PROTECTED] (Steve Poe) writes:
> Chris,
>
> Unless I am wrong, you're making the assumption that the amount of time spent
> and ROI is known. Maybe those who've been down this path know how to get
> that additional 2-4% in 30 minutes or less? 
>
> While each person and business' performance gains (or not) could vary,
> someone spending the 50-100h to gain 2-4% over a course of a month for a
> 24x7 operation would seem worth the investment?  

What we *do* know is that adding these "knobs" would involve a
significant amount of effort, as the values are widely used throughout
the database engine.  Making them dynamic (e.g. - so they could be
tuned on a tablespace-by-tablespace basis) would undoubtedly require
rather a lot of development effort.  They are definitely NOT 30 minute
changes.

Moreover, knowing how to adjust them is almost certainly also NOT a 30
minute configuration change; significant benchmarking effort for the
individual application is almost sure to be needed.

It's not much different from the reason why PostgreSQL doesn't use
threading...

The problem with using threading is that introducing it to the code
base would require a pretty enormous amount of effort (I'll bet
multiple person-years), and it wouldn't provide *any* benefit until
you get rather a long ways down the road.

Everyone involved in development seems to me to have a reasonably keen
understanding as to what the potential benefits of threading are; the
value is that there fall out plenty of opportunities to parallelize
the evaluation of portions of queries.  Alas, it wouldn't be until
*after* all the effort goes in that we would get any idea as to what
kinds of speedups this would provide.

In effect, there has to be a year invested in *breaking* PostgreSQL
(because this would initially break a lot, since thread programming is
a really tough skill) where you don't actually see any benefits.

> I would assume that dbt2 with STP helps minimize the amount of hours
> someone has to invest to determine performance gains with
> configurable options?

That's going to help in constructing a "default" knob value.  And if
we find an "optimal default," that encourages sticking with the
current approach, of using #define to apply that value...

>> If someone spends 100h working on one of these items, and gets a 2%
>> performance improvement, that's almost certain to be less desirable
>> than spending 50h on something else that gets a 4% improvement.
>> 
>> And we might discover that memory management improvements in Linux
>> 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
>> improvements "for free" behind our backs without *any* need to write
>> database code.  :-)
-- 
let name="cbbrowne" and tld="ntlug.org" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/lisp.html
"For those  of you who are  into writing programs that  are as obscure
and complicated  as possible, there are opportunities  for... real fun
here" -- Arthur Norman

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread Bruce Momjian

This thread covers several performance ideas.  First is the idea that
more parameters should be configurable.   While this seems like a noble
goal, we try to make parameters auto-tuning, or if users have to
configure it, the parameter should be useful for a significant number of
users.

In the commercial software world, if you can convince your boss that a
feature/knob is useful, it usually gets into the product. 
Unfortunately, this leads to the golden doorknob on a shack, where some
features are out of sync with the rest of the product in terms of
usefulness and utility.  With open source, if a feature can not be
auto-tuned, or has significant overhead, the feature has to be
implemented and then proven to be a benefit.

In terms of adding async I/O, threading, and other things, it might make
sense to explore how these could be implemented in a way that fits the
above criteria.

---

Jignesh K. Shah wrote:
> Hi Jim,
> 
> | How many of these things are currently easy to change with a recompile?
> | I should be able to start testing some of these ideas in the near
> | future, if they only require minor code or configure changes.
> 
> 
> The following
> * Data File Size   1GB
> * WAL File Size of 16MB
> * Block Size  of 8K
> 
> Are very easy to change with a recompile.. A Tunable will be greatly 
> preferred as it will allow one binary for different tunings
> 
> * MultiBlock read/write
> 
> Is not available but will greatly help in reducing the number of system 
> calls which will only increase as the size of the database increases if 
> something is not done about it.
> 
> * Pregrown files... maybe not important at this point since TABLESPACE 
> can currently work around it a bit (Just need to create a different file 
> system for each tablespace)
> 
> But if you really think hardware & OS  is the answer for all small 
> things.. I think we should now start to look on how to make Postgres 
> Multi-threaded or multi-processed for each connection. With the influx 
> of  "Dual-Core" or "Multi-Core" being the fad Postgres can have the 
> cutting edge if somehow exploiting cores is designed.
> 
> Somebody mentioned that adding CPU to Postgres workload halved the 
> average CPU  usage...
> YEAH... PostgreSQL  uses only 1 CPU per connection (assuming 100% 
> usage)  so if you add another CPU it is idle anyway and the system will 
> report only 50%  :-) BUT the important thing to measure is.. whether the query 
> time was cut down or not? ( No flames I am sure you were talking about 
> multi-connection multi-user environment :-) ) But my point is then whether this 
> approach is worth the ROI and the time and effort spent to solve this 
> problem.
> 
> I actually vote for a multi-threaded solution for each connection while 
> still maintaining separate processes for each connection... This way the 
> fundamental architecture of Postgres doesn't change, however a 
> multi-threaded connection can then start to exploit different cores.. 
> (Maybe have tunables for number of threads to read data files who 
> knows.. If somebody is interested in actually working on a design .. 
> contact me and I will be interested in assisting this work.)
> 
> Regards,
> Jignesh
> 
> 
> Jim C. Nasby wrote:
> 
> >On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
> >  
> >
> >>[EMAIL PROTECTED] (Jignesh Shah) writes:
> >>
> >>
> Does that include increasing the size of read/write blocks? I've
> noticed that with a large enough table it takes a while to do a
> sequential scan, even if it's cached; I wonder if the fact that it
> takes a million read(2) calls to get through an 8G table is part of
> that.
> 
> 
> >>>Actually some of that readaheads,etc the OS does already if it does
> >>>some sort of throttling/clubbing of reads/writes. But its not enough
> >>>for such types of workloads.
> >>>
> >>>Here is what I think will help:
> >>>
> >>>* Support for different Blocksize TABLESPACE without recompiling the
> >>>code.. (Atlease support for a different Blocksize for the whole
> >>>database without recompiling the code)
> >>>
> >>>* Support for bigger sizes of WAL files instead of 16MB files
> >>>WITHOUT recompiling the code.. Should be a tuneable if you ask me
> >>>(with checkpoint_segments at 256.. you have too many 16MB files in
> >>>the log directory) (This will help OLTP benchmarks more since now
> >>>they don't spend time rotating log files)
> >>>
> >>>* Introduce a multiblock or extent tunable variable where you can
> >>>define a multiple of 8K (or BlockSize tuneable) to read a bigger
> >>>chunk and store it in the bufferpool.. (Maybe writes too) (Most
> >>>devices now support upto 1MB chunks for reads and writes)
> >>>
> >>>*There should be a way to preallocate files for TABLES in
> >>>TABLESPACES otherwise with multiple table writes in the same
> >>>filesystem ends with fragmented files which causes poor "READS" from
> >>>t

Re: [PERFORM] Read/Write block sizes

2005-08-24 Thread PFC



> of effort reinventing the wheel ... but our time will be repaid much
> more if we work at levels that the OS cannot have knowledge of, such as
> join planning and data statistics.


Considering a global budget of man-hours which is the best ?

1- Spend it on reimplementing half of VFS in postgres, half of Windows in  
postgres, half of FreeBSD in postgres, half of Solaris in Postgres, only  
to discover you gain a meagre speed increase and a million and a half bugs,


2- Spending 5% of that time lowering the impedance between the OS and  
Postgres, and another 5% annoying Kernel people and helping them tweaking  
stuff for database use, and the rest on useful features that give useful  
speedups, like bitmap indexes, skip scans, and other features that enhance  
power and usability ?


If you're Oracle and have almost unlimited resources, maybe. But even  
Microsoft opted for option 2 : they implemented ReadFileGather and  
WriteFileScatter to lower the syscall overhead and that's it.


And point 2 will benefit many other apps, whereas 1 would benefit only  
postgres, and then only in certain cases.


I do believe there is something interesting to uncover with reiser4 though  
(it definitely fits point 2).


I'm happy that the pg team chose point 2 and that new versions keep coming  
with new features at an unbelievable rate these times. Do you guys sleep ?


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Wed, 2005-08-24 at 01:56 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <[EMAIL PROTECTED]> writes:
> > On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
> >> Dont forget that already in postgres, you have a process per connection, 
> >> and
> >> all the processes take care of their own I/O.
> 
> > That's the problem.  Instead you want 1 or 4 or 10 i/o slaves
> > coordinating the I/O of all the backends optimally.  For instance, with
> > synchronous scanning.
> 
> And why exactly are we going to do a better job of I/O scheduling than
> the OS itself can do?
...
> There are some things we could do to reduce the impedance between us and
> the OS --- for instance, the upthread criticism that a seqscan asks the
> OS for only 8K at a time is fair enough.  But that doesn't translate
> to a conclusion that we should schedule the I/O instead of the OS.

Synchronous scanning is a fairly huge and obvious win.  If you have two
processes 180 degrees out-of-phase in a linear read, neither process is
going to get anywhere near the throughput they would get from a single
scan.

I think you're being deliberately obtuse with regards to file I/O and
the operating system.  The OS isn't magical.  It has to strike a balance
between a reasonable read latency and a reasonable throughput.  As far
as the kernel is concerned, a busy postgresql server is
indistinguishable from 100 unrelated activities.  All backends will be
served equally, even if in this case "equally" means "quite badly all
around."

An I/O slave process could be a big win in Postgres for many kinds of
reads.  Instead of opening and reading files the backends would connect
to the I/O slave and request the file be read.  If a scan of that file
were already underway, the new backends would be attached.  Otherwise a
new scan would commence.  In either case, the slave process can issue
(sometimes non-dependant) reads well ahead of the needs of the backend.
You may think the OS can do this for you but it can't.  Only postgres
knows that it needs the whole file from beginning to end.  The OS can
only guess.
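
A minimal sketch of that readahead idea, for illustration only (this is
not Postgres code; the chunk size and error handling are invented): a
scanner that declares its access pattern up front and asks the kernel
to prefetch the next chunk via posix_fadvise(2) while it processes the
current one.

    /* Illustrative sketch only: hint-driven sequential reader. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)            /* work in 1MB chunks */

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Declare the access pattern: we will read the whole file in order. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char   *buf = malloc(CHUNK);
        off_t   off = 0;
        ssize_t n;

        while ((n = pread(fd, buf, CHUNK, off)) > 0)
        {
            /* Ask for the next chunk before working on this one, so the
             * disk stays busy while the CPU chews on the current data. */
            posix_fadvise(fd, off + n, CHUNK, POSIX_FADV_WILLNEED);

            /* ... process buf[0 .. n) here ... */
            off += n;
        }

        free(buf);
        close(fd);
        return 0;
    }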

Ask me sometime about my replacement for GNU sort.  It uses the same
sorting algorithm, but it's an order of magnitude faster due to better
I/O strategy.  Someday, in my infinite spare time, I hope to demonstrate
that kind of improvement with a patch to pg.

-jwb


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Tom Lane
"Jeffrey W. Baker" <[EMAIL PROTECTED]> writes:
> On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
>> Dont forget that already in postgres, you have a process per connection, and
>> all the processes take care of their own I/O.

> That's the problem.  Instead you want 1 or 4 or 10 i/o slaves
> coordinating the I/O of all the backends optimally.  For instance, with
> synchronous scanning.

And why exactly are we going to do a better job of I/O scheduling than
the OS itself can do?

There's a fairly basic disconnect in viewpoint involved here.  The
old-school viewpoint (as embodied in Oracle and a few other DBMSes)
is that the OS is too stupid to be worth anything, and the DB should
bypass the OS to the greatest extent possible, doing its own caching,
disk space layout, I/O scheduling, yadda yadda.  That might have been
defensible twenty-odd years ago when Oracle was designed.  Postgres
prefers to lay off to the OS anything that the OS can do well --- and
that definitely includes caching and I/O scheduling.  There are a whole
lot of smart people working on those problems at the OS level.  Maybe we
could make marginal improvements on their results after spending a lot
of effort reinventing the wheel ... but our time will be repaid much
more if we work at levels that the OS cannot have knowledge of, such as
join planning and data statistics.

There are some things we could do to reduce the impedance between us and
the OS --- for instance, the upthread criticism that a seqscan asks the
OS for only 8K at a time is fair enough.  But that doesn't translate
to a conclusion that we should schedule the I/O instead of the OS.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
> As for the async IO, sure you might think 'oh async IO would be so cool!!'
> and I did, once, too. But then I sat down and _thought_ about it, and
> decided well, no, actually, theres _very_ few areas it could actually help,
> and in most cases it just make it easier to drive your box into lseek()
> induced IO collapse.
> 
> Dont forget that already in postgres, you have a process per connection, and
> all the processes take care of their own I/O.

That's the problem.  Instead you want 1 or 4 or 10 i/o slaves
coordinating the I/O of all the backends optimally.  For instance, with
synchronous scanning.

-jwb


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Guy Thornley
> Unfortunately I'm really afraid that this conversation is about trees
> when the forest is the problem.  PostgreSQL doesn't even have an async
> reader, which is the sort of thing that could double or triple its
> performance.  You're talking about block sizes and such, but the kinds
> of improvements you can get there are in the tens of percents at most.

Not 100% sure, but I'm fairly certain we were seeing significant performance
degradation by too much _scheduled_ I/O activity

ie: too much work being submitted to the kernel, due to excessive
parallelism already!!

The classic example of this is a seqscan being interleaved with an index scan,
and the disks end up doing nothing but seek activity

Out of all the stuff talked about on this thread so far, only tweaking the
block size (and the madvise() stuff) makes any real-world sense, as its the
only thing talked about that increases the _work_per_seek_.

As for the async IO, sure you might think 'oh async IO would be so cool!!'
and I did, once, too. But then I sat down and _thought_ about it, and
decided well, no, actually, there's _very_ few areas it could actually help,
and in most cases it just makes it easier to drive your box into lseek()
induced IO collapse.

Don't forget that already in postgres, you have a process per connection, and
all the processes take care of their own I/O.

Somebody mentioned having threaded backends too, but the only benefit would
be reduced memory footprint (a backend consumes 1-2MB of RAM, which is
almost enough to be a concern for largish systems with a lot of backends)
but personally I _know_ the complexities introduced through threading are
usually not worth it.


IMVVHO (naive experience) what is needed is a complete architecture change
(probably infeasible and only useful as a thought experiment), where:

* a network I/O process deals with client connections
* a limited pool of worker processes deal with statements (perhaps related
  to number of spindles somehow)

so when a client issues a statement, the net-IO process simply forwards the
connection state to a worker process and says 'deal with this'.
(Clearly the state object needs to contain all user and transaction state
the connection is involved in).

- Guy Thornley

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Tom Lane
"Jeffrey W. Baker" <[EMAIL PROTECTED]> writes:
> To get decent I/O you need 1MB fundamental units all the way down the
> stack.

It would also be a good idea to have an application that isn't likely
to change a single bit in a 1MB range and then expect you to record
that change.  This pretty much lets Postgres out of the picture.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jeffrey W. Baker
On Tue, 2005-08-23 at 19:31 -0700, Josh Berkus wrote:
> Steve,
> 
> > I would assume that dbt2 with STP helps minimize the amount of hours
> > someone has to invest to determine performance gains with configurable
> > options?
> 
> Actually, these I/O operation issues show up mainly with DW workloads, so the 
> STP isn't much use there.   If I can ever get some of these machines back 
> from the build people, I'd like to start testing some stuff.
> 
> One issue with testing this is that currently PostgreSQL doesn't support 
> block 
> sizes above 128K.  We've already done testing on that (well, Mark has) and 
> the performance gains aren't even worth the hassle of remembering you're on a 
> different block size (like, +4%).
> 
> What the Sun people have done with other DB systems is show that substantial 
> performance gains are possible on large databases (>100G) using block sizes 
> of 1MB.   I believe that's possible (and that it probably makes more of a 
> difference on Solaris than on BSD) but we can't test it without some hackery 
> first.

To get decent I/O you need 1MB fundamental units all the way down the
stack.  You need a filesystem that can take a 1MB write well, and you
need an I/O scheduler that will keep it together, and you need a storage
controller that can eat a 1MB request at once.  Ideally you'd like an
architecture with a 1MB page (Itanium has this, and AMD64 Linux will
soon have this.)  The Lustre people have done some work in this area,
opening up the datapaths in the kernel so they can keep the hardware
really working.  They even modified the QLogic SCSI/FC driver so it
supports such large transfers.  Their work has shown that you can get
significant perf boost on Linux just by thinking in terms of larger
transfers.

Unfortunately I'm really afraid that this conversation is about trees
when the forest is the problem.  PostgreSQL doesn't even have an async
reader, which is the sort of thing that could double or triple its
performance.  You're talking about block sizes and such, but the kinds
of improvements you can get there are in the tens of percents at most.

-jwb


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Alan Stange

Josh Berkus wrote:


> Steve,
>
>> I would assume that dbt2 with STP helps minimize the amount of hours
>> someone has to invest to determine performance gains with configurable
>> options?
>
> Actually, these I/O operation issues show up mainly with DW workloads, so the 
> STP isn't much use there.   If I can ever get some of these machines back 
> from the build people, I'd like to start testing some stuff.
>
> One issue with testing this is that currently PostgreSQL doesn't support block 
> sizes above 128K.  We've already done testing on that (well, Mark has) and 
> the performance gains aren't even worth the hassle of remembering you're on a 
> different block size (like, +4%).

What size database was this on?

> What the Sun people have done with other DB systems is show that substantial 
> performance gains are possible on large databases (>100G) using block sizes 
> of 1MB.   I believe that's possible (and that it probably makes more of a 
> difference on Solaris than on BSD) but we can't test it without some hackery 
> first.


We're running on a 100+GB database, with long streams of 8KB reads with 
the occasional _llseek().  I've been thinking about running with a 
larger blocksize with the expectation that we'd see fewer system calls 
and a bit more throughput.


read() calls are a very expensive way to get 8KB of memory (that we know 
is already resident) during scans.  One has to trap into the kernel, do 
the usual process state accounting, find the block, copy the memory to 
userspace, return back from the kernel to user space reversing all the 
process accounting, pick out the bytes one needs, and repeat all over 
again.  That's quite a few sacrificial cache lines for 8KB.   Yeah, 
sure, Linux syscalls are fast, but they aren't that fast, and other 
operating systems (windows and solaris) have a bit more overhead on 
syscalls.
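
As a rough, hedged illustration of that cost (not from this thread; the 
file is whatever you point it at, and absolute numbers will vary wildly 
by OS and hardware), a tiny program that scans the same file twice, 
once with 8KB read() calls and once with 1MB read() calls, and prints 
the wall-clock time of each; on a fully cached file the gap is 
essentially pure syscall-and-copy overhead:

    /* Illustrative microbenchmark: same file, two read() sizes. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double scan(const char *path, size_t bufsz)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); exit(1); }

        char *buf = malloc(bufsz);
        struct timeval t0, t1;
        ssize_t n;

        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bufsz)) > 0)
            ;                        /* we only care about the syscall cost */
        gettimeofday(&t1, NULL);

        free(buf);
        close(fd);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

        printf("8KB reads: %.3f s\n", scan(argv[1], 8 * 1024));
        printf("1MB reads: %.3f s\n", scan(argv[1], 1024 * 1024));
        return 0;
    }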


Regarding large blocks sizes on Solaris:  the Solaris folks can also use 
large memory pages and avoid a lot of the TLB overhead  from the VM 
system.  The various trapstat and cpustat commands can be quite 
interesting to look at when running any large application on a Solaris 
system. 

It should be noted that having a large shared memory segment can be a 
performance loser just from the standpoint of TLB thrashing.  O(GB) 
memory access patterns can take a huge performance hit in user space 
with 4K pages compared to the kernel which would be mapping the "segmap" 
(in Solaris parlance) with 4MB pages.


Anyway, I guess my point is that the balance between kernel managed vs. 
postgresql managed buffer isn't obvious at all.


-- Alan

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Josh Berkus
Steve,

> I would assume that dbt2 with STP helps minimize the amount of hours
> someone has to invest to determine performance gains with configurable
> options?

Actually, these I/O operation issues show up mainly with DW workloads, so the 
STP isn't much use there.   If I can ever get some of these machines back 
from the build people, I'd like to start testing some stuff.

One issue with testing this is that currently PostgreSQL doesn't support block 
sizes above 128K.  We've already done testing on that (well, Mark has) and 
the performance gains aren't even worth the hassle of remembering you're on a 
different block size (like, +4%).

What the Sun people have done with other DB systems is show that substantial 
performance gains are possible on large databases (>100G) using block sizes 
of 1MB.   I believe that's possible (and that it probably makes more of a 
difference on Solaris than on BSD) but we can't test it without some hackery 
first.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jignesh K. Shah

Hi Jim,

| How many of these things are currently easy to change with a recompile?
| I should be able to start testing some of these ideas in the near
| future, if they only require minor code or configure changes.


The following
* Data File Size   1GB
* WAL File Size of 16MB
* Block Size  of 8K

Are very easy to change with a recompile.. A Tunable will be greatly 
preferred as it will allow one binary for different tunings (a sketch of 
the corresponding compile-time constants follows this list)


* MultiBlock read/write

Is not available but will greatly help in reducing the number of system 
calls which will only increase as the size of the database increases if 
something is not done about it.


* Pregrown files... maybe not important at this point since TABLESPACE 
can currently work around it a bit (Just need to create a different file 
system for each tablespace)
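
For reference, and hedged because the names and locations move around 
between versions: in 8.x-era sources the first three compile-time sizes 
listed above correspond roughly to constants like these (edited in 
src/include/pg_config_manual.h before building; treat this as an 
illustration of where such knobs live, not a recipe):

    /* Roughly the relevant compile-time constants (8.x-era defaults). */
    #define BLCKSZ          8192                  /* disk block size: 8K      */
    #define RELSEG_SIZE     (0x40000000 / BLCKSZ) /* blocks per 1GB data file */
    #define XLOG_SEG_SIZE   (16 * 1024 * 1024)    /* WAL segment size: 16MB   */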


But if you really think hardware & OS  is the answer for all small 
things.. I think we should now start to look on how to make Postgres 
Multi-threaded or multi-processed for each connection. With the influx 
of  "Dual-Core" or "Multi-Core" being the fad Postgres can have the 
cutting edge if somehow exploiting cores is designed.


Somebody mentioned that adding CPU to Postgres workload halved the 
average CPU  usage...
YEAH... PostgreSQL  uses only 1 CPU per connection (assuming 100% 
usage)  so if you add another CPU it is idle anyway and the system will 
report only 50%  :-) BUT the important thing to measure is.. whether the query 
time was cut down or not? ( No flames I am sure you were talking about 
multi-connection multi-user environment :-) ) But my point is then whether this 
approach is worth the ROI and the time and effort spent to solve this 
problem.


I actually vote for a multi-threaded solution for each connection while 
still maintaining separate processes for each connection... This way the 
fundamental architecture of Postgres doesn't change, however a 
multi-threaded connection can then start to exploit different cores.. 
(Maybe have tunables for number of threads to read data files who 
knows.. If somebody is interested in actually working on a design .. 
contact me and I will be interested in assisting this work.)


Regards,
Jignesh


Jim C. Nasby wrote:


On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:

[EMAIL PROTECTED] (Jignesh Shah) writes:

Does that include increasing the size of read/write blocks? I've
noticed that with a large enough table it takes a while to do a
sequential scan, even if it's cached; I wonder if the fact that it
takes a million read(2) calls to get through an 8G table is part of
that.

Actually some of that readaheads,etc the OS does already if it does
some sort of throttling/clubbing of reads/writes. But its not enough
for such types of workloads.

Here is what I think will help:

* Support for different Blocksize TABLESPACE without recompiling the
code.. (At least support for a different Blocksize for the whole
database without recompiling the code)

* Support for bigger sizes of WAL files instead of 16MB files
WITHOUT recompiling the code.. Should be a tuneable if you ask me
(with checkpoint_segments at 256.. you have too many 16MB files in
the log directory) (This will help OLTP benchmarks more since now
they don't spend time rotating log files)

* Introduce a multiblock or extent tunable variable where you can
define a multiple of 8K (or BlockSize tuneable) to read a bigger
chunk and store it in the bufferpool.. (Maybe writes too) (Most
devices now support upto 1MB chunks for reads and writes)

*There should be a way to preallocate files for TABLES in
TABLESPACES otherwise with multiple table writes in the same
filesystem ends with fragmented files which causes poor "READS" from
the files.

* With 64bit 1GB file chunks is also moot.. Maybe it should be
tuneable too like 100GB without recompiling the code.

Why recompiling is bad? Most companies that will support Postgres
will support their own binaries and they won't prefer different
versions of binaries for different blocksizes, different WAL file
sizes, etc... and hence more function using the same set of binaries
is more desirable in enterprise environments

Every single one of these still begs the question of whether the
changes will have a *material* impact on performance.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Read/Write block sizes (Was: Caching by Postgres)

2005-08-23 Thread Jeffrey W. Baker
On Tue, 2005-08-23 at 19:12 -0400, Michael Stone wrote:
> On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:
> >Actually some of that readaheads,etc  the OS does  already if it does
> >some sort of throttling/clubbing of reads/writes.
> 
> Note that I specified the fully cached case--even with the workload in
> RAM the system still has to process a heck of a lot of read calls.
> 
> >* Introduce a multiblock or extent tunable variable where you can
> >define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk
> >and store it in the bufferpool.. (Maybe writes too) (Most devices now
> >support upto 1MB chunks for reads and writes)
> 
> Yeah. The problem with relying on OS readahead is that the OS doesn't
> know whether you're doing a sequential scan or an index scan; if you
> have the OS agressively readahead you'll kill your seek performance.
> OTOH, if you don't do readaheads you'll kill your sequential scan
> performance. At the app level you know which makes sense for each
> operation.

This is why we have MADVISE_RANDOM and MADVISE_SEQUENTIAL.
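
For illustration only (the actual flag names are MADV_SEQUENTIAL and
MADV_RANDOM with madvise(2), or POSIX_MADV_* with posix_madvise(); the
mmap'd-file setup below is invented, not how Postgres reads relations),
here is roughly what choosing a readahead policy per access pattern
looks like at the syscall level:

    /* Illustrative only: per-pattern readahead advice on an mmap'd file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[1]); return 1; }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* A sequential scan wants aggressive readahead ...                */
        madvise(p, st.st_size, MADV_SEQUENTIAL);
        /* ... index-style probing would instead use MADV_RANDOM to turn
         * readahead down and avoid useless speculative reads between seeks. */

        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += (unsigned char) p[i];   /* stand-in for real work */
        printf("touched %ld bytes, checksum %ld\n", (long) st.st_size, sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }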

-jwb

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Steve Poe
Chris,

Unless I am wrong, you're making the assumption that the amount of time spent
and ROI is known. Maybe those who've been down this path know how to get
that additional 2-4% in 30 minutes or less? 

While each person and business' performance gains (or not) could vary,
someone spending the 50-100h to gain 2-4% over a course of a month for a
24x7 operation would seem worth the investment?  

I would assume that dbt2 with STP helps minimize the amount of hours
someone has to invest to determine performance gains with configurable
options?   

Steve Poe

> If someone spends 100h working on one of these items, and gets a 2%
> performance improvement, that's almost certain to be less desirable
> than spending 50h on something else that gets a 4% improvement.
> 
> And we might discover that memory management improvements in Linux
> 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
> improvements "for free" behind our backs without *any* need to write
> database code.  :-)


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Jim C. Nasby
On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
> [EMAIL PROTECTED] (Jignesh Shah) writes:
> >> Does that include increasing the size of read/write blocks? I've
> >> noticed that with a large enough table it takes a while to do a
> >> sequential scan, even if it's cached; I wonder if the fact that it
> >> takes a million read(2) calls to get through an 8G table is part of
> >> that.
> >
> > Actually some of that readaheads,etc the OS does already if it does
> > some sort of throttling/clubbing of reads/writes. But its not enough
> > for such types of workloads.
> >
> > Here is what I think will help:
> >
> > * Support for different Blocksize TABLESPACE without recompiling the
> > code.. (Atlease support for a different Blocksize for the whole
> > database without recompiling the code)
> >
> > * Support for bigger sizes of WAL files instead of 16MB files
> > WITHOUT recompiling the code.. Should be a tuneable if you ask me
> > (with checkpoint_segments at 256.. you have too many 16MB files in
> > the log directory) (This will help OLTP benchmarks more since now
> > they don't spend time rotating log files)
> >
> > * Introduce a multiblock or extent tunable variable where you can
> > define a multiple of 8K (or BlockSize tuneable) to read a bigger
> > chunk and store it in the bufferpool.. (Maybe writes too) (Most
> > devices now support upto 1MB chunks for reads and writes)
> >
> > *There should be a way to preallocate files for TABLES in
> > TABLESPACES otherwise with multiple table writes in the same
> > filesystem ends with fragmented files which causes poor "READS" from
> > the files.
> >
> > * With 64bit 1GB file chunks is also moot.. Maybe it should be
> > tuneable too like 100GB without recompiling the code.
> >
> > Why recompiling is bad? Most companies that will support Postgres
> > will support their own binaries and they won't prefer different
> > versions of binaries for different blocksizes, different WAL file
> > sizes, etc... and hence more function using the same set of binaries
> > is more desirable in enterprise environments
> 
> Every single one of these still begs the question of whether the
> changes will have a *material* impact on performance.

How many of these things are currently easy to change with a recompile?
I should be able to start testing some of these ideas in the near
future, if they only require minor code or configure changes.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software    http://pervasive.com    512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Michael Stone

On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:

> What we have been finding, as RAID controllers get smarter, is that it
> is getting increasingly futile to try to attach knobs to 'disk stuff;'
> it is *way* more effective to add a few more spindles to an array than
> it is to fiddle with which disks are to be allocated to what database
> 'objects.'


That statement doesn't say anything about trying to maximize performance
to or from a disk array. Yes, controllers are getting smarter--but they
aren't omnicient. IME an I/O bound sequential table scan doesn't get
data moving off the disk nearly as fast as say, a dd with a big ibs.
Why? There's obviously a lot of factors at work, but one of those
factors is that the raid controller can optimize "grab this meg" a lot
more than it can optimize "grab this 8k". 


Mike Stone

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Read/Write block sizes (Was: Caching by Postgres)

2005-08-23 Thread Michael Stone

On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:

> Actually some of that readaheads,etc  the OS does  already if it does
> some sort of throttling/clubbing of reads/writes.

Note that I specified the fully cached case--even with the workload in
RAM the system still has to process a heck of a lot of read calls.

> * Introduce a multiblock or extent tunable variable where you can
> define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk
> and store it in the bufferpool.. (Maybe writes too) (Most devices now
> support upto 1MB chunks for reads and writes)


Yeah. The problem with relying on OS readahead is that the OS doesn't
know whether you're doing a sequential scan or an index scan; if you
have the OS aggressively read ahead you'll kill your seek performance.
OTOH, if you don't do readaheads you'll kill your sequential scan
performance. At the app level you know which makes sense for each
operation.

Mike Stone

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Chris Browne
[EMAIL PROTECTED] (Jignesh Shah) writes:
>> Does that include increasing the size of read/write blocks? I've
>> noticed that with a large enough table it takes a while to do a
>> sequential scan, even if it's cached; I wonder if the fact that it
>> takes a million read(2) calls to get through an 8G table is part of
>> that.
>
> Actually some of that readaheads,etc the OS does already if it does
> some sort of throttling/clubbing of reads/writes. But its not enough
> for such types of workloads.
>
> Here is what I think will help:
>
> * Support for different Blocksize TABLESPACE without recompiling the
> code.. (Atlease support for a different Blocksize for the whole
> database without recompiling the code)
>
> * Support for bigger sizes of WAL files instead of 16MB files
> WITHOUT recompiling the code.. Should be a tuneable if you ask me
> (with checkpoint_segments at 256.. you have too many 16MB files in
> the log directory) (This will help OLTP benchmarks more since now
> they don't spend time rotating log files)
>
> * Introduce a multiblock or extent tunable variable where you can
> define a multiple of 8K (or BlockSize tuneable) to read a bigger
> chunk and store it in the bufferpool.. (Maybe writes too) (Most
> devices now support upto 1MB chunks for reads and writes)
>
> *There should be a way to preallocate files for TABLES in
> TABLESPACES otherwise with multiple table writes in the same
> filesystem ends with fragmented files which causes poor "READS" from
> the files.
>
> * With 64bit 1GB file chunks is also moot.. Maybe it should be
> tuneable too like 100GB without recompiling the code.
>
> Why recompiling is bad? Most companies that will support Postgres
> will support their own binaries and they won't prefer different
> versions of binaries for different blocksizes, different WAL file
> sizes, etc... and hence more function using the same set of binaries
> is more desirable in enterprise environments

Every single one of these still begs the question of whether the
changes will have a *material* impact on performance.

What we have been finding, as RAID controllers get smarter, is that it
is getting increasingly futile to try to attach knobs to 'disk stuff;'
it is *way* more effective to add a few more spindles to an array than
it is to fiddle with which disks are to be allocated to what database
'objects.'

The above suggested 'knobs' are all going to add to complexity and it
is NOT evident that any of them will forcibly help.

I could be wrong; code contributions combined with Actual Benchmarking
would be the actual proof of the merits of the ideas.

But it also suggests another question, namely...

  Will these represent more worthwhile improvements to speed than
  working on other optimizations that are on the TODO list?

If someone spends 100h working on one of these items, and gets a 2%
performance improvement, that's almost certain to be less desirable
than spending 50h on something else that gets a 4% improvement.

And we might discover that memory management improvements in Linux
2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
improvements "for free" behind our backs without *any* need to write
database code.  :-)
-- 
let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/postgresql.html
Wiener's Law of Libraries:
There are no answers, only cross references.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Read/Write block sizes (Was: Caching by Postgres)

2005-08-23 Thread Jignesh Shah
> Does that include increasing the size of read/write blocks? I've 
> noticed that with a large enough table it takes a while to do a 
> sequential scan,
> even if it's cached; I wonder if the fact that it takes a million
> read(2) calls to get through an 8G table is part of that.
> 


Actually the OS already does some of that readahead etc., if it does some sort 
of throttling/clubbing of reads/writes. But it's not enough for such types of 
workloads.

Here is what I think will help:

* Support for different Blocksize TABLESPACE without recompiling the code.. 
(At least support for a different Blocksize for the whole database without 
recompiling the code)

* Support for bigger sizes of WAL files instead of 16MB files WITHOUT 
recompiling the code.. Should be a tuneable if you ask me (with 
checkpoint_segments at 256.. you have too many 16MB files in the log directory) 
(This will help OLTP benchmarks more since now they don't spend time rotating 
log files)

* Introduce a multiblock or extent tunable variable where you can define a 
multiple of 8K (or BlockSize tuneable) to read a bigger chunk and store it in 
the bufferpool.. (Maybe writes too) (Most devices now support up to 1MB chunks 
for reads and writes; see the sketch after this list)

*There should be a way to preallocate files for TABLES in TABLESPACES otherwise 
with multiple table writes in the same filesystem ends with fragmented files 
which causes poor "READS" from the files. 

* With 64bit 1GB file chunks is also moot.. Maybe it should be tuneable too 
like 100GB without recompiling the code.


Why is recompiling bad? Most companies that will support Postgres will support 
their own binaries, and they won't prefer different versions of binaries for 
different blocksizes, different WAL file sizes, etc... and hence more functionality 
using the same set of binaries is more desirable in enterprise environments.


Regards,
Jignesh



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match