Re: [HACKERS] Brain dump: btree collapsing

2003-02-18 Thread Curtis Faith
 tom lane wrote:
  Hm.  A single lock that must be grabbed for operations anywhere in 
  the index is a concurrency bottleneck.

I replied a bit later: 
 I don't think it would be that bad.  It's not a lock but a
 mutex that would only be held for relatively brief duration. 
 It looks like tens or a few hundred instructions at most 
 covered by the mutex and only during scans that cross page boundaries.
 
 Unless you were talking about a 16 to 64 way SMP box, I don't
 think there would be much actual contention since there would 
 have to be contention for the same index at the same time and 
 depending on the number of keys per page, very little of the 
 run time is spent on the lock transition for traversal across 
 pages as compared to all the other stuff that would be going on.


tom also presciently wrote:
 But maybe we could avoid that

After sleeping on it a bit, it occurs to me that you are again correct;
the mutex is not required to do what I suggested. It is simply a matter
of ensuring that, during the page traversal, the lock is either obtained
or the process is added to the waiters queue before the current page lock
is released. This ensures that any process trying to obtain an
exclusive or superexclusive lock will wait or fail an immediate lock
acquisition.


traversal pseudo code:

    if the next page lock can be obtained immediately
        grab it
        release the current page lock
    else (we would have to wait)
        add the current process to the waiters queue
        release the current page lock
        sleep on the lock
    end if

end traversal pseudo code
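
To make the ordering concrete, here is a rough C sketch of what the
step-right path in _bt_step() might look like. TryLockBufferShared(),
LockBufferQueueSelf(), and LockBufferSleepQueued() are hypothetical
primitives (a no-wait lock attempt and a queue-then-sleep pair), not
existing PostgreSQL calls; only the ordering of the steps matters here.

/*
 * Hypothetical sketch of the traversal above.  The three buffer-lock
 * primitives are assumed, not real; the point is that we are on the
 * next page's waiters queue before the current page lock is dropped.
 */
static void
_bt_step_right_sketch(Relation rel, Buffer *bufP, BlockNumber blkno)
{
    Buffer  nextbuf = ReadBuffer(rel, blkno);       /* pin the next page */

    if (TryLockBufferShared(nextbuf))               /* try read lock, no wait */
    {
        _bt_relbuf(rel, *bufP);                     /* release current page */
    }
    else
    {
        LockBufferQueueSelf(nextbuf);               /* join waiters queue first */
        _bt_relbuf(rel, *bufP);                     /* then release current page */
        LockBufferSleepQueued(nextbuf, BT_READ);    /* finally sleep on the lock */
    }
    *bufP = nextbuf;
}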


Like the current release-before-acquire method, this won't
cause deadlocks with other scanners going in the opposite direction or
with concurrently inserting processes, since the current page lock is
released before sleeping.

If this change is made, then my original suggestion (that the merge
process attempt a non-deadlocking acquisition of all the locks, dropping
all of them and retrying if any of the attempts would have caused a
wait) should work fine.

If the traversal code is changed as above, a merging VACUUM process
is guaranteed to run into the lock held by the scanning process on
either the current page or the next page.

This has the potential side benefit of permitting the simplification of
the scan code in the case where the lock is obtained immediately. Since
the scanning process would know that no inserting process has added
pages to the tree because of a split, it wouldn't have to check for
newly inserted pages to the right of the next page. I don't know how
much work is involved in this check, but if it is significant it could
be eliminated in this case, which is the most likely case in many
scenarios.

What do you think?

- Curtis







Re: [HACKERS] Brain dump: btree collapsing

2003-02-14 Thread Curtis Faith
tom lane wrote:

 How does that help?  The left-moving indexscan still has no 
 way to recover.  It can't go back to the page it was on 
 before and try to determine which entries you added there, 
 because it has no reliable reference point to do so.  The 
 entry it previously returned might not be there anymore, and 
 in a non-unique index key comparisons won't help. And even 
 more to the point, how would it know you've changed the left 
 page?  It has no idea what the previous page version on the 
 left page was, because it was never there before.

Another way this could be handled is by merging not onto one of the
existing pages but onto a brand-new page, a kind of special-case shadow
index page. That way the sibling pointers and the leaf-page pointer in
the parent could all be updated atomically to point to the new page.

In-process index scans would still reference the merged pages, which
would not be deleted but marked as dead, using a mechanism like the one
you proposed for marking empty pages dead with the next-transaction-ID
counter.

Merging could be done after a VACUUM pass that performs deletion of
empty pages in order to provide a pool of empty pages to use for the
merge. This would keep the index from temporarily growing during the
merge process.


A similar approach could be used to reorder the index pages in the
background. An index that was reordered to fairly closely reflect a
breadth-first traversal of the tree would provide much faster index
scans if the file is not heavily fragmented.

- Curtis






Re: [HACKERS] Brain dump: btree collapsing

2003-02-14 Thread Curtis Faith
tom lane wrote:
 Hmmm ... that might be made to work, but it would complicate 
 inserts. By the time an insert navigates to the page it 
 should insert on, it might find the page is dead, and then it 
 would have no easy way to get to the replacement page (unless 
 you want to dedicate another link field in every index page 
 for that one use).  I suppose it could recover by starting 
 the search over again.

Inserts could reread just the parent page if they encountered a dead
leaf since the parent would have been correctly updated.

 Another problem is that the two dead pages are now outside of 
 the btree structure and so their left and right links won't 
 get updated in the face of subsequent splits and merges of 
 the nearby pages.

That seems like a show stopper that just defers the problem. A split
of the left sibling would still screw up a scan that was using the old
left leaf page and wanted to move left.

Oh, well, the idea seemed worth exploring.

- Curtis







Re: [HACKERS] Brain dump: btree collapsing

2003-02-14 Thread Curtis Faith
I previously wrote:
 5) A mutex/spinlock that was stored in the index could be 
 acquired by the scan code like this:
 
   buf = ReadBuffer(rel, blkno);             /* pin next page */

   SpinLockAcquire( indexSpecificMutex );    /* lock the index reorg mutex */

   LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);    /* release lock on current page */
   LockBuffer(buf, BT_READ);                 /* lock next page */

   SpinLockRelease( indexSpecificMutex );    /* unlock the index reorg mutex */

   ReleaseBuffer(*bufP);                     /* now release pin on previously current page */
 
 6) The same index specific mutex/spinlock could be used for 
 the merge code surrounding only the acquisition of the four 
 page locks. This would obviate any problems with scans and 
 page merges, since the lock acquisition for the merge could 
 never occur while a scan was between pages.
 
 Further, with the reordering, the spinlock for the scan code 
 doesn't seem particularly onerous since it would be 
 surrounding only two LWLock calls.  To reduce the overhead to 
 an absolute minimum for the scan case these could be pushed 
 down into a new IW call (probably necessary since the 
 LockBuffer, ReleaseBuffer code does some error checking and 
 such that one wouldn't want in code guarded by a mutex).

I forgot to mention that the mutex would have to be released, in the
event the next page lock could not be immediately acquired, just after
the scan process is added to the lock waiters list, to avoid blocking
all scans and probably causing severe deadlock problems.

- Curtis






Re: [HACKERS] Brain dump: btree collapsing

2003-02-14 Thread Curtis Faith
tom lane wrote:
 How does that help?  The left-moving indexscan still has no 
 way to recover.  It can't go back to the page it was on 
 before and try to determine which entries you added there, 
 because it has no reliable reference point to do so.  The 
 entry it previously returned might not be there anymore, and 
 in a non-unique index key comparisons won't help. And even 
 more to the point, how would it know you've changed the left 
 page?  It has no idea what the previous page version on the 
 left page was, because it was never there before.

Revisiting the idea I proposed in a previous email after more research,
I believe it is definitely possible to implement a mutex based scheme
that will prevent scans from being polluted by merges of index pages and
that does not result in the mutex being held for any significant
duration.

1) Current scan code does this:

bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
    ... definitions go here ...

    if (ScanDirectionIsForward(dir))
    {
        if (!PageIsEmpty(page) && offnum < maxoff)
            offnum = OffsetNumberNext(offnum);
        else
        {
            /* walk right to the next page with data */
            for (;;)
            {
                /* if we're at end of scan, release the buffer and return */
                ... skip code here ...

                /* step right one page */
                blkno = opaque->btpo_next;
                _bt_relbuf(rel, *bufP);
                *bufP = _bt_getbuf(rel, blkno, BT_READ);

... skip rest of code ...

3) Note that the calls

_bt_relbuf(rel, *bufP);
*bufP = _bt_getbuf(rel, blkno, BT_READ);

   a) appear adjacent to each other
   b) relbuf calls:

  LockBuffer(buf, BUFFER_LOCK_UNLOCK);
  ReleaseBuffer(buf);

   c) getbuf only calls:

  buf = ReadBuffer(rel, blkno);
LockBuffer(buf, access);
  
  in the case of an existing buffer, rather than a new one.

4) This could easily be reordered into:

  buf = ReadBuffer(rel, blkno);            /* pin next page */
  LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
  LockBuffer(buf, BT_READ);                /* lock next page */
  ReleaseBuffer(*bufP);                    /* now release pin on previously current page */

   without affecting the logic of the code or causing any deadlock
problems since the release still occurs before the lock of the next
page.

5) A mutex/spinlock that was stored in the index could be acquired by
the scan code like this:

  buf = ReadBuffer(rel, blkno);            /* pin next page */

  SpinLockAcquire( indexSpecificMutex );   /* lock the index reorg mutex */

  LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);   /* release lock on current page */
  LockBuffer(buf, BT_READ);                /* lock next page */

  SpinLockRelease( indexSpecificMutex );   /* unlock the index reorg mutex */

  ReleaseBuffer(*bufP);                    /* now release pin on previously current page */

6) The same index specific mutex/spinlock could be used for the merge
code surrounding only the acquisition of the four page locks. This would
obviate any problems with scans and page merges, since the lock
acquisition for the merge could never occur while a scan was between
pages.
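
For the merge side, a minimal sketch of that idea follows, assuming the
four pages involved are already pinned; the buffer variable names are
illustrative and indexSpecificMutex is the hypothetical per-index
spinlock described above.

  SpinLockAcquire( indexSpecificMutex );   /* lock the index reorg mutex */

  LockBuffer(leftbuf, BT_WRITE);           /* exclusive-lock all four pages: */
  LockBuffer(targetbuf, BT_WRITE);         /* left sibling, page being emptied, */
  LockBuffer(rightbuf, BT_WRITE);          /* right sibling, and parent */
  LockBuffer(parentbuf, BT_WRITE);

  SpinLockRelease( indexSpecificMutex );   /* scans may cross pages again */

  /* ... perform the merge, then release the four locks as usual ... */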

Further, with the reordering, the spinlock for the scan code doesn't
seem particularly onerous since it would be surrounding only two LWLock
calls.  To reduce the overhead to an absolute minimum for the scan case,
these could be pushed down into a new IW call (probably necessary since
the LockBuffer, ReleaseBuffer code does some error checking and such
that one wouldn't want in code guarded by a mutex).

- Curtis






Re: [HACKERS] Brain dump: btree collapsing

2003-02-13 Thread Curtis Faith
tom lane initially wrote:
 Restructuring the tree during page deletion
 ---
 
 We will delete only completely-empty pages.  If we were to
 merge nearly-empty pages by moving data items from one page 
 to an adjacent one, this would imply changing the parent's 
 idea of the bounding key between them --- which is okay if we 
 are just deleting an internal key in the parent, but what if 
 the pages have different parent pages?

and a bit later wrote:
 My feeling is that what we need to fix now is index bloat during 
 normal operation.

How about doing deletion of partially empty pages with reorganization
among sibling pages only (where the parent page is the same)? This
avoids the messiness of propagating the deletes to differing parent
pages but gets most of the value of reorganization.

ISTM that a VACUUM that only reclaims empty pages will be helpful in
certain cases but won't help much at all in many other common
normal-operation cases which would be helped by partial reorganization.

- Curtis






Re: [HACKERS] Brain dump: btree collapsing

2003-02-13 Thread Curtis Faith
tom lane wrote:
 Got any evidence to back that up?  I was relying on
 
 [Johnson89] Johnson, T. and Shasha, D.  Utilization of 
 B-trees with Inserts, Deletes and Modifies  ACM Symp. on 
 PODS, 235-246, 1989.
 
 which provides a ton of math and simulations leading up to 
 the conclusion that collapsing btree pages before they are 
 fully empty doesn't really gain anything meaningful in terms 
 of storage utilization, in most scenarios.

Well, I don't have that specific paper handy but I do have [JS93]
Theodore Johnson and Dennis Shasha, "B-trees with inserts and deletes:
why free-at-empty is better than merge-at-half," which appears to be
their later thinking on the same subject.

Note the following:

A merge-at-half B-tree will always have a space utilization of at least
50%. When all operations are modify operations, or when the number of
insert operations is the same as the number of delete operations, then
the utilization will be about 60%. In contrast, a free-at-empty B-tree
has a 0% lower bound on its space utilization, and will have about 39%
utilization on a pure-modify instruction mix. However, the space
utilization of a free-at-empty B-tree remains high if there are just a
few more insert operations than delete operations. Thus, merge-at-half
usually buys little in terms of space utilization. 

In Figure 6, we showed that the restructuring rate of a merge-at-half
B-tree is significantly larger than the restructuring rate of a
free-at-empty B-tree for all values of q * :1. For many concurrent
B-tree algorithms used in practice [4, 13], restructuring causes a
serialization bottleneck. Thus, one simple but important way to increase
concurrency in B-tree operations is to reduce the probability of
restructuring. Since merge-at-half buys little space utilization but is
expensive in terms of restructuring, we recommend that B-trees
(especially concurrently accessed ones) use free-at-empty.



I don't dispute their conclusions in that context and under the
circumstances they outline of random distribution of deletion and
insertion values for the index keys. 

However, as [Jannink95] ("Implementing Deletion in B+-trees," SIGMOD
RECORD, v.24, n.1, pp. 33-38, 1995) points out, that assumption doesn't
hold under other, possibly common, circumstances, specifically
circumstances where the deletes are taking place in significant sections
of the index at a much higher rate than inserts.

Consider from [Jannink95]:

There has been some research on the acceptability of relaxing the
constraint of minimum node size to reduce the number of so-called unsafe
tree operations, i.e., those which contain node splitting and merging
[ZH89].

The debate has culminated in analysis of a weaker form of the deletion
algorithm which we call lazy deletion, that imposes no constraints on
the number of entries left in the nodes, allowing them to empty
completely before simply removing them. According to [GR93], most
database system implementations of B+-trees have adopted this approach. 
Its most effective use is when it is vital to allow concurrent access to
the tree [JS93b], and excessive splitting and merging of nodes would
restrict concurrency. [JS89] derives some analytic solutions calculating
memory utilization for B+-trees under a mix of insertions and lazy
deletions, based on previous research which considered insertions only
[BY89]. The simulations in [JS89] support its analysis to show that in
typical situations, where deletions don't outnumber insertions in the
mix of operations, the tree nodes will contain acceptable percentages of
entries.

One of the work's assumptions [JS93a] is that the keys and tree
operations are chosen uniformly from a random distribution. This
assumption is unreasonable in certain realistic situations such as one
described below. Allowing interior nodes with only a single pointer to
exist in a B+-tree creates the possibility for arbitrarily unbalanced 
trees of any height, which are virtually empty, and in which access
times have degenerated from the logarithmic bound B+-trees are meant to
guarantee to a worst case unbounded access time. Since nodes are not
removed until they are completely empty, the lazy deletion algorithm
does not regulate tree height effectively.


Jannink then illustrates an example where an index is created on a
timestamp, so the basic assumption of Johnson and Shasha does not hold,
since the keys are not randomly distributed but monotonically
increasing. His example is an extreme one, but I believe there are many
instances where a timestamp, sequence, or some other monotonically
increasing value is used in an index and where deletes are taking place
much more frequently for the older values.

Since sequences are often used as foreign keys, a significant number of
indexes fall into this category.

Consider a web site that tracked users and that deleted inactive
accounts. There are many real-world scenarios where the number of
inactive accounts is very high as a percentage of 

Re: [HACKERS] Brain dump: btree collapsing

2003-02-13 Thread Curtis Faith
tom lane wrote:
 Curtis Faith [EMAIL PROTECTED] writes:
  I don't dispute their conclusions in that context and under the
  circumstances they outline of random distribution of deletion and 
  insertion values for the index keys.  [But the random-distribution 
  assumption doesn't always hold.]
 
 That's a fair point.  Nonetheless, we have little choice: we
 cannot move keys around during concurrent operations.  If we 
 push keys to the right, we may cause an indexscan moving left 
 to miss them, and vice versa.  So we can only eliminate empty pages.
 
 We could possibly allow VACUUM FULL to collapse partly-full
 pages, since in that case we know there are no concurrent scans.

Couldn't we do an exclusive lock call on both leaf pages and the parent
using a new LockBuffer-type function, named something like
LockBufferNoWait, that uses LWLockConditionalAcquire instead of
LWLockAcquire? In the event that all three exclusive locks cannot be
obtained, release all three locks, sleep, and try again for N retries.

(Actually, this would probably be four locks since the sibling pointer
of one of the siblings would have to be updated to point to the new
merged page instead of the to-be-deleted page.)

Having exclusive locks on all of these pages prior to a merge would
ensure that no scans were interrupted during that merge.

Releasing all the exclusive locks in the event of failure to obtain any
of the locks will eliminate the possibility of creating deadlocks.

After N retries, VACUUM could abort, leaving the merge to be picked up
in a future VACUUM run.
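
A rough sketch of that retry loop follows. LockBufferNoWait is the
proposed function, not an existing one, the sleep is just a placeholder,
and the pages are assumed to be pinned already.

static bool
lock_pages_for_merge(Buffer *buffers, int nbuffers, int max_retries)
{
    int     retry;

    for (retry = 0; retry < max_retries; retry++)
    {
        int     got = 0;

        /* hypothetical no-wait variant built on LWLockConditionalAcquire */
        while (got < nbuffers &&
               LockBufferNoWait(buffers[got], BUFFER_LOCK_EXCLUSIVE))
            got++;

        if (got == nbuffers)
            return true;                /* all pages exclusively locked */

        /* back out whatever we did get, sleep briefly, and try again */
        while (--got >= 0)
            LockBuffer(buffers[got], BUFFER_LOCK_UNLOCK);
        /* sleep for a few milliseconds here before the next attempt */
    }
    return false;                       /* give up; a later VACUUM retries */
}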

I would think that this would be fairly easy to implement and that
except for very heavily scanned indexes, most of the time the locks
could be acquired and the merge would take place without causing any
concurrency problems.

- Curtis






Re: [mail] Re: [HACKERS] Windows Build System

2003-01-31 Thread Curtis Faith
Christopher Browne wrote:
 snip
  From the MySQL site's page about MySQL vs PostgreSQL: 
 http://www.mysql.com/doc/en/MySQL-PostgreSQL_features.html
 
 MySQL Server works better on Windows than PostgreSQL does. MySQL 
 Server runs as a native Windows application (a service on 
 NT/2000/XP), while PostgreSQL is run under the Cygwin emulation.
 
 That seems pretty straightforward.
  
  But it's not /nearly/ that straightforward.
  
  If you look at the downloads that MySQL AB provides, they 
 point you to a link that says Windows binaries use the Cygwin
library.
  
  Which apparently means that this feature is not actually 
 a feature.  
  Unlike
  PostgreSQL, which is run under the Cygwin emulation, 
 MySQL runs as a native 
  Windows application (with Cygwin emulation).  Apparently 
 those are not at all 
  the same thing, even though they are both using Cygwin...

Justin Clift replied:
 Hmm... wonder if they're meaning that MySQL compiles and 
 executes as a True native windows application (skipping any unix 
 compatibility calls), and it's just some of the support utils that
 use cygwin, or if they're trying to say that PostgreSQL has to
 operate entirely in the cygwin environment, whereas they don't?

I just downloaded the latest production source (3.23.55) and it appears
to me that:

1) It uses Cygwin emulation via a DLL.

2) It uses Visual Studio C++ 6.0 for the primary build environment. It
compiles out of the box without having to learn Unix-style build
systems, config, make, etc. No warnings, no errors; it just builds out
of the box. If I did not have a lot of experience building databases I
certainly would have found their support for Windows compelling. This is
a big reason why they are #1.

3) The statement by the MySQL folks above that MySQL runs as a native
Windows application (a service on NT/2000/XP) is indicative of why MySQL
is kicking PostgreSQL's butt in terms of popularity. It is marketing
speak at its best. It is technically true: MySQL runs as a service. As
Christopher Browne points out, they still use the Cygwin emulation
layer. The statement is misleading, however, as it implies that they
don't use any emulation, but they do.

The salient points:
   
   a) Running as a service is important, as this is the way NT/2000
administrators manage server tasks. The fact that PostgreSQL's Cygwin
emulation doesn't do this is very indicative of inferior Windows
support.

   b) MySQL recognizes that the important issue is to appear to be a
well supported Windows application rather than to actually be one.

   c) It is probably much easier to add support for running as an NT
service than it is to write a true native port with no Cygwin
dependency. NT service support is basically a single function wrapper
for certain API calls (startup, shutdown, etc.) that enables the Windows
administration tools to deal with all servers in a similar manner.

They have worked on that which makes them look better, makes their
prospective customers happier, and makes it easier to support. Exactly
what any good product development organization that listens to their
customers would have done.

flame on
IMHO, PostgreSQL will never have the same level of use in the field as
MySQL currently does as long as there is the kind of head-in-the-sand
attitude about Windows that I've seen here on the hackers list,
especially as evidenced by the recent outright attacks against those who
are simply trying to port PostgreSQL to the largest platform out there
today.

There have been some very legitimate points about Windows being a new
platform, one that will likely see a lot of users, and therefore one
that should be more thoroughly tested before release than the typical
port to another flavor of *nix. 

However, the way the conversation started reminds me of some of the chat
discussions I've seen between young teens.

I was a Mac developer way, way back and long ago realized that the best
often loses and that better marketing beats better engineering every
single time.
\flame off

DISCLAIMER: I hate Microsoft and Windows drives me nuts.

- Curtis









Re: [mail] Re: [HACKERS] Windows Build System

2003-01-29 Thread Curtis Faith
tom lane wrote:
 flame on
 In all honesty, I do not *want* Windows people to think that
 they're not running on the poor stepchild platform.

We should distinguish between being the poor stepchild from a
client-support perspective and from a production-environment perspective.

What is the downside to supporting development of client products
better? That is what I am really suggesting.

If people are deciding what open-source database server they want to
use, Linux or FreeBSD is the obvious choice for the server OS. The kind
of people who are inclined to use PostgreSQL or MySQL will mostly NOT be
considering Windows servers.


 I have no objection to there being a Windows port that people
 can use to do SQL-client development on their laptops.  But 
 let us please not confuse this with an industrial-strength 
 solution; nor give any level of support that might lead 
 others to make such confusion.

All we can do is make it clear that Windows
is not recommended for production server use and outline
all the reasons why Windows sucks for that purpose.

Beyond that, if people want to shoot themselves in the head, they will
do so and I don't see much point in trying to stop them.


 The MySQL guys made the right choice here: they don't want to
 buy into making Windows a grade-A platform, either. flame off

flame retardant on
How does providing a native Windows executable that doesn't require
Cygwin accomplish your objective? It seems to me that you are going to
have this problem if you release a native version, irrespective of the
issue at hand (Visual C++ project support).

I don't see how making it easier to build adds to this problem.

I also don't see how making it harder for Windows client developers to
adopt PostgreSQL helps anyone. flame retardant off

I hate Microsoft and I don't like Windows, but I am forced to use it
because the software we need to run our business runs only on 
Windows. I use Unix whenever possible and whenever reliability is
required.

- Curtis

P.S. The lack of a real C++ client library that supports the most common
development environment out there is another problem that seriously
impedes Windows client developers.

I like libpqxx; Jeroen did a fine job. However, one needs to
jump through hoops to get it to run on Visual C++ 6.0 at
the moment.






Re: [HACKERS] Windows Build System

2003-01-23 Thread Curtis Faith
Curtis Faith wrote:
  The Visual C++ Workspaces and Projects files are actually 
  text files that have a defined format. I don't think the format is 
  published but it looks pretty easy to figure out.

Hannu Krosing wrote:
 will probably change between releases

Even if the format changes, the environment always has a converter that
updates the project and workspace files to the new format. In other
words, Visual C++ 6.0 reads 5.0 projects, 7.0 reads 6.0, etc.

The format is mostly a bunch of options specifications (which wouldn't
get touched) followed by a set of named groups of source files. Even if
the overall format changes, it will be much more likely to change in the
specifications rather than in the way the lists of source files are
specified.

A conversion script would only need to:

1) Read in the template file (containing all the options specifications
and Visual C++ specific stuff: debug and release target options,
libraries to link in, etc.). This part might change with new versions of
the IDE and would be manually created by someone with Visual C++
experience.

2) Read in the PostgreSQL group/directory map, or alternately just
mirror the groups with the directories.

3) Output the files from the PostgreSQL directories in the appropriate
grouping according to the project format into the appropriate space in
the template.

An excerpt of the format follows:

# Begin Group "Access"
# Begin Group "Common"
# PROP Default_Filter "cpp;c;cxx"
# Begin Source File

SOURCE=.\access\common\heaptuple.c
# End Source File
# Begin Source File

SOURCE=.\access\common\indextuple.c
# End Source File

... other files in access\common go here
# End Group

# Begin Group "Index"

# PROP Default_Filter "cpp;c;cxx"
# Begin Source File

SOURCE=.\access\index\genam.c
# End Source File
# Begin Source File

SOURCE=.\access\index\indexam.c
# End Source File

... other files in access\index go here
# End Group

# End Group


As you can see, this is a really easy format, and the folder/group
mapping with directories is pretty natural and probably the way to go.

Using the approach I outline, it should be possible to have the Unix
make system automatically run the BuildWindowsProjectFile tool whenever
any makefile changes so the Windows projects would stay up to date
without additional work for Unix developers.
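
As a rough illustration of step 3 only (a deliberately simplified,
hypothetical helper, not the actual BuildWindowsProjectFile tool): it
reads one source-file path per line on stdin and emits a single group in
the .dsp format shown above. The real tool would also splice in the
template and nest one group per directory.

#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    const char *group = (argc > 1) ? argv[1] : "Sources";
    char        line[1024];

    printf("# Begin Group \"%s\"\n", group);
    printf("# PROP Default_Filter \"cpp;c;cxx\"\n");

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        line[strcspn(line, "\r\n")] = '\0';     /* strip the newline */
        if (line[0] == '\0')
            continue;                           /* skip blank lines */

        printf("# Begin Source File\n\n");
        printf("SOURCE=%s\n", line);
        printf("# End Source File\n");
    }

    printf("# End Group\n");
    return 0;
}

Fed a list such as .\access\common\heaptuple.c, one path per line, it
reproduces the inner blocks of the excerpt above.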

Hannu Krosing also wrote:
 (also I dont think you can easily compile C source on a
 C# compiler) ;/

I don't think it makes much sense to target a compiler that won't
compile the source; therefore, if what you say is true, we shouldn't
bother with targeting C#. Why does this matter? I didn't propose
targeting Visual C#.NET.

- Curtis







Re: Windows Build System was: [HACKERS] Win32 port patches submitted

2003-01-22 Thread Curtis Faith
tom lane writes:
 You think we should drive away our existing unix developers 
 in the mere hope of attracting windows developers?  Sorry, it 
 isn't going to happen.

Tom brings up a good point, that changes to support Windows should not
add to the tasks of those who are doing the bulk of the work on Unixen.

I don't think, however, that this necessarily means that having Windows
developers use Cygwin is the right solution. We need to come up with a
way to support Windows Visual C++ projects without adding work to the
other developers. 

I believe this is possible and have outlined some ways at the end, but
first some rationale:

One of the biggest benefits to Open Source projects is the ability to
get in there and debug/fix problems using the source. PostgreSQL will
lose out to lesser DBs if there is no easy way to build and DEBUG the
source on Windows. This is true whether one admits that Windows sucks or
not.

A developer faced with the decision of choosing:

A) a system that has a native Windows Visual C++ project that runs and
compiles the release with no work.

B) a system that requires learning a new way of building, new tools, new
who knows what else.

will always choose A unless there is a very compelling reason to choose
B. There are plenty of reasons a clever (or even not so clever) Windows
developer can use to justify using MySQL or another open source DB
instead of PostgreSQL. It is a bit harder for the neophyte to find and
believe the compelling reasons to use PostgreSQL. We need to make it
easier to choose PostgreSQL not harder.

Think about it from this perspective: how many of you would even think
about working on a project if it required that you stop using your
favorite toolset: gmake? Emacs? grep? shell scripts?

Professional developers spend years honing their skills. Learning the
proper use of the tools involved is a very big part of the process.

IMHO, having a native port without native (read: Visual C++) project
support is a huge missed opportunity.

Further, lack of Windows project files implies that PostgreSQL is just
a Unix port and that the Windows support is immature, whether the code
is well supported or not.

POSSIBLE SOLUTIONS:

The Visual C++ Workspaces and Projects files are actually text files
that have a defined format. I don't think the format is published but it
looks pretty easy to figure out.

The Visual C++ environment does not require dependency specification;
it builds dependency trees by keeping track of the #include files used
during preprocessing.

Because of this, it should be possible to:

A) Write a script/tool that reads the input files from Unix makefiles to
build a list of the files in PostgreSQL and place them in appropriate
projects.

or alternately:

B) A script/tool that recurses the directories and does the same sort of
thing. There could be some sort of mapping between directories and
projects in Visual C++.

In short, for most organizations being able to easily build using the
source is a prerequisite for USING an open source database, not just for
being part of the DEVELOPMENT effort.

-Curtis

P.S. I speak from personal experience: I would have been able to help
out a lot more if I didn't have to spend 90% of my time working with
PostgreSQL learning (or relearning) Unix and GNU tool issues. I don't
personally mind so much because I wanted to learn it better anyway, but
it has definitely limited my ability to help so far. This is especially
true since I don't have the opportunity to immerse myself in
Unix/PostgreSQL for days at a time, and so I suffer from significant
switching costs.








Re: [HACKERS] Windows Build System

2003-01-22 Thread Curtis Faith
I (Curtis Faith) previously wrote:
  The Visual C++ Workspaces and Projects files are actually
  text files that have a defined format. I don't think the format is 
  published but it looks pretty easy to figure out.

Hannu Krosing replied:
 will probably change between releases

Even if the format changes, the environment always has a converter that
updates the project and workspace files to the new format. In other
words, Visual C++ 6.0 reads 5.0 projects, 7.0 reads 6.0, etc.

The format is mostly a bunch of options specifications (which wouldn't
get touched) followed by a set of named groups of source files. Even if
the overall format changes, it will be much more likely to change in the
specifications rather than in the way the lists of source files are
specified.

A conversion tool, call it BuildWindowsProjectFile, would only need to:

1) Read in the template file (containing all the options specifications
and Visual C++ specific stuff: debug and release target options,
libraries to link in, etc.). This part might change with new versions of
the IDE and would be manually created by someone with Visual C++
experience.

2) Read in the PostgreSQL group/directory map, or alternately just
mirror the groups with the directories.

3) Output the files from the PostgreSQL directories in the appropriate
grouping according to the project format into the appropriate space in
the template.

An excerpt of the format follows:

# Begin Group "Access"
# Begin Group "Common"
# PROP Default_Filter "cpp;c;cxx"
# Begin Source File

SOURCE=.\access\common\heaptuple.c
# End Source File
# Begin Source File

SOURCE=.\access\common\indextuple.c
# End Source File

... other files in access\common go here
# End Group

# Begin Group "Index"

# PROP Default_Filter "cpp;c;cxx"
# Begin Source File

SOURCE=.\access\index\genam.c
# End Source File
# Begin Source File

SOURCE=.\access\index\indexam.c
# End Source File

... other files in access\index go here
# End Group

# End Group


As you can see, this is a really simple format, and the direct
folder/group mapping to the PostgreSQL directories is pretty natural and
probably the way to go.

Using the approach I outline, it should be possible to have the Unix
make system automatically run the BuildWindowsProjectFile tool whenever
any makefile changes so the Windows projects would stay up to date
without additional work for Unix developers.

Hannu Krosing also wrote:
 (also I dont think you can easily compile C source on a
 C# compiler) ;/

I don't think it makes much sense to target a compiler that won't
compile the source; therefore, if what you say is true, we shouldn't
bother with targeting C#.

- Curtis






Re: [HACKERS] Windows Build System

2003-01-22 Thread Curtis Faith
Hannu Krosing asked:
 Does anyone know how MySQL and interbase/firebird do it ?
 

From the MySQL web site for version 4.0:

"The Windows binaries use the Cygwin library. Source code for the
version of Cygwin we have used is available on this page."

I think this offers a very big opportunity to differentiate. If we had
project support it would make PostgreSQL a more natural choice for
Windows developers.

Even in a shop that uses Unix servers with Windows front-ends, having a
database server that runs on your own machine makes it much easier to
debug, write code on airplanes, etc.

- Curtis








[PERFORM] [HACKERS] Realtime VACUUM, was: performance of insert/delete/update

2002-11-26 Thread Curtis Faith
tom lane wrote:
 Sure, it's just shuffling the housekeeping work from one place to
 another.  The thing that I like about Postgres' approach is that we
 put the housekeeping in a background task (VACUUM) rather than in the
 critical path of foreground transaction commit.

Thinking with my marketing hat on, MVCC would be a much bigger win if VACUUM
were not required (or were done automagically). The need for periodic VACUUM
just gives ammunition to the PostgreSQL opponents, who can claim we are merely
deferring work and that it amounts to the same thing.

A fully automatic background VACUUM will significantly reduce but will not
eliminate this perceived weakness.

However, it always seemed to me there should be some way to reuse the space
more dynamically and quickly than a background VACUUM, thereby reducing the
percentage of tuples that are expired in heavy-update cases. If only a very
tiny number of tuples on disk are expired, this would reduce the aggregate
performance/space penalty of MVCC to insignificance for the majority of
uses.

Couldn't we reuse tuple and index space as soon as there are no transactions
that depend on the old tuple or index values? I have imagined that this was
always part of the long-term master plan.

Couldn't we keep a list of dead tuples in shared memory and look in the list
first when deciding where to place new values for inserts or updates, so we
don't have to rely on VACUUM (even a background one)? If there are expired
tuple slots in the list, these would be used before allocating a new slot from
the tuple heap.
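
A rough sketch of what such a shared structure might look like, using
PostgreSQL's type names purely for illustration (this is not existing
code, and the layout is only a guess at the idea):

typedef struct DeadSlot
{
    Oid             relid;      /* relation the dead slot belongs to */
    BlockNumber     blkno;      /* heap page holding the dead tuple */
    OffsetNumber    offnum;     /* line pointer within that page */
    TransactionId   xmax;       /* transaction that deleted the tuple */
} DeadSlot;

typedef struct DeadSlotList
{
    slock_t         mutex;      /* protects the fields below */
    int             nslots;     /* number of valid entries */
    DeadSlot        slots[1];   /* variable-length array in shared memory */
} DeadSlotList;

/*
 * An insert would scan the list before extending the heap and reuse
 * (blkno, offnum) once TransactionIdPrecedes(slot->xmax, oldest_active_xid)
 * holds, i.e., no running transaction can still see the old tuple.
 */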

The only issue is determining the lowest transaction ID for in-process
transactions which seems relatively easy to do (if it's not already done
somewhere).

In the normal shutdown and startup case, a tuple VACUUM could be performed
automatically. This would normally be very fast since there would not be many
tuples in the list.

Index slots would be handled differently since these cannot be substituted
one for another. However, they could be recovered as part of every index
page update. Pages would be scanned before being written, and any expired
slots with transaction IDs lower than the lowest active transaction ID would
be removed. This could be done for non-leaf pages as well and would result in
only reorganizing a page that is already going to be written, thereby not
adding much to the overall work.

I don't think it will cause a problem for internal pages to contain pointers
to values that, because of this partial expired-entry elimination, are no
longer present in the leaf nodes further down the tree, since searches and
scans will still work fine.

Does VACUUM do something that could not be handled in this realtime manner?

- Curtis






[HACKERS] 500 tpsQL + WAL log implementation

2002-11-15 Thread Curtis Faith
I have been experimenting with empirical tests of file system and device
level writes to determine the actual constraints in order to speed up the WAL
logging code.

Using a raw file partition and a time-based technique for determining the
optimal write position, I am able to get 8K writes physically written to disk
synchronously in the range of 500 to 650 writes per second using FreeBSD raw
device partitions on IDE disks (with write cache disabled). I will be testing
it soon under Linux with 10,000 RPM SCSI disks, which should be even better.
It is my belief that the mechanism used to achieve these speeds could be
incorporated into the existing WAL logging code as an abstraction that looks
to the WAL code just like the file-level access currently used. The current
speeds are limited by the speed of a single disk rotation: for a 7,200 RPM
disk this is 120 per second; for a 10,000 RPM disk it is 166.66 per second.
The mechanism works by adjusting the seek offset of the write by using
gettimeofday to determine approximately where the disk head is in its
rotation. The mechanism does not use any AIO calls.

Assuming the following:

1) Disk rotation time is 8.333ms or 8333us (7200 RPM).

2) A write at offset 1,500K completes at system time 103s 000ms 000us

3) A new write is requested at system time 103s 004ms 166us

4) A 390K per rotation alignment of the data on the disk.

5) A write must be sent at least 20K ahead of the current head position to
ensure that it is written in less than one rotation.

It can be determined from the above that a write for an offset of something
slightly more than 195K past the last write (i.e., a bit beyond offset 1,695K,
allowing for the 20K lead) will be ahead of the current location of the head
and will therefore complete in less than a single rotation's time.
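
A toy illustration of the arithmetic, with the example's numbers plugged
in (this is not the proposed implementation; in practice the elapsed time
would come from gettimeofday() and the metrics from the tester program
described below):

#include <stdio.h>

#define ROTATION_US     8333LL              /* 7,200 RPM */
#define BYTES_PER_ROT   (390LL * 1024)      /* data passed per rotation */
#define LEAD_BYTES      (20LL * 1024)       /* minimum lead ahead of the head */

static long long
next_write_offset(long long last_offset, long long us_since_last_write)
{
    /* how far the head has turned since the last write completed */
    long long head_advance = (us_since_last_write % ROTATION_US)
                             * BYTES_PER_ROT / ROTATION_US;

    /* aim far enough past the head that the write lands this rotation */
    return last_offset + head_advance + LEAD_BYTES;
}

int
main(void)
{
    /* the example: last write at offset 1,500K, new request 4,166us later */
    long long target = next_write_offset(1500LL * 1024, 4166LL);

    printf("write at or beyond offset %lldK\n", target / 1024);
    return 0;
}

With these numbers the head has advanced about 195K, so the function
returns an offset of roughly 1,715K (195K of head advance plus the 20K
lead beyond 1,500K).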

The disk specific metrics (rotation speed, bytes per rotation, base write
time, etc.) can be derived empirically through a tester program that would
take a few minutes to run and which could be run at log setup time.

The obvious problem with the above mechanism is that the WAL system needs to
be able to read the log back in transaction order during recovery. This could
be provided for using an abstraction that prepends the logical order to each
block written to the disk and makes sure that the log blocks contain either a
valid logical order number or some other marker indicating that the block is
not in use.

A bitmap of blocks that have already been used would be kept in memory for
quickly determining the next set of possible unused blocks, but this bitmap
would not need to be written to disk except during normal shutdown, since in
the event of a failure the bitmaps would be reconstructed by reading all the
blocks from the disk.
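
A minimal sketch of such a bitmap (plain C, nothing PostgreSQL-specific;
the partition size and the block-header check are placeholders):

#include <string.h>

#define LOG_BLOCKS  65536                       /* blocks in the raw partition */

static unsigned char used_map[LOG_BLOCKS / 8];  /* one bit per log block */

static void mark_used(long blk) { used_map[blk / 8] |= (unsigned char) (1 << (blk % 8)); }
static int  is_used(long blk)   { return (used_map[blk / 8] >> (blk % 8)) & 1; }

/* after a crash, rebuild the map by scanning every block's header */
static void
rebuild_map(int (*block_has_valid_order_number)(long blk))
{
    long    blk;

    memset(used_map, 0, sizeof(used_map));
    for (blk = 0; blk < LOG_BLOCKS; blk++)
        if (block_has_valid_order_number(blk))
            mark_used(blk);
}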

Checkpointing and something akin to log rotation could be handled using this
mechanism as well.

So, MY REAL QUESTION is whether or not this is the sort of speed improvement
that warrants the work of writing the required abstraction layer and making it
very robust. The WAL code should remain essentially unchanged, with perhaps
new calls replacing the five or six routines used to access the log files and
to handle the equivalent of log rotation for raw device access. These new
calls would use either the current file-based implementation or the new
logging mechanism, depending on the configuration.

I anticipate that the extra work required for a PostgreSQL administrator to
use the proposed logging mechanism would be to:

1) Create a raw device partition of the appropriate size
2) Run the metrics tester for that device partition
3) Set the appropriate configuration parameters to indicate raw WAL logging

I anticipate that the additional space requirements for this system would be
on the order of 10% to 15% beyond the current file-based implementation's
requirements.

So, is this worth doing? Would a robust implementation likely be accepted for
7.4 assuming it can demonstrate speed improvements in the range of 500tps?

- Curtis




















Re: [HACKERS] 500 tpsQL + WAL log implementation

2002-11-12 Thread Curtis Faith
tom lane wrote:
 What can you do *without* using a raw partition?

 I dislike that idea for two reasons: portability and security.  The
 portability disadvantages are obvious.  And in ordinary system setups
 Postgres would have to run as root in order to write on a raw partition.

 It occurs to me that the same technique could be used without any raw
 device access.  Preallocate a large WAL file and apply the method within
 it.  You'll have more noise in the measurements due to greater
 variability in the physical positioning of the blocks --- but it's
 rather illusory to imagine that you know the disk geometry with any
 accuracy anyway.  Modern drives play a lot of games under the hood.

A write to a raw disk file is immediate and completes with minimal system
overhead. I'll try to test a file-based approach using a write followed by an
immediate fdatasync and see if that approaches the speed of the raw
partition access. I suspect we'll get decent performance, only perhaps 10% to
15% slower.

As you mention, there is nothing exact about the technique, so we should be
able to get similar improvements with a file-based system. I've been able to
get over 1,500 raw writes confirmed to disk using raw partition writes, each
slightly offset ahead of the other, yet only somewhere between 500 and 650 on
a sustained basis using the technique I described, because of the noise in
the geometry measurements and variable timing for the writes themselves.


 This scares me quite a bit too.  The reason that the existing
 implementation maxes out at one WAL write per rotation is that for small
 transactions it's having to repeatedly write the same disk sector.  You
 could only get around that by writing multiple versions of the same WAL
 page at different disk locations.  Reliably reconstructing what data to
 use is not something that I'm prepared to accept on a handwave...

I'm pretty sure this could be done very reliably, but at the cost of slightly
slower reads during redo after a failure.

I figured that whenever a transaction wrote to the log it would set the log
offset marker for new transactions to force the next transaction to use a new
block. This would result in wasted space, which could be partially offset by
using writes smaller than the 8K block size (aligned on disk block size
boundaries, 512 bytes for my disk). This has the advantage of making it
fairly easy to make sure that the log can be reconstructed in order, since
there would be no partial block writes to worry about.

I believe that 4 to 8 full rotations' worth of usable blocks could be
maintained, and blocks would be written to the lowest-offset tracks first
unless there were no free blocks of sufficient size. This would probably
result in 90% to 95% utilization of the blocks (disregarding waste inside the
blocks themselves). When the lowest-offset track filled up sufficiently,
another empty track would be added to the usable-blocks list and the
lowest-offset track taken off the unused list.

This would ensure that a read of 4 to 8 tracks, which needs to be a fixed
number for any given installation, could reconstruct the order of the WAL log,
since at no time would blocks be out of order beyond that range.

Disk space is much cheaper than CPU and memory, so I think that a logging
system that used as much as three or four times the space but was three or
four times faster would be a worthwhile improvement for those systems where
update or insert volume is very heavy. Obviously, this needs to be an
option, not the default configuration.

- Curtis





[HACKERS] Prepare enabled pgbench

2002-11-12 Thread Curtis Faith
Tatsuo, are you or anyone else working on adding PREPARE, EXECUTE support to
pgbench?

If not, I can do it myself and if you are interested, I'll send you the
patch.

- Curtis








Re: [HACKERS] Postgresql and multithreading

2002-10-16 Thread Curtis Faith

Bruce Momjian wrote:
 Let me add one more thing on this thread.  This is one email in a long
 list of Oh, gee, you aren't using that wizz-bang new
 sync/thread/aio/raid/raw feature discussion where someone shows up and
 wants to know why.  Does anyone know how to address these, efficiently?

You've brought up good points here. I'm sure that you consider me guilty of
this with regard to my aio discussions. So I might offer a few specific
suggestions.

1) Someone's taking the time to generate a summary of the current thinking
with regard to a particular whiz-bang feature. - I can do this as a first
pass  for aio, if you think it's a good idea.

2) Including the pros and cons of the feature/implementation and how close
the group is to deciding whether something would be worth doing. - I can
also do this.

3) A set of criteria that would need to be met or demonstrated before a
feature would be considered good enough for inclusion into the main code.

If there were a separate section of the Developer's FAQ document that
handled this, it would help.

4) A development/feature philosophy section. Maybe three or four paragraphs
with one specific example: perhaps multi-threading.

5) I'd also suggest changing some specifics in the FAQ Section 1.2, where
there is currently:

* The usual process for source additions is:
*
* Review the TODO list.
* Discuss hackers the desirability of the fix/feature.
* How should it behave in complex circumstances?

Add here that a check should be made against the new section in the FAQ
described above. Also, section 1.1 has:

* Discussion on the patch typically happens here. If the patch adds a
* major feature, it would be a good idea to talk about it first on
* the HACKERS list, in order to increase the chances of it being
* accepted, as well as to avoid duplication of effort.

We should perhaps add here a section describing the phenomenon you describe
regarding whiz-bang features.

I tried to prepare as best I could before bringing anything forward to
HACKERS. In particular, I read the last two years of archives for anything
to do with the WAL log, looked at the current code, read the TODOs, and read
a fair amount of discussion about aio, etc. So I was attempting to comply
with my interpretation of the process. Yet, despite these efforts, you no
doubt consider me guilty of creating unnecessary work, an outcome neither
of us desired.

I'm undeterred in my desire to come up with something meaningful and am
working on some interesting tests. I do, however, now know that the level
of scepticism and the standards of architectural purity are high. I
consider this good all around.

- Curtis





Re: [HACKERS] Analysis of ganged WAL writes

2002-10-08 Thread Curtis Faith

  Since in your case all transactions A-E want the same buffer written,
  the memory (not it's content) will also be the same.

 But no, it won't: the successive writes will ask to write different
 snapshots of the same buffer.

Successive writes would write different NON-OVERLAPPING sections of the
same log buffer. It wouldn't make sense to send three separate copies of
the entire block. That could indeed cause problems.

If a separate log writing process was doing all the writing, it could
pre-gang the writes. However, I'm not sure this is necessary. I'll test the
simpler way first.

  The problem I can see offhand is how the kaio system can tell which
  transaction can be safely notified of the write,

 Yup, exactly.  Whose snapshot made it down to (stable) disk storage?

If you do as above, it can inform the transactions when the blocks get
written to disk since there are no inconsistent writes. If transactions A,
B and C had commits in block 1023, and the aio system writes that block to
the disk, it can notify all three that their transaction write is complete
when that block (or partial block) is written to disk.

If transaction C's write didn't make it into the buffer, I've got to assume
the aio system or the disk cache logic can handle realizing that it didn't
queue that write and therefore not inform transaction C of a completion.

- Curtis





Re: [HACKERS] Analysis of ganged WAL writes

2002-10-08 Thread Curtis Faith

 Curtis Faith [EMAIL PROTECTED] writes:
  Successive writes would write different NON-OVERLAPPING sections of the
  same log buffer. It wouldn't make sense to send three separate
 copies of
  the entire block. That could indeed cause problems.

 So you're going to undo the code's present property that all writes are
 block-sized?  Aren't you worried about incurring page-in reads because
 the kernel can't know that we don't care about data beyond what we've
 written so far in the block?

Yes, I'll try undoing the current behavior.

I'm not really worried about doing page-in reads because the disk's internal
buffers should contain most of the blocks surrounding the end of the log
file. If the successive partial writes exceed a block (which they will in
heavy use) then most of the time this won't be a problem anyway, since the
disk will gang the full blocks before writing.

If the inserts are not coming fast enough to fill the log, then the disk's
cache should contain the data from the last time that block (or the block
before) was written. Disks have become pretty good at this sort of thing,
since writing sequentially is a very common scenario.

It may not work, but one doesn't make significant progress without trying
things that might not work.

If it doesn't work, then I'll make certain that commit log records always
fill the buffer they are written to, using variable-length commit records
and something to identify the size of the padding used to fill the rest of
the block.
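
Something like the following hypothetical layout is what I mean by a
variable-length commit record that exactly fills out the rest of the
block; the structure and function are invented purely to illustrate the
padding idea.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_BLOCK_SIZE 8192

typedef struct PaddedCommit
{
    uint32_t recordLength;          /* bytes of real commit data following */
    uint32_t padLength;             /* zero bytes of padding after the data */
    /* commit payload follows, then padLength bytes of zeros */
} PaddedCommit;

/* Append one commit record that consumes the remainder of the block.
 * Returns the new fill level (always LOG_BLOCK_SIZE on success) or 0 if
 * the record does not fit and a new block must be started. */
size_t
append_padded_commit(char *block, size_t used, const void *data, uint32_t len)
{
    size_t       remaining = LOG_BLOCK_SIZE - used;
    PaddedCommit hdr;

    if (sizeof(hdr) + len > remaining)
        return 0;

    hdr.recordLength = len;
    hdr.padLength = (uint32_t) (remaining - sizeof(hdr) - len);

    memcpy(block + used, &hdr, sizeof(hdr));
    memcpy(block + used + sizeof(hdr), data, len);
    memset(block + used + sizeof(hdr) + len, 0, hdr.padLength);

    return LOG_BLOCK_SIZE;          /* the block is now exactly full */
}

A reader skips padLength bytes after the payload and lands on the next
block boundary, so recovery can still walk the log sequentially.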


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-08 Thread Curtis Faith

 Curtis Faith [EMAIL PROTECTED] writes:
  I'm not really worried about doing page-in reads because the
 disks internal
  buffers should contain most of the blocks surrounding the end
 of the log
  file. If the successive partial writes exceed a block (which
 they will in
  heavy use) then most of the time this won't be a problem
 anyway since the
  disk will gang the full blocks before writing.

 You seem to be willing to make quite a large number of assumptions about
 what the disk hardware will do or not do.  I trust you're going to test
 your results on a wide range of hardware before claiming they have any
 general validity ...

Tom, I'm actually an empiricist. I trust nothing that I haven't tested or
read the code for myself. I've found too many instances of bugs or poor
implementations in the O/S to believe without testing.

On the other hand, one has to make some assumptions in order to devise
useful tests.

I'm not necessarily expecting that I'll come up with something that will
help everyone all the time. I'm just hoping that I can come up with
something that will help those using modern hardware, most of the time.

Even if it works, this will probably become one of those flags that need to
be tested as part of the performance analysis for any given system. Or
perhaps ideally, I'll come up with a LogDiskTester that simulates log
output and determines the best settings to use for a given disk and O/S,
optimized for size or speed, heavy inserts, etc.
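
As a rough idea of what such a LogDiskTester probe could look like, here
is a sketch that times sequential synchronous appends of one write size.
It assumes nothing beyond standard Unix calls, falls back to plain
(non-sync) writes if O_DSYNC is missing, and skips real error reporting.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#ifndef O_DSYNC
#define O_DSYNC 0                   /* platform without O_DSYNC: plain writes */
#endif

int
main(int argc, char **argv)
{
    size_t  writeSize = (argc > 1) ? (size_t) atoi(argv[1]) : 8192;
    int     count = 1000;
    char   *buf = calloc(1, writeSize);
    int     fd = open("logtest.dat",
                      O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0600);
    struct timeval start, end;

    if (fd < 0 || buf == NULL)
        return 1;

    gettimeofday(&start, NULL);
    for (int i = 0; i < count; i++)
        if (write(fd, buf, writeSize) != (ssize_t) writeSize)
            return 1;
    gettimeofday(&end, NULL);

    {
        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_usec - start.tv_usec) / 1e6;

        printf("%d writes of %lu bytes: %.0f writes/sec\n",
               count, (unsigned long) writeSize, count / secs);
    }

    close(fd);
    return 0;
}

Running it over a range of sizes on the log disk would give the numbers
needed to pick block and flush sizes for that particular drive and O/S.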


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-08 Thread Curtis Faith

 So you think if I try to write a 1 gig file, it will write enough to
 fill up the buffers, then wait while the sync'er writes out a few blocks
 every second, free up some buffers, then write some more?

 Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when
 it can't get a buffer, it will async write a dirty buffer to disk.

We've addressed this scenario before; if I recall, the point Greg made
earlier is that buffers getting full means writes become synchronous.

What I was trying to point out was that the buffers are very likely to
fill even when they are large, and that the writes are going to be
driven out not by efficient ganging but by something approaching LRU
flushing, with an occasional once-a-second, slightly more efficient
write of 1/32nd of the buffers.

Once the buffers get full, all subsequent writes turn into synchronous
writes, since even if the kernel writes asynchronously (meaning it can do
other work), the writing process can't complete, it has to wait until the
buffer has been flushed and is free for the copy. So the relatively poor
implementation (for database inserts at least) of the syncer mechanism will
cost a lot of performance if we get to this synchronous write mode due to a
full buffer. It appears this scenario is much more likely than I had
thought.

Do you not think this is a potential performance problem to be explored?

I'm only pursuing this as hard as I am because I feel like it's deja vu all
over again. I've done this before and found a huge improvement (12X to 20X
for bulk inserts). I'm not necessarily expecting that level of improvement
here but my gut tells me there is more here than seems obvious on the
surface.

 As far as this AIO conversation is concerned, I want to see someone come
 up with some performance improvement that we can only do with AIO.
 Unless I see it, I am not interested in pursuing this thread.

If I come up with something via aio that helps I'd be more than happy if
someone else points out a non-aio way to accomplish the same thing. I'm by
no means married to any particular solutions, I care about getting problems
solved. And I'll stop trying to sell anyone on aio.

- Curtis


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-08 Thread Curtis Faith

 Your example of 1 trx/proc/rev will work _only_ if no more and no less
 than 1/4 of platter is filled by _other_ log writers.

Not really; if 1/2 the platter has been filled we'll still get one more
commit in for a given rotation. If more than a rotation's worth of writing
has occurred, that means we are bumping into the limit of disk I/O, and
that is a limit we can't do anything about without doing interleaved log
files.

  The case of bulk inserts is one where I would expect that for
 simple tables
  we should be able to peg the disks given today's hardware and enough
  inserting processes.

 bulk inserts should probably be chunked at higher level by inserting
 several records inside a single transaction.

Agreed, that's much more efficient. There are plenty of situations where
the inserts and updates are ongoing rather than initial, Shridhar's
real-world test or TPC benchmarks, for example.

- Curtis


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: Parallel Executors [was RE: [HACKERS] Threaded Sorting]

2002-10-07 Thread Curtis Faith

 Curtis Faith wrote:

  The current transaction/user state seems to be stored in process
  global space. This could be changed to be a pointer to a struct
  stored in a back-end specific shared memory area which would be
  accessed by the executor process at execution start. The backend
  would destroy and recreate the shared memory and restart execution
  in the case where an executor process dies much like the postmaster
  does with backends now.
 
  To the extent the executor process might make changes to the state,
  which I'd try to avoid if possible (don't know if it is), the
  executors could obtain locks, otherwise if the executions were
  constrained to isolated elements (changes to different indexes for
  example) it seems like it would be possible using an architecture
  where you have:

Jan Wieck replied:
 Imagine there is a PL/Tcl function. On the first call in a session, the
 PL/Tcl interpreter get's created (that's during execution, okay?). Now
 the procedure that's called inside of that interpreter creates a
 global variable ... a global Tcl variable inside of that interpreter,
 which is totally unknown to the backend since it doesn't know what Tcl
 is at all and that variable is nothing than an entry in a private hash
 table inside of that interpreter. On a subsequent call to any PL/Tcl
 function during that session, it might be good if that darn hashtable
 entry exists.

 How do you propose to let this happen?

 And while at it, the Tcl procedure next calls spi_exec, causing the
 PL/Tcl function handler to call SPI_exec(), so your isolated executor
 all of the sudden becomes a fully operational backend, doing the
 parsing, planning and optimizing, or what?

You bring up a good point, we couldn't do what I propose for all
situations. I had never anticipated that splitting things up would be the
rule. For example, the optimizer would have to decide whether it made sense
to split up a query from a strictly performance perspective. So now, if we
consider the fact that some things could not be done with split backend
execution, the logic becomes:

if (splitting is possible && splitting is faster)
    do the split execution;
else
    do the normal execution;

Since the design already splits the backend internally into a separate
execution phase, it seems like one could keep the current current
implementation for the typical case where splitting doesn't buy anything or
cases where there is complex state information that needs to be maintained.
If there are no triggers or functions that will be accessed by a given
query then I don't see your concerns applying.

If there are triggers or other conditions which preclude multi-process
execution, we can keep exactly the same behavior as now. The plan execution
entry could easily be a place where it either A) did the same thing it
currently does or B) passed execution off to a pool as per the original
proposal.

I have to believe that most SELECTs won't be affected by your concerns.
Additionally, even in the case of an UPDATE, many times there are large
portions of the operation's actual work that wouldn't be affected even if
there are lots of triggers on the tables being updated. The computation of
the inside of the WHERE could often be split out without causing any
problems with context or state information. The master executor could
always be the original backend as it is now and this would be the place
where the UPDATE part would be processed after the WHERE tuples had been
identified.

As with any optimization, it is more complicated and won't handle all the
cases. It's just an idea to handle common cases that would otherwise be
much slower.

That having been said, I'm sure there are much lower hanging fruit on the
performance tree and likely will be for a little while.

- Curtis


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Curtis Faith

 On Sun, 2002-10-06 at 11:46, Tom Lane wrote:
  I can't personally get excited about something that only helps if your
  server is starved for RAM --- who runs servers that aren't fat on RAM
  anymore?  But give it a shot if you like.  Perhaps your analysis is
  pessimistic.

 snipped I don't find it far fetched to
 imagine situations where people may commit large amounts of memory for
 the database yet marginally starve available memory for file system
 buffers.  Especially so on heavily I/O bound systems or where sporadicly
 other types of non-database file activity may occur.

 snipped Of course, that opens the door for simply adding more memory
 and/or slightly reducing the amount of memory available to the database
 (thus making it available elsewhere).  Now, after all that's said and
 done, having something like aio in use would seemingly allowing it to be
 somewhat more self-tuning from a potential performance perspective.

Good points.

Now for some surprising news (at least it surprised me).

I researched the file system source on my system (FreeBSD 4.6) and found
that the behavior was optimized for non-database access to eliminate
unnecessary writes when temp files are created and deleted rapidly. It was
not optimized to get data to the disk in the most efficient manner.

The syncer on FreeBSD appears to place dirtied filesystem buffers into
work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer
processes one of the queues and increments a counter syncer_delayno.

On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd
of the writes that were buffered are processed. If the syncer gets behind
and the writes for a given second exceed one second to process the syncer
does not wait but begins processing the next queue.
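
Reduced to a standalone sketch, my reading of that bookkeeping looks like
the following; DirtyBuf, flush_to_disk and the list handling are invented
here and are not the kernel's own structures.

#include <stddef.h>

#define SYNCER_MAXDELAY 32

typedef struct DirtyBuf
{
    struct DirtyBuf *next;
    /* block number, data pointer, etc. would live here */
} DirtyBuf;

static DirtyBuf *workqueue[SYNCER_MAXDELAY];    /* one bucket per second */
static int       syncer_delayno = 0;            /* bucket drained this second */

extern void flush_to_disk(DirtyBuf *bp);        /* stand-in for the real write */

/* A buffer dirtied now is filed "delay" seconds into the future. */
void
schedule_dirty_buffer(DirtyBuf *bp, int delay)
{
    int slot = (syncer_delayno + delay) % SYNCER_MAXDELAY;

    bp->next = workqueue[slot];
    workqueue[slot] = bp;
}

/* Once per second: drain exactly one bucket, then advance, whether or
 * not the drain finished quickly.  Nothing here looks at where the disk
 * head is or coalesces neighbouring blocks that sit in other buckets. */
void
syncer_tick(void)
{
    DirtyBuf *bp = workqueue[syncer_delayno];

    workqueue[syncer_delayno] = NULL;
    while (bp != NULL)
    {
        DirtyBuf *next = bp->next;

        flush_to_disk(bp);
        bp = next;
    }
    syncer_delayno = (syncer_delayno + 1) % SYNCER_MAXDELAY;
}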

AFAICT this means that there is no opportunity to have writes combined by
the  disk since they are processed in buckets based on the time the writes
came in.

Also, it seems very likely that many installations won't have enough
buffers for 30 seconds worth of changes and that there would be some level
of SYNCHRONOUS writing because of this delay and the syncer process getting
backed up. This might happen once per second as the buffers get full and
the syncer has not yet started for that second interval.

Linux might handle this better. I saw some emails exchanged a year or so
ago about starting writes immediately in a low-priority way but I'm not
sure if those patches got applied to the linux kernel or not. The source I
had access to seems to do something analogous to FreeBSD but using fixed
percentages of the dirty blocks or a minimum number of blocks. They appear
to be handled in LRU order however.

On-disk caches are much much larger these days so it seems that some way of
getting the data out sooner would result in better write performance for
the cache. My newer drive is a 10K RPM IBM Ultrastar SCSI and it has a 4M
cache. I don't see these caches getting smaller over time so not letting
the disk see writes will become more and more of a performance drain.

- Curtis


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Curtis Faith

Tom, first of all, excellent job improving the current algorithm. I'm glad
you look at the WALCommitLock code.

 This must be so because the backends that are
 released at the end of any given disk revolution will not be able to
 participate in the next group commit, if there is already at least
 one backend ready to commit.

This is the major reason for my original suggestion about using aio_write.
The writes don't block each other and there is no need for a kernel level
exclusive locking call like fsync or fdatasync.

Even the theoretical limit you mention of one transaction per revolution
per committing process seem like a significant bottleneck.

Is committing 1 and then 4 transactions on alternating revolutions good?
It's certainly better than 1 per revolution.

However, what if we could have done 3 transactions per process in the time
it took for a single revolution?

Then we are looking at (1 + 4)/ 2 = 2.5 transactions per revolution versus
the theoretical maximum of (3 * 5) = 15 transactions per revolution if we
can figure out a way to do non-blocking writes that we can guarantee are on
the disk platter so we can return from commit.

Setting aside whether or not aio is viable: do you not agree that
eliminating the blocking would result in potentially a 6X improvement for
the 5-process case?


 So this solution isn't perfect; it would still be nice to have a way to
 delay initiation of the WAL write until just before the disk is ready
 to accept it.  I dunno any good way to do that, though.

I still think that it would be much faster to just keep writing the WAL log
blocks when they fill up and have a separate process wake the commiting
process when the write completes. This would eliminate WAL writing as a
bottleneck.

I have yet to hear anyone say that this can't be done, only that we might
not want to do it because the code might not be clean.

I'm generally only happy when I can finally remove a bottleneck completely,
but speeding one up by 3X like you have done is pretty damn cool for a day
or two's work.

- Curtis


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Curtis Faith

 This is the trickle syncer.  It prevents bursts of disk activity every
 30 seconds.  It is for non-fsync writes, of course, and I assume if the
 kernel buffers get low, it starts to flush faster.

AFAICT, the syncer only speeds up when virtual memory paging fills the
buffers past
a threshold and even in that event it only speeds it up by a factor of two.

I can't find any provision for speeding up flushing of the dirty buffers
when they fill for normal file system writes, so I don't think that
happens.

- Curtis


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Curtis Faith

 Well, too bad.  If you haven't gotten your commit record down to disk,
 then *you have not committed*.  This is not negotiable.  (If you think
 it is, then turn off fsync and quit worrying ;-))

I've never disputed this, so if I seem to be suggesting that, I've been
unclear. I'm just assuming that the disk can get a confirmation back to the
INSERTing process in much less than one rotation. This would allow that
process to start working again, perhaps in time to complete another
transaction.

 An application that is willing to have multiple transactions in flight
 at the same time can open up multiple backend connections to issue those
 transactions, and thereby perhaps beat the theoretical limit.  But for
 serial transactions, there is not anything we can do to beat that limit.
 (At least not with the log structure we have now.  One could imagine
 dropping a commit record into the nearest one of multiple buckets that
 are carefully scattered around the disk.  But exploiting that would take
 near-perfect knowledge about disk head positioning; it's even harder to
 solve than the problem we're considering now.)

Consider the following scenario:

Time measured in disk rotations.

Time 1.00 - Process A Commits - Causing aio_write to log and wait
Time 1.03 - aio_completes for Process A write - wakes process A
Time 1.05 - Process A Starts another transaction.
Time 1.08 - Process A Commits
etc.

I agree that a process can't proceed from a commit until it receives
confirmation of the write, but if the write has hit the disk before a full
rotation then the process should be able to continue processing new
transactions.

 You're failing to distinguish total throughput to the WAL drive from
 response time seen by any one transaction.  Yes, a policy of writing
 each WAL block once when it fills would maximize potential throughput,
 but it would also mean a potentially very large delay for a transaction
 waiting to commit.  The lower the system load, the worse the performance
 on that scale.

You are assuming fsync or fdatasync behavior, I am not. There would be no
delay under the scenario I describe. The transaction would exit commit as
soon as the confirmation of the write is received from the aio system. I
would hope that with a decent aio implementation this would generally be
much less than one rotation.

I think that the single transaction response time is very important and
that's one of the chief problems I sought to solve when I proposed
aio_writes for logging in my original email many moons ago.

 ISTM aio_write only improves the picture if there's some magic in-kernel
 processing that makes this same kind of judgment as to when to issue the
 ganged write for real, and is able to do it on time because it's in
 the kernel.  I haven't heard anything to make me think that that feature
 actually exists.  AFAIK the kernel isn't much more enlightened about
 physical head positions than we are.

All aio_write has to do is pass the write off to the device as soon as
aio_write gets it, bypassing the system buffers. The code on the disk's
hardware is very good at knowing when the disk head is coming. IMHO,
bypassing the kernel's less than enlightened writing system is the main
point of using aio_write.

- Curtis


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



Re: [HACKERS] Analysis of ganged WAL writes

2002-10-07 Thread Curtis Faith

 I may be missing something obvious, but I don't see a way to get more
 than 1 trx/process/revolution, as each previous transaction in that
 process must be written to disk before the next can start, and the only
 way it can be written to the disk is when the disk heads are on the
 right place and that happens exactly once per revolution.

Okay, consider the following scenario.

1) Process A commits when the platter is at 0 degrees.
2) There are enough XLog writes from other processes to fill 1/4 platter
rotation worth of log or 90 degrees. The SCSI drive writes the XLog commit
record and keeps writing other log entries as the head rotates.
3) Process A receives a confirmation of the write before the platter
rotates 60 degrees.
4) Process A continues and adds another commit before the platter rotates
to 90 degrees.

This should be very possible and more and more likely in the future as CPUs
get faster and faster relative to disks.

I'm not suggesting this would happen all the time, just that it's possible
and that an SMP machine with good CPUs and a fast I/O subsystem should be
able to keep the log writing at close to I/O bandwidth limits.

The case of bulk inserts is one where I would expect that for simple tables
we should be able to peg the disks given today's hardware and enough
inserting processes.

- Curtis


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Dirty Buffer Writing [was Proposed LogWriter Scheme]

2002-10-07 Thread Curtis Faith

 Greg Copeland [EMAIL PROTECTED] writes:
  Doesn't this also increase the likelihood that people will be
  running in a buffer-poor environment more frequently than I
  previously asserted, especially in very heavily I/O bound
  systems?  Unless I'm mistaken, that opens the door for a
  general case of why an aio implementation should be looked into.

Neil Conway replies:
 Well, at least for *this specific sitation*, it doesn't really change
 anything -- since FreeBSD doesn't implement POSIX AIO as far as I
 know, we can't use that as an alternative.

I haven't tried it yet but there does seem to be an aio implementation that
conforms to POSIX in FreeBSD 4.6.2.  Its part of the kernel and can be
found in:
/usr/src/sys/kern/vfs_aio.c

 However, I'd suspect that the FreeBSD kernel allows for some way to
 tune the behavior of the syncer. If that's the case, we could do some
 research into what settings are more appropriate for FreeBSD, and
 recommend those in the docs. I don't run FreeBSD, however -- would
 someone like to volunteer to take a look at this?

I didn't see anything obvious in the docs but I still believe there's some
way to tune it. I'll let everyone know if I find some better settings.

 BTW Curtis, did you happen to check whether this behavior has been
 changed in FreeBSD 5.0?

I haven't checked but I will.


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Parallel Executors [was RE: [HACKERS] Threaded Sorting]

2002-10-06 Thread Curtis Faith

tom lane wrote:
 Curtis Faith [EMAIL PROTECTED] writes:
  What about splitting out parsing, optimization and plan generation from
  execution and having a separate pool of executor processes.
 
  As an optimizer finished with a query plan it would initiate execution
  by grabbing an executor from a pool and passing it the plan.
 
 So different executors would potentially handle the queries from a
 single transaction?  How will you deal with pushing transaction-local
 state from one to the other?
 
 Even if you restrict it to switching at transaction boundaries, you
 still have session-local state (at minimum user ID and SET settings)
 to worry about.

Hmmm, what transaction boundaries did you mean? Since we are talking
about single-statement parallelization, there must be some specific
internal semantics that you believe need isolation. It seems like
we'd be able to get most of the benefit and restrict the parallelization
in a way that would preserve this isolation but I'm curious what
you were specifically referring to?

The current transaction/user state seems to be stored in process
global space. This could be changed to be a pointer to a struct
stored in a back-end specific shared memory area which would be
accessed by the executor process at execution start. The backend
would destroy and recreate the shared memory and restart execution
in the case where an executor process dies much like the postmaster
does with backends now.

To the extent the executor process might make changes to the state,
which I'd try to avoid if possible (don't know if it is), the
executors could obtain locks, otherwise if the executions were 
constrained to isolated elements (changes to different indexes for
example) it seems like it would be possible using an architecture
where you have:

Main Executor: Responsible for updating global meta data from
each sub-executor and assembling the results of multiple executions.
In the case of multiple executor sorts, the main executor would
perform a merge sort on its own and its subordinates' pre-sorted
subsets of the relation.

Subordinate Executor: Executes sub-plans and returns results or
meta-data update information into front-end shared memory directly.

To make this optimal, the index code would have to be changed to
support the idea of partial scans. In the case of btrees it would
be pretty easy using the root page to figure out what index values
delineated different 1/2's, 1/3's, 1/4's etc. of the index space.
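
Purely as a hypothetical illustration of carving the key space up from
the root page separators (every name here is invented, and it assumes no
more workers than the root page has child subtrees):

#include <stddef.h>

typedef struct IndexKey IndexKey;   /* opaque key type, for illustration */

typedef struct ScanSlice
{
    const IndexKey *lowKey;         /* NULL means "from the leftmost key" */
    const IndexKey *highKey;        /* NULL means "to the rightmost key" */
} ScanSlice;

/* rootKeys[0..nRootKeys-1] are the separator keys on the root page, in
 * order, so there are nRootKeys + 1 child subtrees.  Hand each of the
 * nWorkers subordinates a contiguous slice bounded by separators.
 * Assumes 1 <= nWorkers <= nRootKeys + 1. */
void
assign_partial_scans(const IndexKey **rootKeys, int nRootKeys,
                     ScanSlice *slices, int nWorkers)
{
    for (int w = 0; w < nWorkers; w++)
    {
        /* slice boundaries expressed as child-subtree numbers 0..nRootKeys */
        int lo = (w * (nRootKeys + 1)) / nWorkers;
        int hi = ((w + 1) * (nRootKeys + 1)) / nWorkers;

        slices[w].lowKey  = (lo == 0) ? NULL : rootKeys[lo - 1];
        slices[w].highKey = (hi > nRootKeys) ? NULL : rootKeys[hi - 1];
    }
}

Each subordinate then runs an ordinary index scan restricted to its
(lowKey, highKey] range, and the main executor merges the sorted slices.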

I'm not sure what you'd have to do to support this for table scans as
I don't know the PostgreSQL tuple storage mechanism, yet.

This does not seem like too much architectural complexity or
performance overhead (even for the single executor case) for a big
gain for complex query performance.

 Being able to apply multiple CPUs to a single query is attractive,
 but I've not yet seen schemes for it that don't look like the extra
 CPU power would be chewed up in overhead :-(.

Do you remember specifc overhead problems/issues?

- Curtis

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



[HACKERS] Anyone else having list server problems?

2002-10-05 Thread Curtis Faith

I've been getting only about 60% of the emails sent to the list. 

I see many emails in the archives that I never got via email.

Is anyone else having this problem?

- Curtis

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Threaded Sorting

2002-10-05 Thread Curtis Faith

tom lane writes:
The notion of a sort process pool seems possibly attractive.  I'm
unconvinced that it's going to be a win though because of the cost of
shoving data across address-space boundaries.  

What about splitting out parsing, optimization and plan generation from
execution and having a separate pool of executor processes.

As an optimizer finishes with a query plan it would initiate execution
by grabbing an executor from a pool and passing it the plan.

This would allow the potential for passing partial plans to multiple
executors so a given query might be split up into three or four pieces
and then executed in parallel with the results passed through a
shared memory area owned by each executor process.

This would allow for potential optimization of sorts without threads or 
incurring the overhead problems you mentioned for a separate sorter
process. The optimizer could do things like split a scan into 3 or 4
pieces before sending it off for execution and then join the pieces
back together.

It could also make complex queries much faster if there are idling CPUs
if the optimizer was updated to take advantage of this.

If we are going to split things apart, then this should be done at a
natural communication boundary right? The code has this logical split
right now anyway so the change would be more natural.

OTOH, there are much bigger fish to fry at the moment, I suspect.

- Curtis

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Curtis Faith

 You are confusing WALWriteLock with WALInsertLock.  A
 transaction-committing flush operation only holds the former.
 XLogInsert only needs the latter --- at least as long as it
 doesn't need to write.

Well, that makes things better than I thought. We still end up with
a disk write for each transaction though, and I don't see how this
can ever get better than (disk RPM) / 60 transactions per second,
since commit fsyncs are serialized. Every fsync will have to wait
almost a full revolution to reach the end of the log.

As a practical matter, then, everyone will use commit_delay to
improve this.
 
 This will pessimize performance except in the case where WAL traffic
 is very heavy, because it means you don't commit until the block
 containing your commit record is filled.  What if you are the only
 active backend?

We could handle this using a mechanism analogous to the current
commit delay. If there are more than commit_siblings other processes
running then do the write automatically after commit_delay seconds.

This would make things no more pessimistic than the current
implementation but provide the additional benefit of allowing the
LogWriter to write in optimal sizes if there are many transactions.
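
The decision the LogWriter would make each time it wakes up could be as
simple as this sketch; the function and OPTIMAL_WRITE_SIZE are invented,
though commit_delay and commit_siblings mirror the existing settings.

#include <stdbool.h>

#define OPTIMAL_WRITE_SIZE (8 * 8192)   /* assumed best write size, in bytes */

bool
logwriter_should_flush(int bufferedBytes,   /* WAL bytes not yet written */
                       int pendingCommits,  /* backends waiting on a commit */
                       int activeBackends,  /* commit_siblings analogue */
                       double usecsSinceLastWrite,
                       double commitDelayUsecs)
{
    /* A full optimal-sized chunk is always worth writing. */
    if (bufferedBytes >= OPTIMAL_WRITE_SIZE)
        return true;

    /* A lone committer should not sit waiting for company. */
    if (pendingCommits > 0 && activeBackends <= 1)
        return true;

    /* Otherwise behave like commit_delay: flush once the delay expires. */
    if (pendingCommits > 0 && usecsSinceLastWrite >= commitDelayUsecs)
        return true;

    return false;
}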

The commit_delay method won't be as good in many cases. Consider
a update scenario where a larger commit delay gives better throughput.
A given transaction will flush after commit_delay milliseconds. The
delay is very unlikely to result in a scenario where the dirty log 
buffers are the optimal size.

As a practical matter I think this would tend to make the writes
larger than they would otherwise have been and this would
unnecessarily delay the commit on the transaction.

 I do not, however, see any
 value in forcing all the WAL writes to be done by a single process;
 which is essentially what you're saying we should do.  That just adds
 extra process-switch overhead that we don't really need.

I don't think that an fsync will ever NOT cause the process to get
switched out so I don't see how another process doing the write would
result in more overhead. The fsync'ing process will block on the
fsync, so there will always be at least one process switch (probably
many) while waiting for the fsync to complete, since we are talking
many milliseconds for the fsync in every case.

  The log file would be opened O_DSYNC, O_APPEND every time.
 
 Keep in mind that we support platforms without O_DSYNC.  I am not
 sure whether there are any that don't have O_SYNC either, but I am
 fairly sure that we measured O_SYNC to be slower than fsync()s on
 some platforms.

Well there is no reason that the logwriter couldn't be doing fsyncs
instead of O_DSYNC writes in those cases. I'd leave this switchable
using the current flags. Just change the semantics a bit.
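
Here is a sketch of what I mean by leaving it switchable, assuming only
that the platform may or may not define O_DSYNC; the function names are
illustrative, not the real xlog.c code.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int
open_wal_segment(const char *path)
{
#ifdef O_DSYNC
    return open(path, O_WRONLY | O_APPEND | O_DSYNC, 0600);
#else
    return open(path, O_WRONLY | O_APPEND, 0600);
#endif
}

/* Returns 0 once the data is durably written, -1 on error. */
int
durable_log_write(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t) len)
        return -1;
#ifndef O_DSYNC
    if (fsync(fd) != 0)             /* platform without O_DSYNC */
        return -1;
#endif
    return 0;
}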

- Curtis

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Curtis Faith

In particular, it would seriously degrade performance if the WAL file
 isn't on its own spindle but has to share bandwidth with
 data file access.

If the OS is stupid I could see this happening. But if there are
buffers and some sort of elevator algorithm the I/O won't happen
at bad times.

I agree with you though that writing for every single insert probably
does not make sense. There should be some blocking of writes. The
optimal size would have to be derived empirically.

 What we really want, of course, is write on every revolution where
 there's something worth writing --- either we've filled a WAL block
 or there is a commit pending.  But that just gets us back into the
 same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon.
 I don't see how an extra process makes that problem any easier.

The whole point of the extra process handling all the writes is so
that it can write on every revolution, if there is something to
write. It doesn't need to care if more commits will arrive soon.

 BTW, it would seem to me that aio_write() buys nothing over plain write()
 in terms of ability to gang writes.  If we issue the write at time T
 and it completes at T+X, we really know nothing about exactly when in
 that interval the data was read out of our WAL buffers.  We cannot
 assume that commit records that were stored into the WAL buffer during
 that interval got written to disk.

Why would we need to make that assumption? The only thing we'd need to
know is that a given write succeeded meaning that commits before that
write are done.

The advantage to aio_write in this scenario is when writes cross track
boundaries or when the head is in the wrong spot. If we write
in reasonable blocks with aio_write the write might get to the disk
before the head passes the location for the write.

Consider a scenario where:

Head is at file offset 10,000.

Log contains blocks 12,000 - 12,500

..time passes..

Head is now at 12,050

Commit occurs writing block 12,501

In the aio_write case the write would already have been done for blocks  
12,000 to 12,050 and would be queued up for some additional blocks up to
potentially 12,500. So the write for the commit could occur without an
additional rotation delay. We are talking 8.5 to 20 milliseconds
delay for this rotation on a single disk. I don't know how often this
happens in actual practice but it might occur as often as every other
time.

- Curtis

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large PerformanceGain in WAL synching

2002-10-05 Thread Curtis Faith

Bruce Momjian wrote:
 So every backend is to going to wait around until its fsync gets done by
 the backend process?  How is that a win?  This is just another version
 of our GUC parameters:
   
   #commit_delay = 0   # range 0-10, in microseconds
   #commit_siblings = 5# range 1-1000
 
 which attempt to delay fsync if other backends are nearing commit.  
 Pushing things out to another process isn't a win;  figuring out if
 someone else is coming for commit is.  

It's not the same at all. My proposal makes two extremely important changes
from a performance perspective.

1) WALWriteLocks are never held by processes for lengthy transactions, only
for long enough to copy the log entry into the buffer. This means real
work can be done by other processes while a transaction is waiting for
its commit to finish. I'm sure that blocking on XLogInsert because another
transaction is performing an fsync is extremely common with frequent-update
scenarios.

2) The log is written using optimal write sizes which is much better than
a user-defined guess of the microseconds to delay the fsync. We should be
able to get the bottleneck to be the maximum write throughput of the disk
with the modifications to Tom Lane's scheme I proposed.

 Remember, write() is fast, fsync is slow.

Okay, it's clear I missed the point about Unix write earlier :-)

However, it's not just saving fsyncs that we need to worry about. It's the
unnecessary blocking of other processes that are simply trying to
append some log records in the course of whatever updating or inserting
they are doing. They may be a long way from commit.

fsync being slow is the whole reason for not wanting to have exclusive
locks held for the duration of an fsync.

On an SMP machine this change alone would probably speed things up by
an order of magnitude (assuming there aren't any other similar locks
causing the same problem).

- Curtis

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Curtis Faith

 So, you are saying that we may get back aio confirmation quicker than if
 we issued our own write/fsync because the OS was able to slip our flush
 to disk in as part of someone else's or a general fsync?
 
 I don't buy that because it is possible our write() gets in as part of
 someone else's fsync and our fsync becomes a no-op, meaning there aren't
 any dirty buffers for that file.  Isn't that also possible?

Separate out the two concepts:

1) Writing of incomplete transactions at the block level by a
background LogWriter. 

I think it doesn't matter whether the write is aio_write or
write, writing blocks when we get them should provide the benefit
I outlined.

Waiting until fsync could miss the opportunity to write before the
head passes the end of the last durable write, because the drive
buffers might empty, causing up to a full rotation's delay.

2) aio_write vs. normal write.

Since as you and others have pointed out aio_write and write are both
asynchronous, the issue becomes one of whether or not the copies to the
file system buffers happen synchronously or not.

This is not a big difference but it seems to me that the OS might be
able to avoid some context switches by grouping copying in the case
of aio_write. I've heard anecdotal reports that this is significantly
faster for some things but I don't know for certain.

 
 Also, remember the kernel doesn't know where the platter rotation is
 either. Only the SCSI drive can reorder the requests to match this. The
 OS can group based on head location, but it doesn't know much about the
 platter location, and it doesn't even know where the head is.

The kernel doesn't need to know anything about platter rotation. It
just needs to keep the disk write buffers full enough not to cause
a rotational latency.

It's not so much a matter of reordering as it is of getting the data
into the SCSI drive before the head passes the last write's position.
If the SCSI drive's buffers are kept full it can continue writing at
its full throughput. If the writes stop and the buffers empty
it will need to wait up to a full rotation before it gets to the end 
of the log again.

 Also, does aio return info when the data is in the kernel buffers or
 when it is actually on the disk?   
 
 Simply, aio allows us to do the write and get notification when it is
 complete.  I don't see how that helps us, and I don't see any other
 advantages to aio.  To use aio, we need to find something that _can't_
 be solved with more traditional Unix API's, and I haven't seen that yet.
 
 This aio thing is getting out of hand.  It's like we have a hammer, and
 everything looks like a nail, or a use for aio.

Yes, while I think its probably worth doing and faster, it won't help as
much as just keeping the drive buffers full even if that's by using write
calls.

I still don't understand the opposition to aio_write. Could we just have
the configuration setup determine whether one or the other is used? I 
don't see why we wouldn't use the faster calls if they were present and
reliable on a given system.

- Curtis

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Curtis Faith

 No question about that!  The sooner we can get stuff to the WAL buffers,
 the more likely we will get some other transaction to do our fsync work.
 Any ideas on how we can do that?

More like the sooner we get stuff out of the WAL buffers and into the
disk's buffers whether by write or aio_write.

It doesn't do any good to have information in the XLog unless it
gets written to the disk buffers before they empty.

 We hesitate to add code relying on new features unless it is a
 significant win, and in the aio case, we would have different WAL disk
 write models for with/without aio, so it clearly could be two code
 paths, and with two code paths, we can't as easily improve or optimize. 
 If we get 2% boost out of some feature,  but it later discourages us
 from adding a 5% optimization, it is a loss.  And, in most cases, the 2%
 optimization is for a few platform, while the 5% optimization is for
 all.  This code is +15 years old, so we are looking way down the road,
 not just for today's hot feature.

I'll just have to implement it and see if it's as easy and isolated as I
think it might be and would allow the same algorithm for aio_write or
write.

 I can't tell you how many aio/mmap/fancy feature discussions we have
 had, and we obviously discuss them, but in the end, they end up being of
 questionable value for the risk/complexity;  but, we keep talking,
 hoping we are wrong or some good ideas come out of it.

I'm all in favor of keeping clean designs. I'm very pleased with how
easy PostgreSQL is to read and understand given how much it does.

- Curtis

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

I wrote:
  ... most file systems can't process fsync's
  simultaneous with other writes, so those writes block because the file
  system grabs its own internal locks.


tom lane replies:
 Oh?  That would be a serious problem, but I've never heard that asserted
 before.  Please provide some evidence.

Well I'm basing this on past empirical testing and having read some man
pages that describe fsync under this exact scenario. I'll have to write
a test to prove this one way or another. I'll also try and look into
the linux/BSD source for the common file systems used for PostgreSQL.

 On a filesystem that does have that kind of problem, can't you avoid it
 just by using O_DSYNC on the WAL files?  Then there's no need to call
 fsync() at all, except during checkpoints (which actually issue sync()
 not fsync(), anyway).


No, they're not exactly the same thing. Consider:

Process A                      File System
---------                      -----------
Writes index buffer            .idling...
Writes entry to log cache      .
Writes another index buffer    .
Writes another log entry       .
Writes tuple buffer            .
Writes another log entry       .
Index scan                     .
Large table sort               .
Writes tuple buffer            .
Writes another log entry       .
Writes                         .
Writes another index buffer    .
Writes another log entry       .
Writes another index buffer    .
Writes another log entry       .
Index scan                     .
Large table sort               .
Commit                         .
File Write Log Entry           .
.idling...                     Write to cache
File Write Log Entry           .idling...
.idling...                     Write to cache
File Write Log Entry           .idling...
.idling...                     Write to cache
File Write Log Entry           .idling...
.idling...                     Write to cache
Write Commit Log Entry         .idling...
.idling...                     Write to cache
Call fsync                     .idling...
.idling...                     Write all buffers to device.
.DONE.

In this case, Process A is waiting for all the buffers to write
at the end of the transaction.

With asynchronous I/O this becomes:

Process A                      File System
---------                      -----------
Writes index buffer            .idling...
Writes entry to log cache      Queue up write - move head to cylinder
Writes another index buffer    Write log entry to media
Writes another log entry       Immediate write to cylinder since head is
                               still there.
Writes tuple buffer            .
Writes another log entry       Queue up write - move head to cylinder
Index scan                     .busy with scan...
Large table sort               Write log entry to media
Writes tuple buffer            .
Writes another log entry       Queue up write - move head to cylinder
Writes                         .
Writes another index buffer    Write log entry to media
Writes another log entry       Queue up write - move head to cylinder
Writes another index buffer    .
Writes another log entry       Write log entry to media
Index scan                     .
Large table sort               Write log entry to media
Commit                         .
Write Commit Log Entry         Immediate write to cylinder since head is
                               still there.
.DONE.

Effectively the real work of writing the cache is done while the CPU
for the process is busy doing index scans, sorts, etc. With the WAL
log on another device and SCSI I/O the log writing should almost always be
done except for the final commit write.

  Whether by threads or multiple processes, there is the same
 contention on
  the file through multiple writers. The file system can decide to reorder
  writes before they start but not after. If a write comes after a
  fsync starts it will have to wait on that fsync.

 AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
 on safety grounds (we want to be sure we have a consistent WAL up to the
 end of what we've written).  Even if we can allow some reordering when a
 single transaction puts out a large volume of WAL data, I fail to see
 where any large gain is going to come from.  We're going to be issuing
 those writes sequentially and that ought to match the disk layout about
 as well as can be hoped anyway.

My comment applied to reads and writes of other processes, not the
WAL log. In my original email, recall I mentioned using the O_APPEND
open flag which will ensure that all log entries are done sequentially.

  Likewise a given process's writes can NEVER be reordered if they are
  submitted synchronously, as is done in the calls to flush the log as
  well as the dirty pages in the buffer in the current code.

 We do not fsync buffer pages; in fact a transaction commit doesn't write
 buffer pages at all.  I think the above is just a misunderstanding of
 what's really 

[HACKERS] fsync exclusive lock evidence WAS: Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

After some research I still hold that fsync blocks, at least on
FreeBSD. Am I missing something?

Here's the evidence:

Code from: /usr/src/sys/kern/vfs_syscalls.c

int
fsync(p, uap)
    struct proc *p;
    struct fsync_args /* {
        syscallarg(int) fd;
    } */ *uap;
{
    register struct vnode *vp;
    struct file *fp;
    vm_object_t obj;
    int error;

    if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
        return (error);
    vp = (struct vnode *)fp->f_data;
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    if (VOP_GETVOBJECT(vp, &obj) == 0)
        vm_object_page_clean(obj, 0, 0, 0);
    if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
        vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
        bioops.io_fsync)
        error = (*bioops.io_fsync)(vp);
    VOP_UNLOCK(vp, 0, p);
    return (error);
}

Notice the calls to:

vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
..
VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC.

From the man pages for VOP_UNLOCK:


HEADER STUFF .


 VOP_LOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_ISLOCKED(struct vnode *vp, struct proc *p);

 int
 vn_lock(struct vnode *vp, int flags, struct proc *p);



DESCRIPTION
 These calls are used to serialize access to the filesystem, such as to
 prevent two writes to the same file from happening at the same time.

 The arguments are:

 vp the vnode being locked or unlocked

 flags  One of the lock request types:

  LK_SHARED Shared lock
  LK_EXCLUSIVE  Exclusive lock
  LK_UPGRADEShared-to-exclusive upgrade
  LK_EXCLUPGRADEFirst shared-to-exclusive upgrade
  LK_DOWNGRADE  Exclusive-to-shared downgrade
  LK_RELEASERelease any type of lock
  LK_DRAIN  Wait for all lock activity to end

The lock type may be or'ed with these lock flags:

  LK_NOWAIT       Do not sleep to wait for lock
  LK_SLEEPFAIL    Sleep, then return failure
  LK_CANRECURSE   Allow recursive exclusive lock
  LK_REENABLE     Lock is to be reenabled after drain
  LK_NOPAUSE      No spinloop

The lock type may be or'ed with these control flags:

  LK_INTERLOCK    Specify when the caller already has a simple
                  lock (VOP_LOCK will unlock the simple lock
                  after getting the lock)
  LK_RETRY        Retry until locked
  LK_NOOBJ        Don't create object

 p  process context to use for the locks

 Kernel code should use vn_lock() to lock a vnode rather than calling
 VOP_LOCK() directly.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

I resent this since it didn't seem to get to the list.

After some research I still hold that fsync blocks, at least on
FreeBSD. Am I missing something?

Here's the evidence:

Code from: /usr/src/sys/kern/vfs_syscalls.c

int
fsync(p, uap)
    struct proc *p;
    struct fsync_args /* {
        syscallarg(int) fd;
    } */ *uap;
{
    register struct vnode *vp;
    struct file *fp;
    vm_object_t obj;
    int error;

    if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
        return (error);
    vp = (struct vnode *)fp->f_data;
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    if (VOP_GETVOBJECT(vp, &obj) == 0)
        vm_object_page_clean(obj, 0, 0, 0);
    if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
        vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
        bioops.io_fsync)
        error = (*bioops.io_fsync)(vp);
    VOP_UNLOCK(vp, 0, p);
    return (error);
}

Notice the calls to:

vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
..
VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC.

From the man pages for VOP_UNLOCK:


HEADER STUFF .


 VOP_LOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_ISLOCKED(struct vnode *vp, struct proc *p);

 int
 vn_lock(struct vnode *vp, int flags, struct proc *p);



DESCRIPTION
 These calls are used to serialize access to the filesystem, such as to
 prevent two writes to the same file from happening at the same time.

 The arguments are:

 vp the vnode being locked or unlocked

 flags  One of the lock request types:

  LK_SHARED Shared lock
  LK_EXCLUSIVE  Exclusive lock
  LK_UPGRADEShared-to-exclusive upgrade
  LK_EXCLUPGRADEFirst shared-to-exclusive upgrade
  LK_DOWNGRADE  Exclusive-to-shared downgrade
  LK_RELEASERelease any type of lock
  LK_DRAIN  Wait for all lock activity to end

The lock type may be or'ed with these lock flags:

  LK_NOWAIT       Do not sleep to wait for lock
  LK_SLEEPFAIL    Sleep, then return failure
  LK_CANRECURSE   Allow recursive exclusive lock
  LK_REENABLE     Lock is to be reenabled after drain
  LK_NOPAUSE      No spinloop

The lock type may be or'ed with these control flags:

  LK_INTERLOCK    Specify when the caller already has a simple
                  lock (VOP_LOCK will unlock the simple lock
                  after getting the lock)
  LK_RETRY        Retry until locked
  LK_NOOBJ        Don't create object

 p  process context to use for the locks

 Kernel code should use vn_lock() to lock a vnode rather than calling
 VOP_LOCK() directly.

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



[HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

It appears the fsync problem is pervasive. Here's Linux 2.4.19's
version from fs/buffer.c:

lock->      down(&inode->i_sem);
            ret = filemap_fdatasync(inode->i_mapping);
            err = file->f_op->fsync(file, dentry, 1);
            if (err && !ret)
                ret = err;
            err = filemap_fdatawait(inode->i_mapping);
            if (err && !ret)
                ret = err;
unlock->    up(&inode->i_sem);

But this is probably not a big factor as you outline below because
the WALWriteLock is causing the same kind of contention.

tom lane wrote:
 This is kind of ugly in general terms but I'm not sure that it really
 hurts Postgres.  In our present scheme, the only files we ever fsync()
 are WAL log files, not data files.  And in normal operation there is
 only one WAL writer at a time, and *no* WAL readers.  So an exclusive
 kernel-level lock on a WAL file while we fsync really shouldn't create
 any problem for us.  (Unless this indirectly blocks other operations
 that I'm missing?)

I hope you're right but I see some very similar contention problems in
the case of many small transactions because of the WALWriteLock.

Assume a Transaction A which writes a lot of buffers and XLog entries,
so the commit forces a relatively lengthy fsync.

Transactions B - E block not on the kernel lock from fsync but on
the WALWriteLock. 

When A finishes the fsync and subsequently releases the WALWriteLock
B unblocks and gets the WALWriteLock for its fsync for the flush.

C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT.

B Releases and now C writes its XLOG_XACT_COMMIT.

There now seems to be a lot of contention on the WALWriteLock. This
is a shame for a system that has no locking at the logical level and
therefore seems like it could be very, very fast and offer
incredible concurrency.

 As I commented before, I think we could do with an extra process to
 issue WAL writes in places where they're not in the critical path for
 a foreground process.  But that seems to be orthogonal from this issue.
 
It's only orthogonal to the fsync-specific contention issue. We now
have to worry about the WALWriteLock semantics causing the same contention.
Your idea of a separate LogWriter process could very nicely solve this
problem and accomplish a few other things at the same time if we make
a few enhancements.

Back-end servers would not issue fsync calls. They would simply block
waiting until the LogWriter had written their record to the disk, i.e.
until the sync'd block # was greater than the block that contained the
XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
ends after its log write returns.

The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
would issue writes of the optimal size when enough data was present or
of smaller chunks if enough time had elapsed since the last write.

The nice part is that the WALWriteLock semantics could be changed to
allow the LogWriter to write to disk while WALWriteLocks are acquired
by back-end servers. WALWriteLocks would only be held for the brief time
needed to copy the entries into the log buffer. The LogWriter would
only need to grab a lock to determine the current end of the log
buffer. Since it would be writing blocks that occur earlier in the
cache than the XLogInsert log writers it won't need to grab a
WALWriteLock before writing the cache buffers.

Many transactions would commit on the same fsync (now really a write
with O_DSYNC) and we would get optimal write throughput for the log
system.
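
Boiled down to a sketch, the LogWriter loop I have in mind would look
roughly like this. Every name is a placeholder, the wraparound of the
shared buffer is ignored, and real code would of course use the existing
shared-memory structures and locks.

#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct LogShared
{
    char      *buffer;          /* shared WAL buffer (treated as flat here) */
    long long  insertedUpTo;    /* advanced by backends under the insert lock */
    long long  writtenUpTo;     /* advanced only by the LogWriter */
} LogShared;

extern long long read_inserted_up_to(LogShared *ls);    /* brief lock only */
extern void      wake_committed_backends(long long durableUpTo);
extern void      wait_for_work_or_timeout(void);

void
logwriter_main(LogShared *ls, int logFd)    /* logFd opened O_DSYNC | O_APPEND */
{
    for (;;)
    {
        long long target = read_inserted_up_to(ls);

        while (ls->writtenUpTo < target)
        {
            size_t  chunk = (size_t) (target - ls->writtenUpTo);
            ssize_t n = write(logFd, ls->buffer + ls->writtenUpTo, chunk);

            if (n <= 0)
                break;              /* real code would retry or PANIC */
            ls->writtenUpTo += n;   /* O_DSYNC: durable once write returns */
        }

        /* Any backend whose commit record ends at or before writtenUpTo
         * can now be released from its commit wait. */
        wake_committed_backends(ls->writtenUpTo);

        wait_for_work_or_timeout(); /* sleep until more WAL shows up */
    }
}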

This would handle all the issues I had and it doesn't sound like a
huge change. In fact, it ends up being almost semantically identical 
to the aio_write suggestion I made originally, except the
LogWriter is doing the background writing instead of the OS and we
don't have to worry about aio implementations and portability.

- Curtis



---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith


Bruce Momjian wrote:
 I am again confused.  When we do write(), we don't have to lock
 anything, do we?  (Multiple processes can write() to the same file just
 fine.)  We do block the current process, but we have nothing else to do
 until we know it is written/fsync'ed.  Does aio more easily allow the
 kernel to order those write?  Is that the issue?  Well, certainly the
 kernel already order the writes.  Just because we write() doesn't mean
 it goes to disk.  Only fsync() or the kernel do that.

We don't have to lock anything, but most file systems can't process
fsyncs simultaneously with other writes, so those writes block because
the file system grabs its own internal locks. The fsync call is more
contentious than typical writes because its duration is usually longer,
so it holds the locks longer, over more pages and structures. That is
the real issue: the contention caused by fsync'ing very frequently,
which blocks other writers and readers.

For the buffer manager, the blocking of readers is probably even more
problematic when the cache is a small percentage (say < 10% to 15%) of
the total database size because most leaf node accesses will result in
a read. Each of these reads will have to wait on the fsync as well. Again,
a very well written file system probably can minimize this but I've not
seen any.

Further comment on:
We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed.

Writing out a bunch of calls at the end, after having consumed a lot
of CPU cycles, and then waiting is not as efficient as writing them out
while those CPU cycles are being used. We are currently wasting the
time it takes for a given process to write.

The thinking probably has been that this is no big deal because other
processes, say B, C and D can use the CPU cycles while process A blocks.
This is true UNLESS the other processes are blocking on reads or
writes caused by process A doing the final writes and fsync.

 Yes, but Oracle is threaded, right, so, yes, they clearly could win with
 it.  I read the second URL and it said we could issue separate writes
 and have them be done in an optimal order.  However, we use the file
 system, not raw devices, so don't we already have that in the kernel
 with fsync()?

Whether by threads or multiple processes, there is the same contention on
the file through multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after a
fsync starts it will have to wait on that fsync.

Likewise a given process's writes can NEVER be reordered if they are
submitted synchronously, as is done in the calls to flush the log as
well as the dirty pages in the buffer in the current code.

 Probably.  Having seen the Informix 5/7 debacle, I don't want to fall
 into the trap where we add stuff that just makes things faster on
 SMP/threaded systems when it makes our code _slower_ on single CPU
 systems, which is exaclty what Informix did in Informix 7, and we know
 how that ended (lost customers, bought by IBM).  I don't think that's
 going to happen to us, but I thought I would mention it.

Yes, I hate improvements that make things worse for most people. Any
changes I'd contemplate would simply be another configuration-driven
optimization that could be turned off very easily.

- Curtis


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith


I wrote:
  I'm no Unix filesystem expert but I don't see how the OS can
  handle multiple writes and fsyncs to the same file descriptors without
  blocking other processes from writing at the same time. It may be that
  there are some clever data structures they use but I've not seen huge
  praise for most of the file systems. A well written file system could
  minimize this contention but I'll bet it's there with most of the ones
  that PostgreSQL most commonly runs on.
  
  I'll have to write a test and see if there really is a problem.

Bruce Momjian wrote:

 Yes, I can see some contention, but what does aio solve?
 

Well, theoretically, aio lets the file system handle the writes without
requiring any locks to be held by the processes issuing those writes.
The disk i/o scheduler can therefore order the writes using spinlocks or
something very fast, since it controls the timing of each of the actual
writes. In some systems this is handled by the kernel and can be very
fast.

I suspect that with large RAID controllers or intelligent disk systems
like EMC this is even more important because they should be able to
handle a much higher level of concurrent i/o.

Now whether or not the common file systems handle this well, I can't say.

Take a look at some comments on how Oracle uses asynchronous I/O:

http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm

It seems that OS support for this will likely increase and that this
issue will become more and more important as users contemplate SMP systems
or if threading is added to certain PostgreSQL subsystems.

It might be easier for me to implement the change I propose and then
see what kind of difference it makes.

I wanted to run the idea past this group first. We can all postulate
whether or not it will work, but we won't know unless we try it. My real
question is what happens in the event that it does work.

I've had very good luck implementing this sort of thing for other systems
but I don't yet know the range of i/o requests that PostgreSQL makes.

Assuming we can demonstrate no detrimental effects on system reliability
and that the change is implemented in such a way that it can be turned
on or off easily, would a 50% or better increase in speed for updates
justify the sort of change I am proposing? 20%? 10%?

- Curtis

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Advice: Where could I be of help?

2002-10-03 Thread Curtis Faith

tom lane wrote:
 But more globally, I think that our worst problems these days have to do
 with planner misestimations leading to bad plans.  The planner is
 usually *capable* of generating a good plan, but all too often it picks
 the wrong one.  We need work on improving the cost modeling equations
 to be closer to reality.  If that's at all close to your sphere of
 interest then I think it should be #1 priority --- it's both localized,
 which I think is important for a first project, and potentially a
 considerable win.

This seems like a very interesting problem. One approach that would solve
the problem of trying to figure out the right numbers is to start with
guesses for the actual values, based on statistics gathered during vacuum
and during general running, and then have the planner run the best plan.

Then, during execution, if the planner turned out to be VERY wrong about
certain assumptions, the execution system could update the stats that led to
those wrong assumptions. That way the system would seek the correct values
automatically. We could also gather the stats that the system produces for
certain actual databases and then use those to make smarter initial guesses.

I've found that I can never predict costs. I always end up testing
empirically and find myself surprised at the results.

We should be able to make the executor smart enough to keep count of actual
costs (or a statistical approximation) without introducing any significant
overhead.
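
Just to illustrate the flavor of what I mean (these names and structures
are made up for the sketch; nothing like this exists in the code today):
at the end of a scan node the executor could fold the actual row count
back into a per-statistic correction factor with an exponential moving
average, which the planner would multiply into its next estimate.

/* Illustration only -- hypothetical structure and names. */
typedef struct SelectivityStat
{
    double correction;                  /* starts at 1.0 */
} SelectivityStat;

#define FEEDBACK_ALPHA 0.1              /* how quickly new evidence is trusted */

/* Called when a scan node finishes: fold actual vs. estimated rows into
 * the correction factor -- a couple of floating point operations, so the
 * overhead is negligible. */
static void feedback_rows(SelectivityStat *stat,
                          double estimated_rows, double actual_rows)
{
    double observed = actual_rows /
                      (estimated_rows > 0.0 ? estimated_rows : 1.0);

    stat->correction = (1.0 - FEEDBACK_ALPHA) * stat->correction
                     + FEEDBACK_ALPHA * observed;
}

/* Planner side: rows_estimate *= stat->correction before costing. */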

tom lane also wrote:
 There is no cache flushing.  We have a shared buffer cache management
 algorithm that's straight LRU across all buffers.  There's been some
 interest in smarter cache-replacement code; I believe Neil Conway is
 messing around with an LRU-2 implementation right now.  If you've got
 better ideas we're all ears.

Hmmm, this is the area that I think could lead to huge performance gains.

Consider a simple system with a table tbl_master that gets read by each
process many times but with very infrequent inserts and that contains about
3,000 rows. The single but heavily used index for this table is contained in
a btree with a depth of three, with twenty 8K pages in the first two levels
of the btree.

There is another table tbl_detail, with 10 indices, that gets very frequent
inserts. It has over 300,000 rows. Some queries result in index scans over
the approximately 5,000 8K pages in its index.

There is a 40M shared cache for this system.

Every time a query which requires the index scan runs, it will blow out the
entire cache, since the scan will load more blocks than the cache holds. Only
blocks that are accessed while the scan is in progress will survive. LRU is
bad, bad, bad!

LRU-2 might be better but it seems like it still won't give enough priority
to the most frequently used blocks. I don't see how it would do better for
the above case.

I once implemented a modified cache algorithm that was based on the clock
algorithm for VM page caches. VM paging is similar to databases in that
there is definite locality of reference and certain pages are MUCH more
likely to be requested.

The basic idea was to have a flag in each block that represented the access
time in clock intervals. Imagine a clock hand sweeping across a clock; every
access is like a tiny movement of the clock hand. Blocks that are not
accessed during a sweep are candidates for removal.

My modification was to use access counts to increase the durability of the
more frequently accessed blocks. Each time a block is accessed its flag is
shifted left (up to a maximum number of shifts, ShiftN) and 1 is added to it.
Every so many cache accesses (and synchronously when the cache is full) a
pass is made over each block, right-shifting the flags (a clock sweep). This
can also be done one block at a time on each access, so the clock is directly
linked to the cache access rate. Any block whose flag reaches 0 is placed into
a doubly linked list of candidates for removal. New cache blocks are allocated
from the list of candidates. Accessing a block on the candidate list simply
removes it from the list.
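
In rough C terms the per-access and sweep logic is only a few lines.
(BufDesc and its fields are placeholders for this sketch, not the real
buffer descriptor, and the candidate-list handling is omitted.)

#define SHIFT_N     5
#define MAX_WEIGHT  ((1u << SHIFT_N) - 1)   /* all ShiftN bits set */

typedef struct BufDesc
{
    unsigned int weight;     /* the access flag described above */
    /* ... page identity, pin count, candidate-list links, ... */
} BufDesc;

/* On every cache hit: shift left and add one, saturating at ShiftN bits. */
static void on_access(BufDesc *buf)
{
    buf->weight = (buf->weight << 1) | 1;
    if (buf->weight > MAX_WEIGHT)
        buf->weight = MAX_WEIGHT;
}

/* The clock sweep, run every so many accesses or when the cache fills:
 * right-shift every weight; a block whose weight reaches zero becomes a
 * replacement candidate. */
static void sweep(BufDesc *bufs, int nbufs)
{
    int i;

    for (i = 0; i < nbufs; i++)
    {
        bufs[i].weight >>= 1;
        if (bufs[i].weight == 0)
        {
            /* move the block onto the doubly linked candidate list;
             * the list handling is omitted from this sketch */
        }
    }
}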

An index root node page would likely be accessed frequently enough that
all its bits would be set, so it would survive ShiftN clock sweeps before
becoming a removal candidate.

This algorithm increased the cache hit ratio from 40% to about 90% for the
cases I tested when compared to a simple LRU mechanism. The paging ratio is
greatly dependent on the ratio of the actual database size to the cache
size.

The bottom line is that it is very important to keep frequently accessed
blocks in the cache. The top levels of large btrees are accessed many
hundreds (actually a power of the number of keys in each page) of times more
frequently than the leaf pages. LRU can be the worst possible algorithm for
something like an index or table scan of large tables, since it flushes a
large number of potentially frequently accessed blocks in favor of ones that
are very unlikely to be retrieved again.

tom lane also wrote:
 This is an 

[HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Curtis Faith

I've been looking at the TODO lists and caching issues and think there may
be a way to greatly improve the performance of the WAL.

I've made the following assumptions based on my reading in the manual and
the WAL archives since about November 2000:

1) WAL is currently fsync'd before commit succeeds. This is done to ensure
that the D in ACID is satisfied.
2) The wait on fsync is the biggest time cost for inserts or updates.
3) fsync itself probably increases contention for file i/o on the same file
since some OS file system cache structures must be locked as part of fsync.
Depending on the file system this could be a significant choke on total i/o
throughput.

The issue is that there must be a definite record in durable storage for the
log before one can be certain that a transaction has succeeded.

I'm not familiar with the exact WAL implementation in PostgreSQL but am
familiar with others including ARIES II, however, it seems that it comes
down to making sure that the write to the WAL log has been positively
written to disk.

So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
and then use aio_write for all log writes? A transaction would simply do all
the log writing using aio_write and block until the last log aio request
has completed, using aio_waitcomplete. The call to aio_waitcomplete won't
return until the log record has been written to the disk. Opening with
O_DSYNC ensures that when the i/o completes the write has reached the
disk, and aio_write on files opened with O_APPEND ensures that writes append
in the order they are received, hence when the aio_write for the last log
entry for a transaction completes, the transaction can be sure that its log
records are in durable storage (IDE problems aside).
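
As a rough sketch of what the commit path would look like (placeholder
names, and the portable aio_suspend/aio_error/aio_return combination
standing in for the BSD-specific aio_waitcomplete):

/* Sketch of the proposed commit path, not actual PostgreSQL code. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>

static int wal_fd;

static void wal_open(const char *path)
{
    /* O_DSYNC: a completed write is on stable storage.
     * O_APPEND: writes land at the end of the log in submission order. */
    wal_fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0600);
}

/* Queue one log record and return immediately. */
static void wal_write_record(struct aiocb *cb, const void *rec, size_t len)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = wal_fd;
    cb->aio_buf    = (void *) rec;
    cb->aio_nbytes = len;
    aio_write(cb);
}

/* Commit: block only until the last record of this transaction (the one
 * containing the commit record) has reached the disk. */
static int wal_commit_wait(struct aiocb *last_cb)
{
    const struct aiocb *list[1] = { last_cb };

    while (aio_error(last_cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    return (aio_return(last_cb) < 0) ? -1 : 0;
}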

It seems to me that this would:

1) Preserve the required D semantics.
2) Allow transactions to complete and do work while other threads are
waiting on the completion of the log write.
3) Obviate the need for commit_delay, since there is no blocking and the
file system and the disk controller can put multiple writes to the log
together as the drive is waiting for the end of the log file to come under
one of the heads.

Here are the relevant TODO's:

Delay fsync() when other backends are about to commit too [fsync]
Determine optimal commit_delay value

Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
   Allow multiple blocks to be written to WAL with one write()


Am I missing something?

Curtis Faith
Principal
Galt Capital, LLP

--
Galt Capital                          http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549                  voice: 340.776.0144
Charlotte Amalie, St. Thomas          fax:   340.776.0244
United States Virgin Islands 00801    cell:  340.643.5368


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Curtis Faith


tom lane replies:
 Curtis Faith [EMAIL PROTECTED] writes:
  So, why don't we use files opened with O_DSYNC | O_APPEND for 
 the WAL log
  and then use aio_write for all log writes?
 
 We already offer an O_DSYNC option.  It's not obvious to me what
 aio_write brings to the table (aside from loss of portability).
 You still have to wait for the final write to complete, no?

Well, for starters, by the time the write that includes the commit
log entry happens, much of the log for the transaction will already
be on disk, or in a controller on its way there.

I don't see any O_NONBLOCK or O_NDELAY references in the sources,
so it looks like the log writes are blocking. If I read correctly,
XLogInsert calls XLogWrite, which calls write, which blocks. If these
assumptions are correct, there should be some significant gain here, but I
won't know how much until I try to change it. This issue only affects the
speed of a given back-end's transaction processing capability.

The REAL issue, and the one that will greatly affect total system
throughput, is contention on the file locks. Since fsync needs to
obtain a write lock on the file descriptor, as do the write calls that
originate from XLogWrite as the log is written to the disk, other
back-ends will block while another transaction is committing if the
log cache fills to the point where their XLogInsert results in an
XLogWrite call to flush the log cache. I'd guess this means that one
won't gain much by adding back-end processes past three or four
if there are a lot of inserts or updates.

The method I propose does not result in any blocking because of writes
other than the final commit's write and it has the very significant
advantage of allowing other transactions (from other back-ends) to
continue until they enter commit (and blocking waiting for their final
commit write to complete).

  2) Allow transactions to complete and do work while other threads are
  waiting on the completion of the log write.
 
 I'm missing something.  There is no useful work that a transaction can
 do between writing its commit record and reporting completion, is there?
 It has to wait for that record to hit disk.

The key here is that a thread that has not committed and therefore is
not blocking can do work while other threads (should have said back-ends 
or processes) are waiting on their commit writes.

- Curtis

P.S. If I am right in my assumptions about the way the current system
works, I'll bet the change would speed up inserts in Shridhar's huge
database test by at least a factor of two or three, perhaps even an
order of magnitude. :-)

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, October 03, 2002 7:17 PM
 To: Curtis Faith
 Cc: Pgsql-Hackers
 Subject: Re: [HACKERS] Potential Large Performance Gain in WAL synching 
 
 
 Curtis Faith [EMAIL PROTECTED] writes:
  So, why don't we use files opened with O_DSYNC | O_APPEND for 
 the WAL log
  and then use aio_write for all log writes?
 
 We already offer an O_DSYNC option.  It's not obvious to me what
 aio_write brings to the table (aside from loss of portability).
 You still have to wait for the final write to complete, no?
 
  2) Allow transactions to complete and do work while other threads are
  waiting on the completion of the log write.
 
 I'm missing something.  There is no useful work that a transaction can
 do between writing its commit record and reporting completion, is there?
 It has to wait for that record to hit disk.
 
   regards, tom lane
 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Advice: Where could I be of help?

2002-10-03 Thread Curtis Faith

I wrote:

  My modification was to use access counts to increase the
 durability of the
  more accessed blocks.


tom lane replies:
 You could do it that way too, but I'm unsure whether the extra
 complexity will buy anything.  Ultimately, I think an LRU-anything
 algorithm is equivalent to a clock sweep for those pages that only get
 touched once per some-long-interval: the single-touch guys get recycled
 in order of last use, which seems just like a clock sweep around the
 cache.  The guys with some amount of preference get excluded from the
 once-around sweep.  To determine whether LRU-2 is better or worse than
 some other preference algorithm requires a finer grain of analysis than
 this.  I'm not a fan of more complex must be better, so I'd want to see
 why it's better before buying into it ...

I'm definitely not a fan of "more complex must be better" either. In fact,
it's surprising how often the real performance problems are simple and easy
to fix, while many person-years are spent solving the issue everyone knows
must be causing the performance problems, only to find little gain.

The key here is empirical testing. If the cache hit ratio for LRU-2 is
much better, then there may be no need here. OTOH, it took less than
30 lines or so of code to do what I described, so I don't consider
it too, too much more complex :=} We should run a test which includes
indexes (or is indices the PostgreSQL convention?) that are three
or more times the size of the cache to see how well LRU-2 works. Is there
any cache performance reporting built into pgsql?

tom lane wrote:
 Shouldn't the OS be responsible for scheduling those writes
 appropriately?  Ye good olde elevator algorithm ought to handle this;
 and it's at least one layer closer to the actual disk layout than we
 are, thus more likely to issue the writes in a good order.  It's worth
 experimenting with, perhaps, but I'm pretty dubious about it.

I wasn't proposing anything other than changing the order of the writes,
not actually ensuring that they get written that way at the level you
describe above. This will help a lot on brain-dead file systems that
can't do this ordering and probably also in cases where the number
of blocks in the cache is very large.

On a related note, while looking at the code, it seems to me that we
are writing out the buffer cache synchronously, so there won't be
any possibility of the file system reordering anyway. This appears to be
a huge performance problem. I've read claims in the archives that
the buffers are written asynchronously, but my read of the
code says otherwise. Can someone point out my error?

I only see calls that ultimately call FileWrite or write(2), which will
block without an O_NONBLOCK open. I thought one of the main reasons
for having a WAL is so that you can write out the buffers asynchronously.

What am I missing?

I wrote:
  Then during execution if the planner turned out to be VERY wrong about
  certain assumptions the execution system could update the stats
 that led to
  those wrong assumptions. That way the system would seek the
 correct values
  automatically.

tom lane replied:
 That has been suggested before, but I'm unsure how to make it work.
 There are a lot of parameters involved in any planning decision and it's
 not obvious which ones to tweak, or in which direction, if the plan
 turns out to be bad.  But if you can come up with some ideas, go to
 it!

I'll have to look at the current planner before I can suggest
anything concrete.

- Curtis


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Curtis Faith

I wrote:
  The REAL issue and the one that will greatly affect total system
  throughput is that of contention on the file locks. Since
 fsynch needs to
  obtain a write lock on the file descriptor, as does the write
 calls which
  originate from XLogWrite as the writes are written to the disk, other
  back-ends will block while another transaction is committing if the
  log cache fills to the point where their XLogInsert results in a
  XLogWrite call to flush the log cache.

tom lane wrote:
 But that's exactly *why* we have a log cache: to ensure we can buffer a
 reasonable amount of log data between XLogFlush calls.  If the above
 scenario is really causing a problem, doesn't that just mean you need
 to increase wal_buffers?

Well, in cases where there are a lot of small transactions, the contention
will not come from XLogWrite calls caused by the cache getting full but from
XLogWrite calls caused by transaction commits, which will happen very
frequently. I think this will have a detrimental effect on very high update
frequency performance.

So while larger WAL caches will help in the case of flushing because the
cache is full, I don't think they will make any difference for the
potentially more common case of transaction commits.

- Curtis


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Curtis Faith

Bruce Momjian wrote:
 I may be missing something here, but other backends don't block while
 one writes to WAL.

I don't think they'll block until they get to the fsync or XLogWrite
call while another transaction is fsync'ing.

I'm no Unix filesystem expert but I don't see how the OS can
handle multiple writes and fsyncs to the same file descriptors without
blocking other processes from writing at the same time. It may be that
there are some clever data structures they use but I've not seen huge
praise for most of the file systems. A well written file system could
minimize this contention but I'll bet it's there with most of the ones
that PostgreSQL most commonly runs on.

I'll have to write a test and see if there really is a problem.
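
If I do, it will be something along the lines of this little two-process
probe (sketch only; the file name, sizes and counts are arbitrary): the
child fsyncs the file in a tight loop while the parent times plain 8K
write()s to the same file.

/* Contention probe, sketch only -- not production code. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char   page[8192];
    int    fd = open("fsync_probe.dat", O_WRONLY | O_CREAT | O_APPEND, 0600);
    pid_t  child;
    int    i;

    memset(page, 'x', sizeof(page));

    child = fork();
    if (child == 0)                       /* child: the fsync'er */
    {
        for (;;)
        {
            write(fd, page, sizeof(page));
            fsync(fd);
        }
    }

    for (i = 0; i < 1000; i++)            /* parent: time plain writes */
    {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        write(fd, page, sizeof(page));
        gettimeofday(&t1, NULL);

        printf("%ld usec\n", (t1.tv_sec - t0.tv_sec) * 1000000L
                           + (t1.tv_usec - t0.tv_usec));
    }

    kill(child, SIGKILL);
    return 0;
}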

- Curtis

 -Original Message-
 From: Bruce Momjian [mailto:[EMAIL PROTECTED]]
 Sent: Friday, October 04, 2002 12:44 AM
 To: Curtis Faith
 Cc: Tom Lane; Pgsql-Hackers
 Subject: Re: [HACKERS] Potential Large Performance Gain in WAL synching


 Curtis Faith wrote:
  The method I propose does not result in any blocking because of writes
  other than the final commit's write and it has the very significant
  advantage of allowing other transactions (from other back-ends) to
  continue until they enter commit (and blocking waiting for their final
  commit write to complete).
 
2) Allow transactions to complete and do work while other
 threads are
waiting on the completion of the log write.
  
   I'm missing something.  There is no useful work that a transaction can
   do between writing its commit record and reporting
 completion, is there?
   It has to wait for that record to hit disk.
 
  The key here is that a thread that has not committed and therefore is
  not blocking can do work while other threads (should have
 said back-ends
  or processes) are waiting on their commit writes.

 I may be missing something here, but other backends don't block while
 one writes to WAL.  Remember, we are proccess based, not thread based,
 so the write() call only blocks the one session.  If you had threads,
 and you did a write() call that blocked other threads, I can see where
 your idea would be good, and where async i/o becomes an advantage.

 --
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square,
 Pennsylvania 19073



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



[HACKERS] Advice: Where could I be of help?

2002-10-02 Thread Curtis Faith

All,

I'd like to help work on some 7.4 features, however, since you've not seen
my name before, I'm obviously new to the list and the org.

I really like working on speed optimizations and rewrites. I have 15 years
of experience with C++-based systems and databases, and have worked on
commercial database engines (i.e. indexing and query execution systems), SQL
execution and optimization, and various lex- and yacc-based compilers and
parsers. I've generally been able to get code to perform as well as or better
than competitive systems with similar functionality, and have usually been
able to beat other code by 3 to 10 X. My Unix experience is reasonable but
I'm not an expert.

Any suggestions for where to start? I don't mind digging into very hairy
code or large problems. I'm willing to run the risk of a patch not being
accepted (for large changes), since I'll make sure that whatever I do is
well known to those who will do the accept/deny and that the approach is
approved ahead of time.

Since I'm new here, I'm thinking a problem that would not otherwise get
handled by the experienced group would be the best place to start. Where is
the system especially slow?

I've read the TODO's, and the last five months of the archives for this
list, so I have some general ideas.

I've also had a lot of experience marketing to I.T. organizations, so I'd be
happy to help out on product marketing for PostgreSQL advocacy, i.e.
developing a marketing strategy, press releases, etc.

- Curtis

Curtis Faith
Principal
Galt Capital, LLP

--
Galt Capital                          http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549                  voice: 340.776.0144
Charlotte Amalie, St. Thomas          fax:   340.776.0244
United States Virgin Islands 00801    cell:  340.643.5368


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



FW: [HACKERS] Advice: Where could I be of help?

2002-10-02 Thread Curtis Faith

Forgot to cc' the list.

-Original Message-
From: Curtis Faith [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 02, 2002 10:59 PM
To: Tom Lane
Subject: RE: [HACKERS] Advice: Where could I be of help?


Tom,

Here are the things that I think look interesting:

1) Eliminate unchanged column indices:
Prevent index uniqueness checks when UPDATE does not modify the column

A small task that will make a noticeable improvement. I've done this
before in a b*-tree system, and it had a huge impact. Should be pretty
isolated.

2) Use indexes for min() and max(), or convert to SELECT col FROM tab ORDER
BY col DESC LIMIT 1 if an appropriate index exists and the WHERE clause is
acceptable - This will probably be a little more involved, but I did this
exact optimization in a SQL system 6 or 7 years ago.

3) General cache and i/o optimization:

Use bitmaps to fetch heap pages in sequential order

Based on my reading of the emails in [performance] it appears to me that
there might be huge potential in the caching system. I've worked on these
caches and there are some very non-intuitive interactions between database
type access and file systems that I believe offer good potential for
improvement. I'm basing this assessment on the assumption that the sorts of
improvements discussed in the [performance] emails have not been added in
subsequent releases.

Where does the current code stand? How are we currently doing cache flushing
in general and for indices in particular?

4) General index improvements including:
Order duplicate index entries by tid for faster heap lookups
Add FILLFACTOR to btree index creation

I've done the first one before and fill factor is pretty easy, as well.

5) Bitmaps:
Implement a bitmap index:
Use bitmaps to combine existing indexes
I've done something similar; it looks pretty interesting.

6) Improve concurrency of hash indexes (Neil Conway) - Probably more
exploration than implementation, and a fairly isolated problem.

Based on past experience, from a bang-for-buck perspective, I'd probably do
these in numerical order. What do you think? I know what I like and can
do, but I don't really know enough about PostgreSQL's performance weaknesses
yet.

What are we getting killed on?

- Curtis

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Tom Lane
Sent: Wednesday, October 02, 2002 6:55 PM
To: Curtis Faith
Cc: [EMAIL PROTECTED]
Subject: Re: [HACKERS] Advice: Where could I be of help?


Bruce Momjian [EMAIL PROTECTED] writes:
 I would read the developers corner stuff, the developers FAQ, pick a
 TODO item, and try a patch.  It's that simple.

Yup.  I'd also suggest starting with something relatively small and
localized (the nearby suggestion to fix IN/EXISTS, for example, is
probably not a good first project --- and anyway I was going to work
on that myself this month ;-)).

Neil Conway's thought of working on plpgsql seems a good one to me;
and as he says, there's lots of other possibilities.  What do you
find interesting in the TODO list?

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

