Re: [HACKERS] POSTGRES WAL

2007-02-28 Thread Takayuki Tsunakawa
Is there any method or utility to convert the content of WAL files
into a human-readable format?


xlogdump (or xlog viewer) might help.  Sorry, I've never used it yet.
http://pgfoundry.org/projects/xlogviewer/





Re: [HACKERS] Dead Space Map version 2

2007-02-28 Thread Takayuki Tsunakawa
Hello, long time no see.

This topic looks interesting.  I'm envious of Itagaki-san and others.
I can't do what I want to now, due to other work that I don't want to
do (isn't my boss seeing this?).  I wish I could join the community
some day and contribute to the development like the great experts
here.
# I can't wait to try Itagaki-san's latest patch for load distributed
checkpoint in my environment and report the result.
# But I may not have enough time...

Let me give some comments below.


From: "Heikki Linnakangas" <[EMAIL PROTECTED]>
> While I understand that 100% reliable coverage is a significantly
> stronger guarantee, I don't see any particular problems in
implementing
> that. WAL logging isn't that hard.
>
> I won't insist, I'm not the one doing the programming after all.
> Anything is better than what we have now. However, I do hope that
> whatever is implemented doesn't need a complete rewrite to make it
100%
> reliable in the future.
>
> The basic wish I have is to not use a fixed size shared memory area
like
> FSM for the DSM. I'd like it to use the shared buffers instead,
which
> makes the memory management and tuning easier. And it also makes it
> easier to get the WAL logging right, even if it's not done for 8.3
but
> added later.
>

I hope for the same thing as Heikki-san.  Though I'm relatively new to
the PostgreSQL source code, I don't think it is very difficult (at
least for the experts here) to implement a reliable space management
scheme, so I proposed the following before -- not a separate memory
area for the FSM, but treating it the same way as data files in the
shared buffers.  Though Tom-san is worried about performance, what
would make the performance degrade greatly?  Additional WAL records
for updating space management structures are written sequentially in
batches.  Additional dirty shared buffers are written efficiently by
the kernel (at least for now.)  And PostgreSQL is freed from the giant
lwlock for the FSM.
Some performance degradation would surely result.  However,
reliability is more important, because vacuum is almost the greatest
concern for real, serious users (not for hobbyists who enjoy
performance.)  Can anybody say to users "we are working hard, but our
work may not be reliable and sometimes fails.  Can you check whether
our vacuuming effort failed and try this...?"

And I'm afraid that increasing the number of configuration parameters
is unacceptable to users; it is merely an excuse of developers.
PostgreSQL already has more than 100 parameters.  Some of them, such
as bgwriter_*, are difficult for normal users to understand.  It's
best to use the shared_buffers parameter and show how to set it in the
documentation.
Addressing the vacuum problem correctly is very important.  I hope you
don't introduce new parameters for unfinished work and force users to
check the manual to change the parameters in later versions, i.e.
"managed_* parameters are not supported from this release.  Please use
shared_buffers..."  Is it a "must" to release 8.3 by this summer?  I
think that delaying the release a bit for a correct (reliable) vacuum
resolution is worth it.


From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
> Yes! I'm completely in favor of Itagaki-san.  Separating the cache
for
> FSM may produce a new configuration parameter like fsm_cache_size,
> which the normal users would not desire (unless they like enjoying
> difficult DBMS.)
> I think that integrating the treatment of space management structure
> and data area is good.  That means, for example, implementing "Free
> Space Table" described in section 14.2.2.1 of Jim Gray's book
> "Transaction Processing: Concepts and Techniques", though it may
have
> been discussed in PostgreSQL community far long ago (really?).  Of
> course, some refinements may be necessary to tune to PostgreSQL's
> concept, say, creating one free space table file for each data file
to
> make the implementation easy.  It would reduce the source code
solely
> for FSM.
>
> In addition, it would provide the transactional space management.
If
> I understand correctly, in the current implementation, updates to
FSM
> are lost when the server crashes, aren't they?  The idea assumes
that
> FSM will be rebuilt by vacuum because vacuum is inevitable.  If
> updates to space management area were made transactional, it might
> provide the infrastructure for "vacuumless PostgreSQL."





Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?

2007-02-09 Thread Takayuki Tsunakawa
From: "Magnus Hagander" <[EMAIL PROTECTED]>
> Right. Which is why you're likely to see better performance if you
keep
> shared buffers smaller. There is something in dealing with it that's
> slow on win32, per reports from the field. It needs to be
investigated
> further...
> We've had reports that it's slow with large shared_buffers, yes.

That's shocking news.  I'm sad.
I wonder whether the field you are talking about set Windows to use
more memory for programs than for the filesystem cache, which is
selectable from the [System] applet of the Control Panel (Oh, I wonder
how my machine is set in this respect... I have to check.)  If the
filesystem cache is preferred, the following scenario may be possible:

1. PostgreSQL tries to read data from disk into database cache.
2. The kernel tries to allocate filesystem buffers by paging out
PostgreSQL's memory (possibly shared buffers).
3. PostgreSQL finds data requested by its clients in database cache,
and tries to get it in memory.
4. But the shared buffers are paged out, and page-ins happen.

> Are you sure you're not running this on for example
> IDE disks with write-cache that lies? Windows will write through
> that write-cache even if the disk lies, whereas most linux versions
> won't. At least that used to be the case not too long ago, but there
> has also been talk about fixing that in linux, so maybe that's
> done...

I'm using a PC server whose disks are all SCSI.  It has no IDE disk.

> Also note that when you run pg_bench on the local machine, you take
> a much higher hit from the fact that context switching between
> processes is a lot more expensive on Windows than it is on Linux.
> But it shouldn't be big enough to explain the huge difference you
> had in your test.

Yes, I suspect it, too.  That is why Oracle uses one multi-threaded
server process on Windows, while it employs a multi-process
architecture on UNIX.  SQL Server is of course multi-threaded.  SRA's
original PostgreSQL for Windows (based on 7.x) was also
multi-threaded.







Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?

2007-02-08 Thread Takayuki Tsunakawa
From: "Magnus Hagander" <[EMAIL PROTECTED]>
> hnetcfg.dll is a part of Windows. "Home Networking Configuration
> Manager". LPK.DLL is also a part of Windows - it's the language
pack.

Thank you for the information.


> On Thu, Feb 08, 2007 at 09:50:26PM +0900, Takayuki Tsunakawa wrote:
>> When I try to start PostgreSQL 8.2.1 on Windows 2003 Server with
>> shared_buffers=1024MB, I get the following error messages in the
>> Event Log (with log_min_messages=debug5) and can't start PostgreSQL:
>
> Is this for testing, or for production? From what I've heard, you
> would normally never want that much shared memory - I've seen more
> reports that you should keep it as low as possible, really. For
> performance reasons.

For testing.  I wanted to place all data in shared buffers to
eliminate reads from disk while I run pgbench repeatedly (actually
most reads should come from kernel cache, though.)

Does PostgreSQL for Windows have any problem when using a large
database cache, unlike the UNIX versions?  I'm excited about your
current great work to enable building all of PostgreSQL with MSVC.  I
thought you were aiming at making PostgreSQL 64-bit on Windows in the
near future (though you may not have said so on the mailing list.)
I'm afraid MinGW will not extend to 64-bit (for x64 and Itanium), at
least reliably and immediately, due to the difference in data models
-- 'long' is still 32-bit in 64-bit applications on Windows.  I
thought Magnus-san got worried about that and started the activity of
completely switching to MSVC.

BTW, the current PostgreSQL for Windows is very slow, isn't it?  I
compared the performance of PostgreSQL 8.2.x on Linux (RHEL4 for x86,
kernel 2.6.x) and on Windows Server 2003.  I ran 'pgbench -c32 -t500'
on the same machine with the same disk layout for data files and WAL,
i.e. they are stored on separate disks.  The settings in
postgresql.conf are the same, except for wal_sync_method -- it is set
to open_sync on Linux and open_datasync on Windows, because those are
the best for each platform.
The Linux version shows 1100 tps, but the Windows version shows only
450 tps.








Re: [HACKERS] What is the motivation of include directive and

2007-01-17 Thread Takayuki Tsunakawa
From: "Andrew Dunstan" <[EMAIL PROTECTED]>
> Meeting FHS requirements is no bad thing, though. And the ability to
> include a common configuration set in multiple instances is surely
> useful to a number of people. After all, you aren't forced to use
these
> facilities - I typically don't.

Thank you, Andrew-san.
What I want to know is something more concrete.

How useful are those facilities to what kind of users in what cases?
Is there a reason why users in the real world positively use those
facilities?

If I want to develop an add-on tool that manipulates PostgreSQL
settings even when PostgreSQL is stopped, should that tool support the
environments where include directive is used and config files are
placed outside the data directory?  If the need for include directive
and flexible config file placement is very low or vague, I thought I
could say "this tool does not support those environments".

Are any facilities (functions) provided in PostgreSQL that external
tool developers can use to read/write config files?  Suppose a user
created a database instance manually, with the config files as
follows:

[$PGDATA/postgresql.conf]
...
max_connections=10
include '/some_dir/some.conf'
include '/other_dir/other.conf'

[/some_dir/some.conf]
...
max_connections=50

[/other_dir/other.conf]
...
max_connections=100


If an external tool wants to determine the current value of
max_connections or modify the setting, what should the tool do?
Should the tool parse and write out the files by itself?
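
If it must, the following minimal sketch in C shows the rule such a
tool would have to reproduce: read files top to bottom, follow include
directives depth-first, and let the last assignment win (this mirrors
how the server reads postgresql.conf; the parsing here is deliberately
naive, relative include paths are not resolved, and the file path in
main() is hypothetical):

--
#include <stdio.h>
#include <string.h>

static char result[256];

/* Scan one file; recurse into include directives; remember the last
 * assignment to the target GUC (a later setting overrides an earlier
 * one, so max_connections would end up as 100 in the example above). */
static void
scan_file(const char *path, const char *guc)
{
    char line[1024];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return;
    while (fgets(line, sizeof(line), fp) != NULL)
    {
        char *p = line;

        while (*p == ' ' || *p == '\t')
            p++;
        if (strncmp(p, "include", 7) == 0)
        {
            char *q1 = strchr(p, '\'');
            char *q2 = q1 ? strchr(q1 + 1, '\'') : NULL;

            if (q1 && q2)
            {
                *q2 = '\0';
                scan_file(q1 + 1, guc);   /* depth-first, like the server */
            }
        }
        else if (strncmp(p, guc, strlen(guc)) == 0)
            strncpy(result, p, sizeof(result) - 1);
    }
    fclose(fp);
}

int
main(void)
{
    scan_file("/path/to/pgdata/postgresql.conf", "max_connections");
    fputs(result[0] ? result : "not set\n", stdout);
    return 0;
}
--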





Re: [HACKERS] Idea for fixing the Windows fsync problem

2007-01-17 Thread Takayuki Tsunakawa
From: "Tom Lane" <[EMAIL PROTECTED]>
> It's still not 100% bulletproof, because it's possible that some
> other backend is holding an open file in the database as a
> consequence of having had to dump some shared buffer for itself, but
> that should be pretty darn rare if the bgwriter is getting its job
> done.

I've understood that you are talking about the case where backends
have to evict dirty buffers containing data of a database they are not
connected to.  The problem is that the backends don't close the data
file after writing the dirty buffers.
Then, how about making the backends close the data files?  Concretely,
in FlushBuffer() in src/backend/storage/buffer/bufmgr.c, call
smgrclose() after smgrwrite() like this:

--
if (reln passed to FlushBuffer() was NULL &&
    reln->smgr_rnode.dbNode != MyDatabaseId)  /* is MyDatabaseId the
                                               * right way to get my
                                               * database's OID? */
    smgrclose(reln);
--


Or, to make the intention clearer, it may be better to add calls to
smgropen() and smgrclose() in succession after FlushBuffer() in
BufferAlloc().


BTW, fsync() is causing trouble here, in addition to the
load-distributed checkpoint problem that Itagaki-san has been
addressing, isn't it?  If fsync were eliminated by using O_SYNC as
commercial databases do, Tom-san wouldn't have had to make efforts to
solve this problem.
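
For concreteness, the change being talked about is tiny at the call
site.  A minimal sketch, assuming the real change would go through
mdopen() and the fd.c routines rather than a standalone helper like
this:

--
#include <fcntl.h>

/* Open a data file so that every write() reaches stable storage
 * before returning; no later fsync() of the file is then needed.
 * Hypothetical stand-alone helper, not PostgreSQL's actual md.c
 * code. */
int
open_sync_datafile(const char *path)
{
    return open(path, O_RDWR | O_SYNC, 0600);
}
--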








Re: [HACKERS] Idea for fixing the Windows fsync problem

2007-01-17 Thread Takayuki Tsunakawa
Hello, Stefan-san

> tom is talking about the postgresql distributed buildfarm:
>
> http://buildfarm.postgresql.org/cgi-bin/show_status.pl

Thank you for telling me.  This is a great system, isn't it?

- Original Message - 
From: "Stefan Kaltenbrunner" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
Cc: "Magnus Hagander" <[EMAIL PROTECTED]>;
; "Tom Lane" <[EMAIL PROTECTED]>
Sent: Wednesday, January 17, 2007 4:44 PM
Subject: Re: [HACKERS] Idea for fixing the Windows fsync problem


> Takayuki Tsunakawa wrote:
>> From: "Tom Lane" <[EMAIL PROTECTED]>
>>> I wrote:
>>>> I've committed a tentative patch along these lines to HEAD.
>>>> Please test.
>>> So I come home from dinner out, and find the buildfarm all red :-(
>>>
>>> I'm not sure why I didn't see this failure in my own testing, but
>>> in hindsight it's quite obvious that if the bgwriter is to take a
>>> hard line about fsync failures, it's got to be told about DROP
>>> DATABASE not only DROP TABLE --- that is, there has to be a
>>> signaling message for "revoke fsync requests across whole
>>> database".
>>>
>>
>> Excuse me if I misunderstand English and say something strange.
>> I thought "buildfarm is red" meant the failure of the regression
>> test.  What kind of errors did you get in what operation (e.g. DROP
>> INDEX)?  Is everyone able to see the test results freely?  Sorry,
>> I'll read the developers' FAQ when I have more time.
>
> tom is talking about the postgresql distributed buildfarm:
>
> http://buildfarm.postgresql.org/cgi-bin/show_status.pl
>
> and right now most of the members are indeed "red" due to the fsync
> related changes.
>
>
> Stefan
>





Re: [HACKERS] Idea for fixing the Windows fsync problem

2007-01-16 Thread Takayuki Tsunakawa
From: "Tom Lane" <[EMAIL PROTECTED]>
> I wrote:
>> I've committed a tentative patch along these lines to HEAD.  Please
>> test.
>
> So I come home from dinner out, and find the buildfarm all red :-(
>
> I'm not sure why I didn't see this failure in my own testing, but in
> hindsight it's quite obvious that if the bgwriter is to take a hard
> line about fsync failures, it's got to be told about DROP DATABASE
> not only DROP TABLE --- that is, there has to be a signaling message
> for "revoke fsync requests across whole database".
>

Excuse me if I misunderstand English and say something strange.
I thought "buildfarm is red" meant the failure of the regression test.
What kind of errors did you get in what operation (e.g. DROP INDEX)?
Is everyone able to see the test results freely?  Sorry, I'll read the
developers' FAQ when I have more time.
And I thought you meant you had a question: why doesn't bgwriter
report "checkpoint request failed" when a checkpoint occurs after
dropping a database?
No, a checkpoint after DROP DATABASE should not report "checkpoint
request failed."  When dropping a database, a checkpoint is forced
before removing the database files, so the bgwriter does not try to
open the database files for fsync.  I saw this in the following code
fragment in src/backend/commands/dbcommands.c:


 /*
  * On Windows, force a checkpoint so that the bgwriter doesn't hold
  * any open files, which would cause rmdir() to fail.
  */
#ifdef WIN32
 RequestCheckpoint(true, false);
#endif

 /*
  * Remove all tablespace subdirs belonging to the database.
  */
 remove_dbtablespaces(db_id);



> I think that it should not be necessary to invent a signal for "drop
> across tablespace", though, because we don't allow DROP TABLESPACE
> to remove any tables --- you've got to drop tables and/or databases
> to clean out the tablespace, first.  Anyone see a flaw in that?

I think you are right.

> BTW: what happens on Windows if we're trying to do the equivalent
> of "rm -rf database-dir" and someone is holding open one of the
> files in the directory?  Or has the directory itself open for
> readdir()?

And I wonder what happens if the Windows "copy" command is accessing
the data files when the bgwriter tries to open them for fsync, or the
reverse.  Would copy fail?  If so, it means that online backup would
sometimes fail.

And how about a checkpoint after ALTER TABLE/INDEX ... SET TABLESPACE?
When the bgwriter tries to open the table/index file, the original
file no longer exists.






Re: [HACKERS] What is the motivation of include directive and

2007-01-16 Thread Takayuki Tsunakawa
From: "Tom Lane" <[EMAIL PROTECTED]>
> "Takayuki Tsunakawa" <[EMAIL PROTECTED]> writes:
>> Still, I don't understand well why config files need to be placed
>> outside the data directory, except for daring conform to FHS.
>
> The killer argument for it is that most of what is in $PGDATA should
be
> excluded from your normal filesystem backup method, because you need
to
> be using some database-aware mechanism for backing up the database.
But
> the config files are perfectly suited for standard filesystem
backup,
> and indeed will *not* be covered by, say, pg_dumpall.  So putting
them
> somewhere else helps in creating a coherent backup strategy.

Thank you, I've understood the reason for the placement.  As I
supposed, it is provided mainly for placing config files in /etc to
allow frequent backup of the configuration of the whole system,
because most UNIX config files are stored in /etc (as the FHS
recommends) and /etc is small.






Re: [HACKERS] What is the motivation of include directive and

2007-01-16 Thread Takayuki Tsunakawa
From: "Bruce Momjian" <[EMAIL PROTECTED]>
> Takayuki Tsunakawa wrote:
>> In section "17.1. Setting Parameters", include directive is
described.
>> Why was this directive prepared?  What usage is assumed?  Is it for
>> GUI tools, or for placing custom parameters in other files?
>>
>> In section "17.2. File Locations", the following parameters are
>> described:
>> What demand is assumed for placing configuration files outside the
>> data directory?  Is it for placing configuration files in /etc to
>> conform to the FHS (Filesystem Hierarchy Standard, as documented
>> below) to enable the backup of all configuration files on the
system?
>
> It was designed for people who have multiple postmasters, but and
want
> to centralize some of that configuation.

Thank you.  But I'm still a bit unclear.
Do you mean those functionalities were designed for, e.g.,
organizations that provide hosting services and run multiple database
instances on a single large machine?  The image is:

/DataDirForCustomer1/postgresql.conf
[content]
include '/etc/postgresql_common.conf'
hba_file='/etc/pg_hba.conf'
...settings specific to customer 1

/DataDirForCustomer2/postgresql.conf
[content]
include '/etc/postgresql_common.conf'
hba_file='/etc/pg_hba.conf'
...settings specific to customer 2

/etc/postgresql_common.conf
[content]
...settings common among all customers

Was this mechanism invented in response to demand from users, or from
an idea of the PostgreSQL developers?

Still, I don't understand well why config files need to be placed
outside the data directory, except to conform to the FHS.





[HACKERS] What is the motivation of include directive and configuration files outside PGDATA

2007-01-16 Thread Takayuki Tsunakawa
Hello,

Let me ask about the background of configuration files.  I couldn't
find the relevant information in the 8.2 documentation.  I'm sorry to
cause you trouble.

In section "17.1. Setting Parameters", include directive is described.
Why was this directive prepared?  What usage is assumed?  Is it for
GUI tools, or for placing custom parameters in other files?

In section "17.2. File Locations", the following parameters are
described:

data_directory (string)
config_file (string)
hba_file (string)
ident_file (string)
external_pid_file (string)

What demand is assumed for placing configuration files outside the
data directory?  Is it for placing configuration files in /etc to
conform to the FHS (Filesystem Hierarchy Standard, as documented
below) to enable the backup of all configuration files on the system?

http://www.pathname.com/fhs/






Re: [HACKERS] Idea for fixing the Windows fsync problem

2007-01-16 Thread Takayuki Tsunakawa
From: "Tom Lane" <[EMAIL PROTECTED]>
 I suggested that here
> http://archives.postgresql.org/pgsql-hackers/2007-01/msg00642.php
> but have received no feedback about it ...

I'm sorry, I missed it.

From: "Tom Lane" <[EMAIL PROTECTED]>
> Magnus Hagander <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> So: maybe the solution is to add a step to the drop sequence,
>>> namely revoking any pending fsync request, before unlink.
>
>> Perhaps we could have the bgwriter check the queue *if* it gets the
>> ENOENT/EACCESS error and then re-check the queue for drops on that
>> file?
>
> I've committed a tentative patch along these lines to HEAD.  Please
> test.

I agree with Magnus-san's suggestion, too.
Though I'm willing to test, I'm not familiar with building on Windows
yet and do not have enough time for other work right now.  If someone
builds and gives me the new postgres.exe, I'll put it on my 8.2
installation and test.  Or, could anyone do the following?  These are
the steps I followed in yesterday's test.

1. Open two psql sessions.  Let me call them session1 and session2.

2. On session1, execute:

create table a (c int);
insert into a values(1);

3. On session2, execute:

select * from a;

4. On session1, execute:

drop table a;
checkpoint;

The checkpoint command here reported an error yesterday.  If Tom-san's
patch is effective, it should not fail, and no messages should be put
in the event log.






Re: [HACKERS] [GENERAL] Checkpoint request failed on version 8.2.1.

2007-01-15 Thread Takayuki Tsunakawa
From: "Magnus Hagander" <[EMAIL PROTECTED]>
> But yeah, that's probably a good idea. A quick look at the code says
we
> should at least ask people who have this problem to give it a run
with
> logging at DEBUG5 which should then log exactly what the errorcode
was.
> Or are you seeing more places that need such logging first?

I'm sorry, but we can't get the Win32 error code to be displayed.  I
got the following messages:

2007-01-16 09:24:48 DEBUG:  checkpoint starting
2007-01-16 09:24:48 ERROR:  could not open relation 1663/10819/18296:
Permission denied
2007-01-16 09:24:48 ERROR:  checkpoint request failed
2007-01-16 09:24:48 HINT:  Consult recent messages in the server log
for details.
2007-01-16 09:24:48 STATEMENT:  checkpoint;


The reason is that src/port/open.c does not use _dosmaperr().  It
converts the Win32 error code to errno directly, and EACCES is
converted from ERROR_ACCESS_DENIED only.  Mmm, we may have to
compromise, as Tom-san says.
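
For illustration only, the shape of the mapping being described might
look like this (an assumption about the code's structure, not a quote
of src/port/open.c):

--
#include <windows.h>
#include <errno.h>

/* Illustrative only: once the Win32 error is collapsed into errno,
 * "Permission denied" can stand for several distinct Win32
 * conditions, and the original code can no longer be logged. */
static void
map_open_error(void)
{
    switch (GetLastError())
    {
        case ERROR_ACCESS_DENIED:
            errno = EACCES;     /* the only source of EACCES */
            break;
        case ERROR_FILE_NOT_FOUND:
            errno = ENOENT;
            break;
        default:
            errno = EINVAL;
            break;
    }
}
--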

BTW, why does the bgwriter try to open and write the pages of already
dropped relations?  When dropping relations, DropRelFileNodeBuffers is
called to discard dirty buffers.  If the relation being dropped has
already been registered in the list of files to be fsynced, isn't it
possible to remove the file from the list before unlinking the file,
asking the bgwriter in a similar way to ForwardFsyncRequest()?
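
A hypothetical sketch of that suggestion (the request struct, the
FORGET_REQUEST kind, and the queue helper are my assumptions, not
PostgreSQL's actual API; RelFileNode stands in for the real type):

--
typedef struct RelFileNode     /* stand-in for PostgreSQL's RelFileNode */
{
    unsigned int spcNode, dbNode, relNode;
} RelFileNode;

typedef enum { FSYNC_REQUEST, FORGET_REQUEST } FsyncRequestKind;

typedef struct
{
    FsyncRequestKind kind;
    RelFileNode      rnode;
    unsigned int     segno;
} BgWriterRequest;

/* stand-in stub for the shared-memory queue the bgwriter drains,
 * i.e. the one ForwardFsyncRequest() fills */
static void
enqueue_bgwriter_request(const BgWriterRequest *req)
{
    (void) req;
}

/* to be called just before unlinking the relation's files, so the
 * bgwriter drops any pending fsync request instead of failing later */
static void
forget_fsync_requests(RelFileNode rnode)
{
    BgWriterRequest req = { FORGET_REQUEST, rnode, 0 };

    enqueue_bgwriter_request(&req);
}
--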

#
The timestamp at the head of each message is noisy, since the event
viewer has the time info, isn't it?  Besides, several PostgreSQL
messages appeared as one event log entry, separated by LF instead of
CR LF.  On Windows, CR LF should separate lines.







Re: [HACKERS] Dynamically sizing FSM?

2007-01-09 Thread Takayuki Tsunakawa
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
> Tom Lane <[EMAIL PROTECTED]> wrote:
>
>> I'm of the opinion that the solution to FSM being fixed-size is to
keep
>> it somewhere else, ie, on disk (possibly with some sort of cache in
>> shared memory for currently-used entries).
>
> What do you think dynamic allocation from shared_buffers? ie, remove
> a buffer page in the shared buffer pool and use the 8kB of memory
> for another purpose. To be sure, we don't free from
out-of-FSM-memory,
> but it can get rid of deciding the amount of FSM buffers.
> I think we could use the above as "shared memory allocator".
> It is useful for Dead Space Map, shared prepared statements, and so
on.

Yes! I'm completely in favor of Itagaki-san.  Separating the cache for
FSM may produce a new configuration parameter like fsm_cache_size,
which normal users would not desire (unless they enjoy a difficult
DBMS.)
I think that integrating the treatment of the space management
structure and the data area is good.  That means, for example,
implementing the "Free Space Table" described in section 14.2.2.1 of
Jim Gray's book "Transaction Processing: Concepts and Techniques",
though it may have been discussed in the PostgreSQL community long ago
(really?).  Of course, some refinements may be necessary to fit
PostgreSQL's concepts, say, creating one free space table file for
each data file to make the implementation easy.  It would reduce the
source code solely for FSM.

In addition, it would provide transactional space management.  If I
understand correctly, in the current implementation, updates to the
FSM are lost when the server crashes, aren't they?  The idea assumes
that the FSM will be rebuilt by vacuum because vacuum is inevitable.
If updates to the space management area were made transactional, it
might provide the infrastructure for a "vacuumless PostgreSQL."








Re: [HACKERS] Load distributed checkpoint

2007-01-08 Thread Takayuki Tsunakawa
Happy new year

From: "Simon Riggs" <[EMAIL PROTECTED]>
> Have you tried setting deadline scheduler on the WAL device and CFQ
on
> the data device? That should allow the I/Os to move through
different
> queues and prevent interference.

No, I've not tried yet.  Inaam-san told me that Linux has a few I/O
schedulers, but I'm not familiar with them.  I'll find information
about them (how to change the scheduler settings) and try the same
test.


- Original Message - 
From: "Simon Riggs" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
Cc: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>;

Sent: Thursday, December 28, 2006 7:07 AM
Subject: Re: [HACKERS] Load distributed checkpoint


> On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote:
>> Hello, Itagaki-san, all
>>
>> Sorry for my long mail.  I've had trouble sending this mail because
>> it's too long for pgsql-hackers to accept (I couldn't find out how
>> large a mail is accepted.)  So I'm trying to send it several times.
>> Please see the attachment for the content.
>
> Your results for fsync are interesting.
>
> I've noticed that a checkpoint seems to increase the activity on the
> WAL drive as well as increasing I/O wait times. That doesn't
> correspond to any real increase in WAL traffic I'm aware of.
>
> Have you tried setting the deadline scheduler on the WAL device and
> CFQ on the data device? That should allow the I/Os to move through
> different queues and prevent interference.
>
> -- 
>  Simon Riggs
>  EnterpriseDB   http://www.enterprisedb.com
>
>
>





Re: [HACKERS] Load distributed checkpoint

2006-12-25 Thread Takayuki Tsunakawa
From: "Bruce Momjian" <[EMAIL PROTECTED]>
> On an idle system, would someone dirty a large file, and watch the
disk
> I/O to see how long it takes for the I/O to complete to disk?

I ran "dd if=/dev/zero of= bs=8k count=`expr 1048576
/ 8`, that is, writing 1GB file with 8KB write()'s.  It took about 175
seconds for the kernel to flush buffers.  I'll put the result of
"iostat -x 5" which was started at the same time as dd.  175 seconds
means 35 rows with nonzero %util * 5 seconds of interval.

Device:  rrqm/s  wrqm/s  r/s  w/s  rsec/s  wsec/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sde  0.00  981.56  0.00  50.90  0.00  8439.28  0.00  4219.64  165.80  34.38  525.01  6.25  31.82
sde  0.00  25737.00  0.00  118.60  0.00  219820.80  0.00  109910.40  1853.46  6529.74  1587.08  8.43  100.02
sde  0.00  1912.97  0.00  127.74  0.00  16325.75  0.00  8162.87  127.80  8192.67  6502.98  7.81  99.82
sde  0.00  1728.00  0.00  117.80  0.00  14745.60  0.00  7372.80  125.17  8209.36  11227.29  8.49  100.02
sde  0.00  1536.00  0.00  103.80  0.00  13107.20  0.00  6553.60  126.27  8209.10  16729.09  9.64  100.02
sde  0.00  1344.00  0.00  91.60  0.00  11468.80  0.00  5734.40  125.21  8208.69  21481.22  10.92  100.02
sde  0.00  1532.80  0.00  98.00  0.00  13081.60  0.00  6540.80  133.49  8209.34  26583.57  10.21  100.02
sde  0.00  1632.00  0.00  107.00  0.00  13926.40  0.00  6963.20  130.15  8208.89  31662.93  9.35  100.02
sde  0.00  1536.00  0.00  106.40  0.00  13107.20  0.00  6553.60  123.19  8209.66  36443.07  9.40  100.02
sde  0.00  1442.89  0.00  95.59  0.00  12312.63  0.00  6156.31  128.81  8227.23  41446.78  10.49  100.24
sde  0.00  1532.80  0.00  103.00  0.00  13081.60  0.00  6540.80  127.01  8210.77  46606.03  9.71  100.04
sde  0.00  1440.00  0.00  92.00  0.00  12288.00  0.00  6144.00  133.57  8208.82  51421.78  10.87  100.02
sde  0.00  1344.00  0.00  91.80  0.00  11468.80  0.00  5734.40  124.93  8209.86  56524.37  10.90  100.02
sde  0.00  1539.08  0.00  101.00  0.00  13133.47  0.00  6566.73  130.03  8225.59  61477.93  9.92  100.22
sde  0.00  1436.80  0.00  95.40  0.00  12262.40  0.00  6131.20  128.54  8208.88  66566.42  10.48  100.02
sde  0.00  1344.00  0.00  92.40  0.00  11468.80  0.00  5734.40  124.12  8209.47  71466.12  10.82  100.02
sde  0.00  1102.40  0.00  101.00  0.00  9408.38  0.00  4704.19  93.15  8174.36  76538.41  9.88  99.82
sde  0.00  0.00  0.00  89.00  0.00  0.00  0.00  0.00  0.00  7855.72  80795.64  11.24  100.02
sde  0.00  0.00  0.00  91.38  0.00  0.00  0.00  0.00  0.00  7422.53  81823.89  10.97  100.22
sde  0.00  0.00  0.00  96.80  0.00  0.00  0.00  0.00  0.00  6919.07  83194.91  10.33  100.02
sde  0.00  0.00  0.00  91.78  0.00  0.00  0.00  0.00  0.00  6480.85  84657.04  10.92  100.22
sde  0.00  0.00  0.00  92.60  0.00  0.00  0.00  0.00  0.00  5997.22  84749.79  10.80  100.02
sde  0.00  0.00  0.00  91.02  0.00  0.00  0.00  0.00  0.00  5528.97  85345.07  10.97  99.82
sde  0.00  0.00  0.00  102.61  0.00  0.00  0.00  0.00  0.00  5059.61  85057.91  9.77  100.22
sde  0.00  0.00  0.00  93.20  0.00  0.00  0.00  0.00  0.00  4572.57  85284.49  10.73  100.02
sde  0.00  0.00  0.00  98.20  0.00  0.00  0.00  0.00  0.00  4109.15  86086.50  10.21  100.22
sde  0.00  0.00  0.00  91.42  0.00  0.00  0.00  0.00  0.00  3611.72  86405.24  10.92  99.82
sde  0.00  0.00  0.00  100.00  0.00  0.00  0.00  0.00  0.00  3135.62  86292.49  10.00  100.02
sde  0.00  0.00  0.00  100.40  0.00  0.00  0.00  0.00  0.00  2652.63  86609.79  9.96  100.02
sde  0.00  0.00  0.00  92.80  0.00  0.00  0.00  0.00  0.00  2153.69  86168.58  10.78  100.02
sde  0.00  0.00  0.00  88.80  0.00  0.00  0.00  0.00  0.00  1694.74  86275.58  11.26  100.02
sde  0.00  0.00  0.00  98.20  0.00  0.00  0.00  0.00  0.00  1241.41  86708.40  10.19  100.02
sde  0.00  0.00  0.00  95.20  0.00  0.00  0.00  0.00  0.00  747.52  86505.59  10.51  100.02
sde  0.00  0.00  0.00  89.20  0.00  0.00  0.00  0.00  0.00  283.44  86551.11  11.21  100.02
sde  0.00  51.20  0.00  17.20  0.00  449.60  0.00  224.80  26.14  4.30  61572.65  9.05  15.56
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Re: [HACKERS] Load distributed checkpoint

2006-12-25 Thread Takayuki Tsunakawa
Hello, Inaam-san,

> There are four IO schedulers in Linux: anticipatory, CFQ (default),
> deadline, and noop. For typical OLTP type loads generally deadline
> is recommended. If you are constrained on CPU and you have a good
> controller then it's better to use noop.
> Deadline attempts to merge requests by maintaining two red-black
> trees in sector sort order, and it also ensures that a request is
> serviced in a given time by using a FIFO. I don't expect it to do
> magic, but was wondering whether it may dilute the issue of fsync()
> elbowing out WAL writes.
> You can look into /sys/block/<device>/queue/scheduler to see which
> scheduler you are using.

Thank you for your information.
I could only find the following files in /sys/block/<device>/queue/:

iosched
max_hw_sectors_kb
max_sectors_kb
nr_requests
read_ahead_kb

In iosched, the following files exist:

quantum (the content is "4")
queued (the content is "8")


Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Takayuki Tsunakawa
> (3) is very strange. Your machine seems to be too restricted
> by WAL so that other factors cannot be measured properly.

Right... It takes as long as 15 seconds to fsync a 1GB file.  It's
strange.  This is a borrowed PC server, so the disks may be RAID 5?
However, the WAL disk and DB disks show the same throughput.  I'll
investigate.  I may have to find another machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)

Oh, Windows.  Maybe the fsync() problem Itagaki-san pointed out does
not exist there.
BTW, your env is showing attractive results, isn't it?

- Original Message - 
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
Cc: 
Sent: Friday, December 22, 2006 6:09 PM
Subject: Re: [HACKERS] Load distributed checkpoint


"Takayuki Tsunakawa" <[EMAIL PROTECTED]> wrote:

> (1) Default case (shown again for comparison and reminder)
> 235  80  226  77  240
> (2) Default + WAL 1MB case
> 302  328  82  330  85
> (3) Default + wal_sync_method=open_sync case
> 162  67  176  67  164
> (4) (2)+(3) case
> 322  350  85  321  84
> (5) (4) + /proc/sys/vm/dirty* tuning
> 308  349  84  349  84

(3) is very strange. Your machine seems to be too restricted
by WAL so that other factors cannot be measured properly.


I'll send results on my machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)
- shared_buffers=1GB
- wal_sync_method = open_datasync
- wal_buffers = 1MB
- checkpoint_segments = 16
- checkpoint_timeout = 5min

I repeated "pgbench -c16 -t500 -s50"
and picked up results around checkpoints.

[HEAD]
...
560.8
373.5 <- checkpoint is here
570.8
...

[with patch]
...
562.0
528.4 <- checkpoint (fsync) is here
547.0
...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center







Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Takayuki Tsunakawa
From: "Greg Smith" <[EMAIL PROTECTED]>
> This is actually a question I'd been meaning to throw out myself to
this
> list.  How hard would it be to add an internal counter to the buffer
> management scheme that kept track of the current number of dirty
pages?
> I've been looking at the bufmgr code lately trying to figure out how
to
> insert one as part of building an auto-tuning bgwriter, but it's
unclear
> to me how I'd lock such a resource properly and scalably.  I have a
> feeling I'd be inserting a single-process locking bottleneck into
that
> code with any of the naive implementations I considered.

To put it in an extreme way, how about making the bgwriter count the
dirty buffers by periodically scanning all the buffers?  Do you know
the book "Principles of Transaction Processing"?  Jim Gray was one of
the reviewers of this book.

http://www.amazon.com/gp/aa.html?HMAC=&CartId=&Operation=ItemLookup&&ItemId=1558604154&ResponseGroup=Request,Large,Variations&bStyle=aaz.jpg&MerchantId=All&isdetail=true&bsi=Books&logo=foo&Marketplace=us&AssociateTag=pocketpc

In chapter 8, the author describes fuzzy checkpoints combined with the
two-checkpoint approach.  In his explanation, the recovery manager
(which would be the bgwriter in PostgreSQL) scans the buffers and
records the list of dirty buffers at each checkpoint.  This won't need
any locking in PostgreSQL if I understand correctly.  Then, the
recovery manager performs the next checkpoint after writing those
dirty buffers.  In the two-checkpoint approach, crash recovery starts
redoing from the second-to-last checkpoint.  Two-checkpoint is
described in Jim Gray's book, too.  But they don't refer to how the
recovery manager tunes the speed of writing.
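
As a minimal sketch of that periodic scan (names follow PostgreSQL's
bufmgr internals; the locking shown is simplified, and this function
is my assumption, not existing code):

--
/* Count dirty pages with no global lock: only the per-buffer header
 * spinlock is taken, so concurrent backends are barely disturbed and
 * the result is just a fuzzy snapshot, which is all a checkpoint
 * pacing heuristic needs. */
int
CountDirtyBuffers(void)
{
    int i;
    int ndirty = 0;

    for (i = 0; i < NBuffers; i++)
    {
        volatile BufferDesc *buf = &BufferDescriptors[i];

        LockBufHdr(buf);
        if (buf->flags & BM_DIRTY)
            ndirty++;
        UnlockBufHdr(buf);
    }
    return ndirty;
}
--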


> slightly different from the proposals here.  What if all the
> database page writes (background writer, buffer eviction, or
> checkpoint scan) were counted and periodic fsync requests sent to
> the bgwriter based on that?
> For example, when I know I have a battery-backed caching controller
> that will buffer 64MB worth of data for me, if I forced a fsync
> after every 6000 8K writes, no single fsync would get stuck waiting
> for the disk to write for longer than I'd like.

That seems interesting.

> You can do sync
> writes with perfectly good performance on systems with a good
> battery-backed cache, but I think you'll get creamed in comparisons
> against MySQL on IDE disks if you start walking down that path;
> since right now a fair comparison with similar logging behavior is
> an even match there, that's a step backwards.

I wonder what characteristics SATA disks have compared to IDE.  Recent
PCs are equipped with SATA disks, aren't they?
How do you feel your approach compares to MySQL on IDE disks?

> Also on the topic of sync writes to the database proper:  wouldn't
> using O_DIRECT for those be potentially counter-productive?  I was
> under the impression that one of the behaviors counted on by
> Postgres was that data evicted from its buffer cache, eventually
> intended for writing to disk, was still kept around for a bit in the
> OS buffer cache.  A subsequent read because the data was needed
> again might find the data already in the OS buffer, therefore
> avoiding an actual disk read; that substantially reduces the typical
> penalty for the database engine making a bad choice on what to
> evict.  I fear a move to direct writes would put more pressure on
> the LRU implementation to be very smart, and that's code that you
> really don't want to be more complicated.

I'm worried about this, too.






Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Takayuki Tsunakawa
From: Inaam Rana
> Which IO Scheduler (elevator) are you using?

Elevator?  Sorry, I'm not familiar with the kernel implementation, so
I don't know what it is.  My Linux distribution is Red Hat Enterprise
Linux 4.0 for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp.  I
probably haven't changed any kernel settings, except for the IPC
settings to run PostgreSQL.





Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
> (5) (4) + /proc/sys/vm/dirty* tuning
> dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
> changed from 40 to 4.
>
> 308  349  84  349  84

Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showed the following tps for case (5).

323  350  340  59  225

The best response time was 4 msec, and the worst one was 16 seconds.


- Original Message - 
From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
To: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
Cc: 
Sent: Friday, December 22, 2006 3:20 PM
Subject: Re: [HACKERS] Load distributed checkpoint


> Hello, Itagaki-san,
>
> Thank you for an interesting piece of information.
>
> From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
>> If you use linux, try the following settings:
>>  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
>>  2. Increase wal_buffers to reduce WAL flushing.
>>  3. Set wal_sync_method to open_sync; O_SYNC is faster than
>>     fsync().
>>  4. Separate data and WAL files into different partitions or disks.
>>
>> I suppose 1 is important for you, because the kernel will not write
>> dirty buffers until 10% of buffers become dirty in the default
>> settings.  You have large memory (8GB), but a small data set
>> (800MB). So the kernel almost never writes buffers except in
>> checkpoints.  Accumulated dirty buffers are written in a burst in
>> fsync().
>
> I'll show the results of this tuning to share information with
> people who don't have experience of this kind.
> The numbers shown below are the tps when running "pgbench -c16 -t100
> postgres" five times in succession.
>
> (1) Default case (shown again for comparison and reminder)
> The bgwriter_* and checkpoint_* are set to their defaults.
> wal_buffers and wal_sync_method are also set to their defaults (64kB
> and fdatasync respectively.)
>
> 235  80  226  77  240
>
>
> (2) Default + WAL 1MB case
> The configuration is the same as case (1) except that wal_buffers is
> set to 1024kB.
>
> 302  328  82  330  85
>
> This is a bigger improvement than I expected.
>
>
> (3) Default + wal_sync_method=open_sync case
> The configuration is the same as case (1) except that
> wal_sync_method is set to open_sync.
>
> 162  67  176  67  164
>
> Too bad compared to case (2).  Do you know the reason?
>
>
> (4) (2)+(3) case
>
> 322  350  85  321  84
>
> This is good, too.
>
>
> (5) (4) + /proc/sys/vm/dirty* tuning
> dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
> changed from 40 to 4.
>
> 308  349  84  349  84
>
> The tuning of the kernel cache doesn't appear to bring performance
> improvement in my env.  The kernel still waits too long before it
> starts flushing dirty buffers because the cache is large?  If so,
> increasingly available RAM may cause trouble more frequently in the
> near future.  Do the dirty_*_ratio settings accept values less than
> 1?
>
> BTW, in case (1), the best response time of a transaction was 6
> milliseconds.  On the other hand, the worst response time was 13
> seconds.
>
>
>> We would be happy if we would be free from a difficult combination
>> of tuning. If you have *ideas for improvements*, please suggest
>> them.  I think we've already understood the *problem itself*.
>
> I agree with you.  Let's make the ideas more concrete, doing some
> experimentation.





Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
Hello, Itagaki-san,

Thank you for an interesting piece of information.

From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
> If you use linux, try the following settings:
>  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
>  2. Increase wal_buffers to redule WAL flushing.
>  3. Set wal_sync_method to open_sync; O_SYNC is faster then fsync().
>  4. Separate data and WAL files into different partitions or disks.
>
> I suppose 1 is important for you, because kernel will not write
dirty
> buffers until 10% of buffers become dirty in default settings.
> You have large memory (8GB), but small data set (800MB). So kernel
> almost never writes buffers not in checkpoints. Accumulate dirty
buffers
> are written at a burst in fsync().

I'll show the results of this tuning to share information with people
who don't have experience of this kind.
The numbers shown below are the tps when running "pgbench -c16 -t100
postgres" five times in succession.

(1) Default case (shown again for comparison and reminder)
The bgwriter_* and checkpoint_* are set to their defaults.
wal_buffers and wal_sync_method are also set to their defaults (64kB
and fdatasync respectively.)

235  80  226  77  240


(2) Default + WAL 1MB case
The configuration is the same as case (1) except that wal_buffers is
set to 1024kB.

302  328  82  330  85

This is a bigger improvement than I expected.


(3) Default + wal_sync_method=open_sync case
The configuration is the same as case (1) except that wal_sync_method
is set to open_sync.

162  67  176  67  164

Too bad compared to case (2).  Do you know the reason?


(4) (2)+(3) case

322  350  85  321  84

This is good, too.


(5) (4) + /proc/sys/vm/dirty* tuning
dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
changed from 40 to 4.

308  349  84  349  84

The tuning of the kernel cache doesn't appear to bring performance
improvement in my env.  The kernel still waits too long before it
starts flushing dirty buffers because the cache is large?  If so,
increasingly available RAM may cause trouble more frequently in the
near future.  Do the dirty_*_ratio settings accept values less than 1?

BTW, in case (1), the best response time of a transaction was 6
milliseconds.  On the other hand, the worst response time was 13
seconds.


> We would be happy if we would be free from a difficult combination
> of tuning. If you have *ideas for improvements*, please suggest
> them.  I think we've already understood the *problem itself*.

I agree with you.  Let's make the ideas more concrete, doing some
experimentation.






Re: [HACKERS] [PATCHES] Load distributed checkpoint patch

2006-12-21 Thread Takayuki Tsunakawa
Hello, Mr. Grittner,

From: "Kevin Grittner" <[EMAIL PROTECTED]>
> We have 3,000 "directly connected" users, various business partner
> interfaces, and public web entry doing OLTP in 72 databases
distributed
> around the state, with real-time replication to central databases
which
> are considered derived copies.

What a big system you have.

>   If all the pages modified on the central
> databases were held in buffers or cache until after peak hours,
> query performance would suffer -- assuming it would all even fit in
> cache.  We must have a way for dirty pages to be written under load
> while responding to hundreds of thousands of queries per hour
> without disturbing "freezes" during checkpoints.

I agree with you.  My words were not good.  I consider it necessary to
always advance checkpoints, even under heavy load, while taking care
of OLTP transactions.

> I raise this only to be sure that such environments are considered
> with these changes, not to discourage improvements in the checkpoint
> techniques.  We have effectively eliminated checkpoint problems in
> our environment with a combination of battery backed controller
> cache and aggressive background writer configuration.  When you have
> a patch which seems to help those who still have problems, I'll try
> to get time approved to run a transaction replication stream onto
> one of our servers (in "catch up mode") while we do a web "stress
> test" by playing back requests from our production log.  That should
> indicate how the patch will affect us.

Thank you very much for your kind offer.





Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
- Original Message - 
From: "Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>; "ITAGAKI
Takahiro" <[EMAIL PROTECTED]>
> > Yes, I used half the size of RAM as the shared buffers, which is
> > reasonable.  And I cached all the data.

> For pg, half RAM for shared_buffers is too much. The ratio is good
for
> other db software, that does not use the OS cache.

What percentage of RAM is recommended for shared buffers in general?
40%?  30%?  Or is the general recommendation like "According to the
amount of your data, this much RAM should be left for the kernel
cache.  But that's the story on Linux.  It may be different for other
OSes."?
Hmm, if it is so, it sounds hard for system designers/administrators
to judge.




- Original Message - 
From: "Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>; "ITAGAKI
Takahiro" <[EMAIL PROTECTED]>
Cc: 
Sent: Thursday, December 21, 2006 11:04 PM
Subject: RE: [HACKERS] Load distributed checkpoint



> > You were running the test on a very memory-dependent machine.
> >> shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
> > That would be why the patch did not work. I tested it with DBT-2,
> > 10GB of data and 2GB of memory. Storage is always the main part of
> > performance here, even when not in checkpoints.
>
> Yes, I used half the size of RAM as the shared buffers, which is
> reasonable.  And I cached all the data.

For pg, half RAM for shared_buffers is too much. The ratio is good for
other db software, that does not use the OS cache.

Andreas





Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
Cc: 
Sent: Thursday, December 21, 2006 6:46 PM
Subject: Re: [HACKERS] Load distributed checkpoint


>
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
> "Takayuki Tsunakawa" <[EMAIL PROTECTED]> wrote:
>> Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
>> waiting for lseek() to complete when it committed.  But why does
the
>> backend which is syncing WAL/pg_control have to wait for syncing
the
>> data file?  They are, not to mention, different files, and WAL and
>> data files are stored on separate disks.
>
> Backends call lseek() in planning, so they have to wait fsync() to
> the table that they will access. Even if all of data in the file is
in
> the cache, lseek() conflict with fsync(). You can see a lot of
backends
> are waiting in planning phase in checkpoints, not executing phase.

I see.  I found one backend like the following -- but only one out of
16 backends in my case.  Most of the others are waiting to acquire the
WALWriteLock.

#0  0x003a629c6902 in __lseek_nocancel () from
/lib64/tls/libc.so.6
#1  0x0056789f in FileSeek ()
#2  0x00574053 in mdnblocks ()
#3  0x00574f4a in smgrnblocks ()
#4  0x005489e8 in estimate_rel_size ()
#5  0x00548bee in get_relation_info ()
#6  0x0054aa3d in build_simple_rel ()
#7  0x00539c6b in add_base_rels_to_query ()
#8  0x0053b955 in query_planner ()
#9  0x0053c1c9 in grouping_planner ()
#10 0x0053d3b4 in subquery_planner ()
#11 0x0053d5b3 in planner ()
#12 0x005778fc in pg_plan_query ()
#13 0x0057798c in pg_plan_queries ()
#14 0x00577c53 in exec_simple_query ()






Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
> You were running the test on a very memory-dependent machine.
>> shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
> That would be why the patch did not work. I tested it with DBT-2,
> 10GB of data and 2GB of memory. Storage is always the main part of
> performance here, even when not in checkpoints.

Yes, I used half the size of RAM as the shared buffers, which is
reasonable.  And I cached all the data.  The effect of fsync() is the
heavier offence, isn't it?  System administrators would say "I have
enough memory.  The data hasn't exhausted the DB cache yet.  But the
users complain to me about the response.  Why?  What should I do?
What?  Checkpoint??  Why doesn't PostgreSQL take care of frontend
users?"
BTW, is DBT-2 an OLTP benchmark which randomly accesses some parts of
the data, or a batch application which accesses all the data?  I'm not
familiar with it.  I know that IPA makes it available to the public.

> If you use Linux, it has very unpleasant behavior in fsync(); it
> locks all metadata of the file being fsync-ed. We have to wait for
> the completion of fsync when we do read(), write(), and even
> lseek().
> Almost all of your data is in the accounts table and it was stored
> in a single file. All transactions must wait for fsync to the single
> largest file, so you saw the bottleneck was in the fsync.

Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed.  But why does the
backend which is syncing WAL/pg_control have to wait for syncing the
data file?  They are, not to mention, different files, and WAL and
data files are stored on separate disks.


>> [Conclusion]
>> I believe that the problem cannot be solved in a real sense by
>> avoiding fsync/fdatasync().
>
> I think so, too. However, I assume we can resolve a part of the
> checkpoint spikes with smoothing of write() alone.

First, what's the goal (numerically, if possible)?  Have you explained
to community members why the patch would help many people?  At least,
I haven't heard that fsync() can be seriously bad and that we should
close our eyes to what fsync() does.
By the way, what good results did you get with DBT-2?  If you don't
mind, can you show us?


> BTW, can we use the same way for fsync? We call fsync()s on all
> modified files without rest in mdsync(), but it's not difficult at
> all to insert sleeps between fsync()s. Do you think it would help
> us? One of the issues is that we have to sleep in file units, which
> is maybe a rough granularity.

No, it definitely won't help us.  There is no reason why it would
help.  It might help in some limited environments, though; how can we
characterize such environments?  Can we say "our approach helps our
environments, but it won't help you.  The kernel VM settings may help
you.  Good luck!"?
We have to consider this seriously.  I think it's time to face the
problem, and we should follow the approaches of experts like Jim Gray
and the DBMS vendors, unless we have a new clever idea like theirs.





Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Takayuki Tsunakawa
On 12/20/06, Takayuki Tsunakawa <[EMAIL PROTECTED]> wrote:
> > [Conclusion]
> > I believe that the problem cannot be solved in a real sense by
> > avoiding fsync/fdatasync().  We can't ignore what commercial
> > databases have done so far.  The kernel does as much as he likes
> > when PostgreSQL requests him to fsync().


From: Inaam Rana
> I am new to the community and am very interested in the tests that
> you have done. I am also working on resolving the sudden IO spikes
> at checkpoint time. I agree with you that fsync() is the core issue
> here.

Thank you for understanding my bad English correctly.  Yes, what I've
been insisting is that it is necessary to avoid fsync()/fdatasync()
and to use O_SYNC (plus O_DIRECT if supported on the target platform)
to really eliminate the big spikes.
In my mail, the following sentence contained a small mistake.

"I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync()."

The correct sentence is:

"I believe that the problem cannot be solved in a real sense without
avoiding fsync/fdatasync()."


> Being a new member I was wondering if someone on this list has done
> testing with O_DIRECT and/or O_SYNC for datafiles, as that seems to
> be the most logical way of dealing with the fsync() flood at
> checkpoint time. If so, I'll be very interested in the results.

Could you look at the mail I sent on Dec 18?  Its content was so long
that I zipped the whole content and attached it to the mail.  I just
performed the same test, simply adding O_SYNC to open() in mdopen() and
another function in md.c.  I couldn't succeed in running with O_DIRECT,
because O_DIRECT requires the shared buffers to be aligned on a
sector-size boundary.  To perform the O_DIRECT test, a little more
modification is necessary in the code where the shared buffers are
allocated.
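
For illustration, the alignment requirement means the buffers must come
from an aligned allocator such as posix_memalign().  A minimal
standalone sketch (not PostgreSQL code; the 4096-byte alignment and the
file name are assumptions):

#define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 4096          /* safe for most block devices */
#define BLCKSZ    8192          /* PostgreSQL page size */

int main(void)
{
    void *page;
    int fd;

    /* O_DIRECT rejects a write unless the buffer address, file offset
       and length are suitably aligned */
    if (posix_memalign(&page, ALIGNMENT, BLCKSZ) != 0)
        return 1;
    memset(page, 0, BLCKSZ);

    fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    /* an ordinary malloc()ed buffer here would typically fail with
       EINVAL */
    if (write(fd, page, BLCKSZ) != BLCKSZ)
        perror("write");
    close(fd);
    free(page);
    return 0;
}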
The result was bad.  But that's just a starting point.  We need some
improvements that commercial databases have done.  I think some approaches
we should take are:

(1) two-checkpoint (described in Jim Gray's textbook "Transaction
Processing: Concepts and Techniques")
(2) what Oracle suggests in its manual (see my previous mails)
(3) write multiple contiguous buffers with one write() to decrease the
count of write() calls (a rough sketch of this follows below)
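
Idea (3) could be sketched as follows: sort the dirty buffers by block
number and hand each run of consecutive blocks to the kernel in a
single call.  The function and the MAX_RUN cap below are mine, not
PostgreSQL's, and the sorting and run detection are left out:

#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ  8192
#define MAX_RUN 16              /* arbitrary cap for this sketch */

/* Write up to MAX_RUN buffer pages that map to consecutive file blocks
   starting at blocknum, with one system call instead of one write()
   per page. */
static ssize_t
write_contiguous_run(int fd, char *pages[], int npages, long blocknum)
{
    struct iovec iov[MAX_RUN];
    int n = (npages < MAX_RUN) ? npages : MAX_RUN;
    int i;

    for (i = 0; i < n; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }
    if (lseek(fd, (off_t) blocknum * BLCKSZ, SEEK_SET) < 0)
        return -1;
    return writev(fd, iov, n);
}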

> As mentioned in this thread, a single bgwriter with O_DIRECT will not
> be able to keep pace with the cleaning effort, causing backend writes.
> I think (i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT
> can resolve this issue.

I agree with you.  Oracle provides a parameter called DB_WRITER_PROCESSES
to set the number of database writer processes.  Oracle also provides
asynchronous I/O to solve the problem you are describing.  Please see
section 10.3.9 of the following page:

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14211/instance_tune.htm#sthref1049

> Talking of bgwriter_* parameters I think we are missing a crucial
> internal counter i.e. number of dirty pages. How much work bgwriter
> has to do at each wakeup call should be a function of total buffers
> and currently dirty buffers. Relying on both these values instead of
> just one static NBuffers should allow bgwriter to adapt more quickly
> to workload changes and ensure that not much work is accumulated for
> checkpoint.

I agree with you in the sense that the current bgwriter is a bit
careless about the system load.  I believe that PostgreSQL should be
more gentle to OLTP transactions -- and, as a result, to the many users
of the system.  I think the speed of WAL accumulation should also be
taken into account.  Let's list the problems and ideas.
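
In code terms, the counter idea might be as simple as the following
sketch (the names and the min/max bounds are hypothetical, not existing
parameters):

/* Scale the bgwriter's per-round work by the dirty fraction of the
   buffer pool instead of a fixed percentage of NBuffers. */
static int
pages_to_write_this_round(int nbuffers, int ndirty,
                          int min_pages, int max_pages)
{
    /* write more aggressively as the pool gets dirtier */
    int target = (int) ((double) max_pages * ndirty / nbuffers);

    return (target < min_pages) ? min_pages : target;
}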

-- 


Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Takayuki Tsunakawa
> That implies that fsyncing a datafile blocks fsyncing the WAL. That
> seems terribly unlikely (although...). What OS/Kernel/Filesystem is
> this? I note a sync bug in linux for ext3 that may have relevance.

Oh, really?  What bug?  I've heard that ext3 reports wrong data to
iostat when it performs writes (the data is correct when performing
reads).
My env is:

OS: RHEL 4.0 for AMD64/EM64T
kernel: 2.6.9-42.ELsmp
The file system is ext3.

Terribly unlikely?  But I've seen that disk utilization pattern quite often.




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PATCHES] Load distributed checkpoint patch

2006-12-20 Thread Takayuki Tsunakawa
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
> Bruce Momjian <[EMAIL PROTECTED]> wrote:
>> Do you use the same delay autovacuum uses?
>
> What do you mean 'the same delay'? Autovacuum does VACUUM, not
> CHECKPOINT. If you think cost-based-delay, I think we cannot use it
> here. It's hard to estimate how much checkpoints delay by cost-based
> sleeping, but we should finish asynchronous checkpoints by the start
> of the next checkpoint. So I gave priority to punctuality over load
> smoothing.

I consider that smoothing the load (more meaningfully, the response
time) has higher priority than checkpoint punctuality in a practical
sense, because the users of a system benefit from good steady response
and give a good reputation to the system.  If the checkpoint processing
is not punctual, crash recovery would take a longer time.  But which
would you give higher priority, the unlikely event (= a crash of the
system) or the likely event (= the peak hours of the system)?  I
believe the latter should be regarded.  The system can write dirty
buffers after the peak hours pass.  User experience should be taken
much care of.



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Takayuki Tsunakawa
Hello, Itagaki-san, all

I have to report a sad result: your patch didn't work.  Let's consider
the solution together.  What you are addressing is very important for
system designers in the real world -- smoothing response times.

Recall that unpatched PostgreSQL showed the following tps's in case
(1) (i.e. with the default bgwriter_* and checkpoint_* settings):

235  80  226 77  240

The patched PostgreSQL showed the following tps's:

230  228  77  209  66

[disk usage]
The same tendency can be seen as with the unpatched PostgreSQL.  That
is: when the tps is low, the %util of the disk for data files is high,
and the %util of the disk for WAL is low.  Why is transaction logging
disturbed by the cleaning and/or syncing activity?  While the bgwriter
is fsync()ing, it does not lock any data structures that the
transactions want to access.  Even though they share the same SCSI
controller and bus, they are different disks.  The bandwidth does not
appear to be exhausted, since Ultra320 is said to deliver about 256MB/s
in practice.
(Recall that WAL is on sdd and the data files are on sde.)

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s    wsec/s   rkB/s     wkB/s  avgrq-sz  avgqu-sz     await   svctm  %util
sdd        0.00  810.78   0.00  102.20     0.00   7306.99    0.00   3653.49     71.50      1.12     10.95    7.32  74.77
sde        0.00   25.35   0.00    6.19     0.00    252.30    0.00    126.15     40.77      0.50     81.32    5.94   3.67
sdd        0.00  884.20   0.00  126.00     0.00   8080.00    0.00   4040.00     64.13      1.26     10.00    7.11  89.64
sde        0.00   21.40   0.00    5.00     0.00    211.20    0.00    105.60     42.24      0.31     62.56    6.52   3.26
sdd        0.00  924.80   0.00  116.20     0.00   8326.40    0.00   4163.20     71.66      1.23     10.59    7.37  85.64
sde        0.00   27.60   0.00   26.60     0.00    433.60    0.00    216.80     16.30      4.24    159.29    2.44   6.50
sdd        0.00  721.20   0.00  102.40     0.00   6588.80    0.00   3294.40     64.34      0.99      9.71    7.07  72.40
sde        0.00 1446.80   0.00  101.60     0.00  20289.60    0.00  10144.80    199.70   1192.40    572.45    2.29  23.30
sdd        0.00    0.00   0.00    0.20     0.00      1.60    0.00      0.80      8.00      0.11    539.00  539.00  10.80
sde        0.00    0.00   0.00  452.10     0.00      0.00    0.00      0.00      0.00   3829.57   3715.83    2.22 100.22
sdd        0.00    0.00   0.00    0.00     0.00      0.00    0.00      0.00      0.00      0.00      0.00    0.00   0.00
sde        0.00    0.00   0.00  349.80     0.00      0.00    0.00      0.00      0.00   1745.52   8515.74    2.86 100.02
sdd        0.00  442.40   0.00   51.00     0.00   3948.80    0.00   1974.40     77.43      0.60     11.73    7.54  38.46
sde        0.00    2.80   0.00  184.00     0.00     25.60    0.00     12.80      0.14    277.52  12629.41    3.19  58.74
sdd        0.00  898.00   0.00  124.80     0.00   8182.40    0.00   4091.20     65.56      1.30     10.40    7.24  90.30
sde        0.00   19.20   0.00    3.80     0.00    184.00    0.00     92.00     48.42      0.24     62.11   14.11   5.36
sdd        0.00  842.28   0.00  109.02     0.00   7612.02    0.00   3806.01     69.82      1.33     12.26    8.35  91.02
sde        0.00   45.49   0.00   46.89     0.00    739.08    0.00    369.54     15.76      9.04    192.73    3.38  15.85
sdd        0.00 1198.41   0.00   71.51     0.00  10505.18    0.00   5252.59    146.90    128.19     99.76   13.48  96.43
sde        0.00 1357.77   0.00  199.80     0.00  19263.75    0.00   9631.87     96.41   2251.09   1179.42    2.39  47.81
sdd        0.00    0.00   0.00    7.20     0.00      0.00    0.00      0.00      0.00    203.87   5671.83  138.92 100.02
sde        0.00    0.00   0.00  409.60     0.00      0.00    0.00      0.00      0.00   3171.04   4779.83    2.44 100.02
sdd        0.00    0.00   0.00   17.80     0.00      0.00    0.00      0.00      0.00    137.87  10240.90   56.19 100.02
sde        0.00    0.00   0.00  240.60     0.00      0.00    0.00      0.00      0.00   1573.85   9815.29    4.16 100.02
sdd        0.00  109.80   0.00   35.40     0.00   1012.80    0.00    506.40     28.61     42.14   7974.47   27.86  98.64
sde        0.00    2.80   0.00  198.80     0.00     30.40    0.00     15.20      0.15    428.49  14474.39    4.30  85.56
sdd        0.00  466.20   0.00   62.80     0.00   4230.40    0.00   2115.20     67.36      0.59      9.49    6.79  42.62
sde        0.00    5.20   0.00    0.80     0.00     48.00    0.00     24.00     60.00      0.01     16.25   11.25   0.90
sdd        0.00    0.00   0.00    0.20     0.00      1.60    0.00      0.80      8.00      0.01     35.00   35.00   0.70
sde        0.00    0.00   0.00    0.00     0.00      0.00    0.00      0.00      0.00      0.00      0.00    0.00   0.00


I suspect that fsync() is the culprit, as I've been afraid.  I'll show
you some interesting data: I collected the stack traces of backend
processes while a checkpoint was happening.


[bgwriter]
Oh, he is fsync()ing hard.

#0  0x003a629bfbb2 in __fsync_nocancel () from
/lib64/tls/libc.so.6
#1  0x005742a1 in mdsync ()
#2  0x005753d7 in smgrsync ()
#3  0x00564d65 in FlushBufferPool ()
...

[some backends]
They are forced to wait for some lock pertaining to WAL when they try
to insert a log record.

#0

Re: [HACKERS] Load distributed checkpoint

2006-12-19 Thread Takayuki Tsunakawa
Hello, Itagaki-san

> I posted a patch to PATCHES. Please try it out.

Really!?  I've just joined pgsql-patches.  When did you post it,
yesterday?  I couldn't find the patch on the following page, which
lists the mails to pgsql-patches of this month:

http://archives.postgresql.org/pgsql-patches/2006-12/index.php

Could you send me the patch if it has not already been registered on
the above page?  I want to try the patch by all means, because
smoothing response times is very important so that we can recommend
PostgreSQL to system designers.  I really hope your patch will be the
real solution.

> It does write() smoothly, but fsync()s in a burst.
> I suppose the result will be between (3) and (5).

Hmm...  I think some logical reasoning is needed to win the
understanding of the community members (sorry if they have already
agreed).  Excuse me for repeating myself, but I'm afraid fsync() will
turn evil at some point in some environments.  Success in one test
environment is not real success.




---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Question about debugging bootstrapping and catalog entries

2006-12-18 Thread Takayuki Tsunakawa
From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
> How about mimicking postgres with a script that starts gdb to run
> postgres?  That is, rename the original postgres module to
> postgres.org and create a shell script named postgres like this:
>
> #!/bin/bash
> gdb postgres $*

Sorry, this should be "gdb postgres.org $*".  (Strictly speaking,
'exec gdb --args postgres.org "$@"' would be even better: --args makes
gdb pass the arguments to the program instead of interpreting them
itself.)

- Original Message - 
From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
To: "Gregory Stark" <[EMAIL PROTECTED]>; "PostgreSQL Hackers"

Sent: Tuesday, December 19, 2006 9:37 AM
Subject: Re: [HACKERS] Question about debugging bootstrapping and
catalog entries


> Hello, Mr. Stark
>
>> Are there any tricks people have for debugging bootstrapping
>> processing? I just need to know what index it's trying to build here
>> and that should be enough to point me in the right direction:
>
> As Mr. Lane says, it would be best to be able to make postgres sleep
> for an arbitrary time.  The direction may be either a command-line
> option or an environment variable (like BOOTSTRAP_SLEEP) or both.  I
> think the env variable is easy to handle in this case.
>
> How about mimicking postgres with a script that starts gdb to run
> postgres?  That is, rename the original postgres module to
> postgres.org and create a shell script named postgres like this:
>
> #!/bin/bash
> gdb postgres $*
>
> Tell me if it works.
>
>
>
>
>
>
> ---(end of
broadcast)---
> TIP 7: You can help support the PostgreSQL project by donating at
>
>http://www.postgresql.org/about/donate
>



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Question about debugging bootstrapping and catalog entries

2006-12-18 Thread Takayuki Tsunakawa
Hello, Mr. Stark

> Are there any tricks people have for debugging bootstrapping
> processing? I just need to know what index it's trying to build here
> and that should be enough to point me in the right direction:

As Mr. Lane says, it would be best to be able to make postgres sleep
for an arbitrary time.  The direction may be either a command-line
option or an environment variable (like BOOTSTRAP_SLEEP) or both.  I
think the env variable is easy to handle in this case.
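
The env-variable variant could be as small as this sketch
(BOOTSTRAP_SLEEP is only a proposed name, and the call site in the
bootstrap code is omitted):

#include <stdlib.h>
#include <unistd.h>

/* Pause early in startup when the proposed BOOTSTRAP_SLEEP environment
   variable is set, so a debugger can be attached before the
   interesting code runs. */
static void
maybe_sleep_for_debugger(void)
{
    const char *val = getenv("BOOTSTRAP_SLEEP");

    if (val != NULL)
        sleep((unsigned int) atoi(val));
}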

How about mimicking postgres with a script that starts gdb to run
postgres?  That is, rename the original postgres module to
postgres.org and create a shell script named postgres like this:

#!/bin/bash
gdb postgres $*

Tell me if it works.






---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Is there any limitation on the size of a mail?

2006-12-17 Thread Takayuki Tsunakawa
Hello, Mr. Lane

> "Takayuki Tsunakawa" <[EMAIL PROTECTED]> writes:
>> But none has appeared on pgsql-hackers ML yet.  What's wrong?
>> One thing I worry about is the size.  The size of my mail is 42KB.  It
>> has only text and no attachment.  Is there any limitation on size?
>
> Yes.  Consider gzip ... or if appropriate, post to pgsql-patches,
> which has a higher limit.

Thank you for your advice.  I've just succeeded in sending my mail;
the 22KB version was accepted.



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] Is there any limitation on the size of a mail?

2006-12-17 Thread Takayuki Tsunakawa
Hello,

Sorry for this noisy mail.  If there is a more appropriate address to
send to, please tell me.

I sent one mail about load-distributed checkpoint three times on the
following dates as I couldn't see the mail on the ML:

2006/12/16 17:53
2006/12/18 9:07
2006/12/18 12:10

But none has appeared on pgsql-hackers ML yet.  What's wrong?
One thing I worry about is the size.  The size of my mail is 42KB.  It
has only text and no attachment.  Is there any limitation on size?

I'm sending this as a test too.





---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Load distributed checkpoint

2006-12-13 Thread Takayuki Tsunakawa
Hello,

From: "Jim C. Nasby" <[EMAIL PROTECTED]>
> Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track
> of what buffer the bgwriter_all scan is looking at, which means that
> it should remember where it was at the end of the last scan; yet it's
> initialized to 0 every time BgBufferSync is called. Is there someplace
> else that is remembering where the complete scan is leaving off when
> bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan
> in fact just keep re-scanning the beginning of the buffers?

No. BgBufferSync() correctly keeps track of the position at which to
restart scanning. buf_id1 is not initialized to 0 every time
BgBufferSync() is called, because buf_id1 is a static local variable,
whose initializer runs only once. Please see the following code. It
prints:

a=0
a=1
a=2


#include <stdio.h>

void func(void)
{
    static int a = 0;

    printf("a=%d\n", a);
    a++;
}

int main(void)
{
    func();
    func();
    func();

    return 0;
}




---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Takayuki Tsunakawa
Mr. Riggs,

Thank you for teaching me the following. I seem to have misunderstood.
I'll learn more.

From: "Simon Riggs" <[EMAIL PROTECTED]>
> On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
>> I understand that checkpoints occur during crash
>> recovery and PITR, so time for those operations would get longer.
>
> A restorepoint happens during recovery, not a checkpoint. The recovery
> is merely repeating the work of the checkpoint that occurred in the
> original WAL stream. Elongating the checkpoint would not have any
> effect on a restorepoint: we only record the checkpoint when it is
> complete and we only create a restorepoint when we see the checkpoint
> record.


Regards,

- Original Message - 
From: "Simon Riggs" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
Cc: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>;

Sent: Monday, December 11, 2006 6:30 PM
Subject: Re: [HACKERS] Load distributed checkpoint


> On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
>> I understand that checkpoints occur during crash
>> recovery and PITR, so time for those operations would get longer.
>
> A restorepoint happens during recovery, not a checkpoint. The recovery
> is merely repeating the work of the checkpoint that occurred in the
> original WAL stream. Elongating the checkpoint would not have any
> effect on a restorepoint: we only record the checkpoint when it is
> complete and we only create a restorepoint when we see the checkpoint
> record.
>
> Crash recovery and PITR use almost exactly the same code path (by
> design), so there isn't anything special to say about PITR either.
>
> -- 
>  Simon Riggs
>  EnterpriseDB   http://www.enterprisedb.com
>
>
>



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Takayuki Tsunakawa
Hello,

From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
"Takayuki Tsunakawa" <[EMAIL PROTECTED]> wrote:
>> I'm afraid it is difficult for system designers to expect steady
>> throughput/response time, as long as PostgreSQL depends on the
>> flushing of the file system cache. How does Oracle provide stable
>> performance?
>> Though I'm not sure, isn't it the key to use O_SYNC so that write()s
>> transfer data to disk?
>
> AFAIK, other databases use write() and fsync() in combination. They
> call fsync() immediately after they write buffers in some small
> batches. Otherwise, they use asynchronous and direct I/O options.
> Therefore, dirty pages in kernel buffers are kept low at any time.


Oracle seems to use O_SYNC. I've found a related page in the Oracle
manuals.

--
http://download-west.oracle.com/docs/cd/B19306_01/win.102/b15688/ap_unix.htm
Direct Writes to Disk
On both UNIX and Windows platforms, bypassing the file system buffer
cache ensures data is written to disk.
On UNIX, Oracle Database uses the O_SYNC flag to bypass the file
system buffer cache. The flag name depends on the UNIX port.
On Windows, Oracle Database bypasses the file system buffer cache
completely.
--

As Itagaki-san says, asynchronous + direct I/O provides the best
performance, I believe. Oracle supplies the combination as follows:

--
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref870
Oracle Database supports kernel asynchronous I/O. This feature is
disabled by default.
By default, the DISK_ASYNCH_IO initialization parameter in the
parameter file is set to TRUE to enable asynchronous I/O on raw
devices. To enable asynchronous I/O on file system files:
1. Ensure that all Oracle Database files are located on file systems
   that support asynchronous I/O.
2. Set the FILESYSTEMIO_OPTIONS initialization parameter in the
   parameter file to one of the following values:

Linux Distribution                  Recommended Value
SUSE Linux Enterprise Server 9      SETALL
Other distributions                 ASYNCH
--

I believe SQL Server also uses direct + asynchronous I/O, because
Microsoft recommends in MSDN that this combination, plus appropriate
multi-threading, provides the best performance.

I tested Oracle9i on RHEL 2.1. I straced DBWn (the database writer,
which is like the bgwriter of PostgreSQL) while creating tables,
indexes, etc. and shutting down the database server. Oracle surely uses
O_SYNC, as the following shows, but it doesn't use fsync().

24462 open("/work4/ora/tuna/users01.dbf", O_RDWR|O_SYNC|O_LARGEFILE) = 16
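
The behavioral difference between the two styles can be put in code as
follows (an illustrative sketch only, with error handling omitted; this
is neither Oracle's nor PostgreSQL's actual code):

#include <fcntl.h>
#include <unistd.h>

static void
flush_styles(const char *path, const char *buf, size_t len)
{
    int fd;

    /* Style 1: buffered write now, fsync() later; the fsync() may
       flush a large accumulated backlog in one burst */
    fd = open(path, O_WRONLY);
    (void) write(fd, buf, len);
    (void) fsync(fd);
    close(fd);

    /* Style 2 (what the strace above shows): with O_SYNC each write()
       returns only after the data reaches disk, so the flushing cost
       is paid incrementally per write */
    fd = open(path, O_WRONLY | O_SYNC);
    (void) write(fd, buf, len);
    close(fd);
}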

I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
referring to DB2?




---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org