Re: [HACKERS] Streaming replication on win32, still broken

2010-02-21 Thread Fujii Masao
On Fri, Feb 19, 2010 at 7:54 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Heikki Linnakangas wrote:
 Magnus Hagander wrote:
 Well, it's going to make the process that reads the WAL cause actual
 physical I/O... That'll take a chunk out of your total available I/O,
 which is likely to push you to the limit of your I/O capacity much
 quicker.

 Right, doesn't seem sensible, though it would be nice to see a benchmark
 on that.

 Here's a patch to disable O_DIRECT when archiving or streaming is
 enabled. This is pretty hard to test, so any extra eyeballs would be nice..

 Committed. Can you check that this fixed the PANIC you saw?

Thanks! Yeah, SR works fine in my MinGW environment.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-19 Thread Heikki Linnakangas
Heikki Linnakangas wrote:
 Magnus Hagander wrote:
 Well, it's going to make the process that reads the WAL cause actual
 physical I/O... That'll take a chunk out of your total available I/O,
 which is likely to push you to the limit of your I/O capacity much
 quicker.
 
 Right, doesn't seem sensible, though it would be nice to see a benchmark
 on that.
 
 Here's a patch to disable O_DIRECT when archiving or streaming is
 enabled. This is pretty hard to test, so any extra eyeballs would be nice..

Committed. Can you check that this fixed the PANIC you saw?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-18 Thread Magnus Hagander
2010/2/18 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
 Fujii Masao wrote:
 On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
 walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
 When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
 issue_xlog_fsync() will do nothing because it assumes the write() synced
 it already. So the data written isn't being forced to disk at all.

 When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC
 flag. So I think that write() flushes the data to disk even if O_DIRECT
 flag is not given. Am I missing something?

 Ah, ok, you're right.

Yes, I believe the difference is that with O_DIRECT it bypasses the
cache completely. Without it, we still sync it out, but it also goes
into the cache.

O_DIRECT helps us when we're not going to read the file again, because
we don't waste cache on it. If we are, which is the case here, it
should be really bad for performance, since we actually have to do a
physical read.

Incidentally, that should also apply to general WAL when archive_mdoe
is on. Do we optimize for that?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-18 Thread Heikki Linnakangas
Magnus Hagander wrote:
 O_DIRECT helps us when we're not going to read the file again, because
 we don't waste cache on it. If we are, which is the case here, it
 should be really bad for performance, since we actually have to do a
 physical read.
 
 Incidentally, that should also apply to general WAL when archive_mdoe
 is on. Do we optimize for that?

Hmm, no we don't. We do take that into account so that we refrain from
issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
disable O_DIRECT. Maybe we should..

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-18 Thread Fujii Masao
On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Magnus Hagander wrote:
 O_DIRECT helps us when we're not going to read the file again, because
 we don't waste cache on it. If we are, which is the case here, it
 should be really bad for performance, since we actually have to do a
 physical read.

 Incidentally, that should also apply to general WAL when archive_mdoe
 is on. Do we optimize for that?

 Hmm, no we don't. We do take that into account so that we refrain from
 issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
 disable O_DIRECT. Maybe we should..

Since the performance of WAL write is more important than that of WAL
archiving in general, that optimization might offer little benefit.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-18 Thread Magnus Hagander
2010/2/18 Fujii Masao masao.fu...@gmail.com:
 On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Magnus Hagander wrote:
 O_DIRECT helps us when we're not going to read the file again, because
 we don't waste cache on it. If we are, which is the case here, it
 should be really bad for performance, since we actually have to do a
 physical read.

 Incidentally, that should also apply to general WAL when archive_mdoe
 is on. Do we optimize for that?

 Hmm, no we don't. We do take that into account so that we refrain from
 issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
 disable O_DIRECT. Maybe we should..

 Since the performance of WAL write is more important than that of WAL
 archiving in general, that optimization might offer little benefit.

Well, it's going to make the process that reads the WAL cause actual
physical I/O... That'll take a chunk out of your total available I/O,
which is likely to push you to the limit of your I/O capacity much
quicker.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-18 Thread Heikki Linnakangas
Magnus Hagander wrote:
 2010/2/18 Fujii Masao masao.fu...@gmail.com:
 On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Magnus Hagander wrote:
 O_DIRECT helps us when we're not going to read the file again, because
 we don't waste cache on it. If we are, which is the case here, it
 should be really bad for performance, since we actually have to do a
 physical read.

 Incidentally, that should also apply to general WAL when archive_mdoe
 is on. Do we optimize for that?
 Hmm, no we don't. We do take that into account so that we refrain from
 issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't
 disable O_DIRECT. Maybe we should..
 Since the performance of WAL write is more important than that of WAL
 archiving in general, that optimization might offer little benefit.
 
 Well, it's going to make the process that reads the WAL cause actual
 physical I/O... That'll take a chunk out of your total available I/O,
 which is likely to push you to the limit of your I/O capacity much
 quicker.

Right, doesn't seem sensible, though it would be nice to see a benchmark
on that.

Here's a patch to disable O_DIRECT when archiving or streaming is
enabled. This is pretty hard to test, so any extra eyeballs would be nice..

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
? GNUmakefile
? a.patch
? b
? backend
? config.log
? config.status
? config.status.lineno
? configure.lineno
? gin-splay-1.patch
? gin-splay-2.patch
? gin-splay-3.patch
? include
? libpq
? log_unlogged_op_0115.patch
? md-1.c
? md-1.patch
? replication
? temp-file-resowner-2.patch
? contrib/pgbench/.deps
? contrib/pgbench/fsynctest
? contrib/pgbench/fsynctest.c
? contrib/pgbench/fsynctestfile
? contrib/pgbench/pgbench
? contrib/spi/.deps
? doc/src/sgml/HTML.index
? doc/src/sgml/bookindex.sgml
? doc/src/sgml/features-supported.sgml
? doc/src/sgml/features-unsupported.sgml
? doc/src/sgml/version.sgml
? src/Makefile.global
? src/backend/aaa.patch
? src/backend/postgres
? src/backend/access/common/.deps
? src/backend/access/gin/.deps
? src/backend/access/gist/.deps
? src/backend/access/hash/.deps
? src/backend/access/heap/.deps
? src/backend/access/index/.deps
? src/backend/access/nbtree/.deps
? src/backend/access/transam/.deps
? src/backend/bootstrap/.deps
? src/backend/catalog/.deps
? src/backend/commands/.deps
? src/backend/executor/.deps
? src/backend/foreign/.deps
? src/backend/foreign/dummy/.deps
? src/backend/foreign/postgresql/.deps
? src/backend/lib/.deps
? src/backend/libpq/.deps
? src/backend/main/.deps
? src/backend/nodes/.deps
? src/backend/optimizer/geqo/.deps
? src/backend/optimizer/path/.deps
? src/backend/optimizer/plan/.deps
? src/backend/optimizer/prep/.deps
? src/backend/optimizer/util/.deps
? src/backend/parser/.deps
? src/backend/po/af.mo
? src/backend/po/cs.mo
? src/backend/po/hr.mo
? src/backend/po/hu.mo
? src/backend/po/it.mo
? src/backend/po/ko.mo
? src/backend/po/nb.mo
? src/backend/po/nl.mo
? src/backend/po/pl.mo
? src/backend/po/ro.mo
? src/backend/po/ru.mo
? src/backend/po/sk.mo
? src/backend/po/sl.mo
? src/backend/po/sv.mo
? src/backend/po/zh_CN.mo
? src/backend/po/zh_TW.mo
? src/backend/port/.deps
? src/backend/postmaster/.deps
? src/backend/regex/.deps
? src/backend/replication/.deps
? src/backend/replication/libpqwalreceiver/.deps
? src/backend/rewrite/.deps
? src/backend/snowball/.deps
? src/backend/snowball/snowball_create.sql
? src/backend/storage/buffer/.deps
? src/backend/storage/file/.deps
? src/backend/storage/freespace/.deps
? src/backend/storage/ipc/.deps
? src/backend/storage/large_object/.deps
? src/backend/storage/lmgr/.deps
? src/backend/storage/page/.deps
? src/backend/storage/smgr/.deps
? src/backend/tcop/.deps
? src/backend/tsearch/.deps
? src/backend/utils/.deps
? src/backend/utils/probes.h
? src/backend/utils/adt/.deps
? src/backend/utils/cache/.deps
? src/backend/utils/error/.deps
? src/backend/utils/fmgr/.deps
? src/backend/utils/hash/.deps
? src/backend/utils/init/.deps
? src/backend/utils/mb/.deps
? src/backend/utils/mb/Unicode/BIG5.TXT
? src/backend/utils/mb/Unicode/CP950.TXT
? src/backend/utils/mb/conversion_procs/conversion_create.sql
? src/backend/utils/mb/conversion_procs/ascii_and_mic/.deps
? src/backend/utils/mb/conversion_procs/cyrillic_and_mic/.deps
? src/backend/utils/mb/conversion_procs/euc2004_sjis2004/.deps
? src/backend/utils/mb/conversion_procs/euc_cn_and_mic/.deps
? src/backend/utils/mb/conversion_procs/euc_jis_2004_and_shift_jis_2004/.deps
? src/backend/utils/mb/conversion_procs/euc_jp_and_sjis/.deps
? src/backend/utils/mb/conversion_procs/euc_kr_and_mic/.deps
? src/backend/utils/mb/conversion_procs/euc_tw_and_big5/.deps
? src/backend/utils/mb/conversion_procs/latin2_and_win1250/.deps
? src/backend/utils/mb/conversion_procs/latin_and_mic/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_ascii/.deps
? src/backend/utils/mb/conversion_procs/utf8_and_big5/.deps
? 

Re: [HACKERS] Streaming replication on win32, still broken

2010-02-17 Thread Fujii Masao
On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net wrote:
 In that case, O_DIRECT would be counterproductive, no? It maps to
 FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
 cache. So the read in the startup proc is actually guaranteed to
 reuqire a physical read - of something we just wrote, so it'll almost
 certainly end up waiting for a rotation, no?

 Seems like getting rid of O_DIRECT here is the right thing to do,
 regardless of this.

 Agreed. I'll remove O_DIRECT from walreceiver.

Here is the patch to do that.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 1627,1633  XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			/* create/use new log file */
  			use_existent = true;
  			openLogFile = XLogFileInit(openLogId, openLogSeg,
! 	   use_existent, true);
  			openLogOff = 0;
  		}
  
--- 1627,1633 
  			/* create/use new log file */
  			use_existent = true;
  			openLogFile = XLogFileInit(openLogId, openLogSeg,
! 	   use_existent, true, true);
  			openLogOff = 0;
  		}
  
***
*** 2184,2189  XLogNeedsFlush(XLogRecPtr record)
--- 2184,2195 
   * place.  This should be TRUE except during bootstrap log creation.  The
   * caller must *not* hold the lock at call.
   *
+  * allow_direct_io: if TRUE, allow a WAL write to bypass the kernel cache
+  * by using PG_O_DIRECT for opening a file. Otherwise, PG_O_DIRECT is
+  * forcibly removed from the sync flag of open(). This should be FALSE
+  * only when walreceiver process writes WAL data because it's read
+  * immediately by the startup process.
+  *
   * Returns FD of opened file.
   *
   * Note: errors here are ERROR not PANIC because we might or might not be
***
*** 2193,2199  XLogNeedsFlush(XLogRecPtr record)
   */
  int
  XLogFileInit(uint32 log, uint32 seg,
! 			 bool *use_existent, bool use_lock)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
--- 2199,2205 
   */
  int
  XLogFileInit(uint32 log, uint32 seg,
! 			 bool *use_existent, bool use_lock, bool allow_direct_io)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
***
*** 2203,2208  XLogFileInit(uint32 log, uint32 seg,
--- 2209,2217 
  	int			max_advance;
  	int			fd;
  	int			nbytes;
+ 	int			sync_bit;
+ 
+ 	sync_bit = get_sync_bit(sync_method)  (allow_direct_io ? 0 : ~PG_O_DIRECT);
  
  	XLogFilePath(path, ThisTimeLineID, log, seg);
  
***
*** 2211,2217  XLogFileInit(uint32 log, uint32 seg,
  	 */
  	if (*use_existent)
  	{
! 		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
  		   S_IRUSR | S_IWUSR);
  		if (fd  0)
  		{
--- 2220,2226 
  	 */
  	if (*use_existent)
  	{
! 		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit,
  		   S_IRUSR | S_IWUSR);
  		if (fd  0)
  		{
***
*** 2237,2243  XLogFileInit(uint32 log, uint32 seg,
  
  	unlink(tmppath);
  
! 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
  	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
--- 2246,2252 
  
  	unlink(tmppath);
  
! 	/* do not use sync_bit here --- want to fsync only at end of fill */
  	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
***
*** 2317,2323  XLogFileInit(uint32 log, uint32 seg,
  	*use_existent = false;
  
  	/* Now open original target segment (might not be file I just made) */
! 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
  		ereport(ERROR,
--- 2326,2332 
  	*use_existent = false;
  
  	/* Now open original target segment (might not be file I just made) */
! 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit,
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
  		ereport(ERROR,
***
*** 3121,3127  PreallocXlogFiles(XLogRecPtr endptr)
  	{
  		NextLogSeg(_logId, _logSeg);
  		use_existent = true;
! 		lf = XLogFileInit(_logId, _logSeg, use_existent, true);
  		close(lf);
  		if (!use_existent)
  			CheckpointStats.ckpt_segs_added++;
--- 3130,3136 
  	{
  		NextLogSeg(_logId, _logSeg);
  		use_existent = true;
! 		lf = XLogFileInit(_logId, _logSeg, use_existent, true, true);
  		close(lf);
  		if (!use_existent)
  			CheckpointStats.ckpt_segs_added++;
***
*** 4794,4800  BootStrapXLOG(void)
  
  	/* Create first XLOG segment file */
  	use_existent = false;
! 	openLogFile = XLogFileInit(0, 0, use_existent, false);
  
  	/* Write the first page with the initial record */
  	errno = 0;
--- 4803,4809 
  
  	/* Create first XLOG segment file */
  	use_existent = false;

Re: [HACKERS] Streaming replication on win32, still broken

2010-02-17 Thread Fujii Masao
On Wed, Feb 17, 2010 at 6:00 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net wrote:
 In that case, O_DIRECT would be counterproductive, no? It maps to
 FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
 cache. So the read in the startup proc is actually guaranteed to
 reuqire a physical read - of something we just wrote, so it'll almost
 certainly end up waiting for a rotation, no?

 Seems like getting rid of O_DIRECT here is the right thing to do,
 regardless of this.

 Agreed. I'll remove O_DIRECT from walreceiver.

 Here is the patch to do that.

Ooops! I found the bug in the patch. Here is the updated version.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 1627,1633  XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
  			/* create/use new log file */
  			use_existent = true;
  			openLogFile = XLogFileInit(openLogId, openLogSeg,
! 	   use_existent, true);
  			openLogOff = 0;
  		}
  
--- 1627,1633 
  			/* create/use new log file */
  			use_existent = true;
  			openLogFile = XLogFileInit(openLogId, openLogSeg,
! 	   use_existent, true, true);
  			openLogOff = 0;
  		}
  
***
*** 2184,2189  XLogNeedsFlush(XLogRecPtr record)
--- 2184,2195 
   * place.  This should be TRUE except during bootstrap log creation.  The
   * caller must *not* hold the lock at call.
   *
+  * allow_direct_io: if TRUE, allow a WAL write to bypass the kernel cache
+  * by using PG_O_DIRECT for opening a file. Otherwise, PG_O_DIRECT is
+  * forcibly removed from the sync flag of open(). This should be FALSE
+  * only when walreceiver process writes WAL data because it's read
+  * immediately by the startup process.
+  *
   * Returns FD of opened file.
   *
   * Note: errors here are ERROR not PANIC because we might or might not be
***
*** 2193,2199  XLogNeedsFlush(XLogRecPtr record)
   */
  int
  XLogFileInit(uint32 log, uint32 seg,
! 			 bool *use_existent, bool use_lock)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
--- 2199,2205 
   */
  int
  XLogFileInit(uint32 log, uint32 seg,
! 			 bool *use_existent, bool use_lock, bool allow_direct_io)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
***
*** 2203,2208  XLogFileInit(uint32 log, uint32 seg,
--- 2209,2219 
  	int			max_advance;
  	int			fd;
  	int			nbytes;
+ 	int			sync_bit;
+ 
+ 	sync_bit = get_sync_bit(sync_method);
+ 	if (!allow_direct_io)
+ 		sync_bit = ~PG_O_DIRECT;
  
  	XLogFilePath(path, ThisTimeLineID, log, seg);
  
***
*** 2211,2217  XLogFileInit(uint32 log, uint32 seg,
  	 */
  	if (*use_existent)
  	{
! 		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
  		   S_IRUSR | S_IWUSR);
  		if (fd  0)
  		{
--- ,2228 
  	 */
  	if (*use_existent)
  	{
! 		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit,
  		   S_IRUSR | S_IWUSR);
  		if (fd  0)
  		{
***
*** 2237,2243  XLogFileInit(uint32 log, uint32 seg,
  
  	unlink(tmppath);
  
! 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
  	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
--- 2248,2254 
  
  	unlink(tmppath);
  
! 	/* do not use sync_bit here --- want to fsync only at end of fill */
  	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
***
*** 2317,2323  XLogFileInit(uint32 log, uint32 seg,
  	*use_existent = false;
  
  	/* Now open original target segment (might not be file I just made) */
! 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
  		ereport(ERROR,
--- 2328,2334 
  	*use_existent = false;
  
  	/* Now open original target segment (might not be file I just made) */
! 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit,
  	   S_IRUSR | S_IWUSR);
  	if (fd  0)
  		ereport(ERROR,
***
*** 3121,3127  PreallocXlogFiles(XLogRecPtr endptr)
  	{
  		NextLogSeg(_logId, _logSeg);
  		use_existent = true;
! 		lf = XLogFileInit(_logId, _logSeg, use_existent, true);
  		close(lf);
  		if (!use_existent)
  			CheckpointStats.ckpt_segs_added++;
--- 3132,3138 
  	{
  		NextLogSeg(_logId, _logSeg);
  		use_existent = true;
! 		lf = XLogFileInit(_logId, _logSeg, use_existent, true, true);
  		close(lf);
  		if (!use_existent)
  			CheckpointStats.ckpt_segs_added++;
***
*** 4794,4800  BootStrapXLOG(void)
  
  	/* Create first XLOG segment file */
  	use_existent = false;
! 	openLogFile = XLogFileInit(0, 0, use_existent, false);

Re: [HACKERS] Streaming replication on win32, still broken

2010-02-17 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Wed, Feb 17, 2010 at 6:00 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net 
 wrote:
 In that case, O_DIRECT would be counterproductive, no? It maps to
 FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
 cache. So the read in the startup proc is actually guaranteed to
 reuqire a physical read - of something we just wrote, so it'll almost
 certainly end up waiting for a rotation, no?

 Seems like getting rid of O_DIRECT here is the right thing to do,
 regardless of this.
 Agreed. I'll remove O_DIRECT from walreceiver.
 Here is the patch to do that.
 
 Ooops! I found the bug in the patch. Here is the updated version.

If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
issue_xlog_fsync() will do nothing because it assumes the write() synced
it already. So the data written isn't being forced to disk at all.

How about just forcing sync_method to 'fsync' in walreceiver?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-17 Thread Fujii Masao
On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
 walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
 When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
 issue_xlog_fsync() will do nothing because it assumes the write() synced
 it already. So the data written isn't being forced to disk at all.

When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC
flag. So I think that write() flushes the data to disk even if O_DIRECT
flag is not given. Am I missing something?

 How about just forcing sync_method to 'fsync' in walreceiver?

In win32, O_DSYNC seems to be preferred to 'fsync' so far. So I'm not sure
if reshuffling of priority is harmless.
http://archives.postgresql.org/pgsql-hackers-win32/2005-03/msg00148.php

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-17 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 If I'm reading the patch correctly, when wal_sync_method is 'open_sync',
 walreceiver nevertheless opens the WAL file without the O_DIRECT flag.
 When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(),
 issue_xlog_fsync() will do nothing because it assumes the write() synced
 it already. So the data written isn't being forced to disk at all.
 
 When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC
 flag. So I think that write() flushes the data to disk even if O_DIRECT
 flag is not given. Am I missing something?

Ah, ok, you're right.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Fujii Masao
On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net wrote:
 With the libpq fixes, I get further (more on that fix later, btw), but
 now I get stuck in this. When I do something on the master that
 generates WAL, such as insert a record, and then try to query this on
 the slave, the walreceiver process crashes with:

 PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 
 160:
  Invalid argument
 LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487

 I'll keep digging at the details, but if somebody has a good idea here.. ;)

Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too.
Though I've not idenfied the cause yet, I guess that it derives from wrong use
of the type of local variables in XLogWalRcvWrite(). I'll continue investigation
of it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Magnus Hagander
2010/2/16 Fujii Masao masao.fu...@gmail.com:
 On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net wrote:
 With the libpq fixes, I get further (more on that fix later, btw), but
 now I get stuck in this. When I do something on the master that
 generates WAL, such as insert a record, and then try to query this on
 the slave, the walreceiver process crashes with:

 PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 
 160:
  Invalid argument
 LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487

 I'll keep digging at the details, but if somebody has a good idea here.. ;)

 Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, 
 too.
 Though I've not idenfied the cause yet, I guess that it derives from wrong use
 of the type of local variables in XLogWalRcvWrite(). I'll continue 
 investigation
 of it.

Thanks!

I will be somewhat spottily available over the next two days due to
on-site work with clients.

Let me know if you would be helped by some details of how to get a
(somewhat faster) EC2 image up and running with MSVC to test on :-)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Fujii Masao
On Tue, Feb 16, 2010 at 7:20 PM, Magnus Hagander mag...@hagander.net wrote:
 2010/2/16 Fujii Masao masao.fu...@gmail.com:
 On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 With the libpq fixes, I get further (more on that fix later, btw), but
 now I get stuck in this. When I do something on the master that
 generates WAL, such as insert a record, and then try to query this on
 the slave, the walreceiver process crashes with:

 PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, length 
 160:
  Invalid argument
 LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487

 I'll keep digging at the details, but if somebody has a good idea here.. ;)

 Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, 
 too.
 Though I've not idenfied the cause yet, I guess that it derives from wrong 
 use
 of the type of local variables in XLogWalRcvWrite(). I'll continue 
 investigation
 of it.

 Thanks!

 I will be somewhat spottily available over the next two days due to
 on-site work with clients.

 Let me know if you would be helped by some details of how to get a
 (somewhat faster) EC2 image up and running with MSVC to test on :-)

Thanks! I can probably use the EC2 image by reading your great blog post.
http://blog.hagander.net/archives/151-Testing-PostgreSQL-patches-on-Windows-using-Amazon-EC2.html

But it might take some time to make my sysadmin open the port for
rdesktop for some reasons...

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Magnus Hagander
2010/2/16 Fujii Masao masao.fu...@gmail.com:
 On Tue, Feb 16, 2010 at 7:20 PM, Magnus Hagander mag...@hagander.net wrote:
 2010/2/16 Fujii Masao masao.fu...@gmail.com:
 On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net 
 wrote:
 With the libpq fixes, I get further (more on that fix later, btw), but
 now I get stuck in this. When I do something on the master that
 generates WAL, such as insert a record, and then try to query this on
 the slave, the walreceiver process crashes with:

 PANIC:  XX000: could not write to log file 0, segment 9 at offset 0, 
 length 160:
  Invalid argument
 LOCATION:  XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487

 I'll keep digging at the details, but if somebody has a good idea here.. ;)

 Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, 
 too.
 Though I've not idenfied the cause yet, I guess that it derives from wrong 
 use
 of the type of local variables in XLogWalRcvWrite(). I'll continue 
 investigation
 of it.

 Thanks!

 I will be somewhat spottily available over the next two days due to
 on-site work with clients.

 Let me know if you would be helped by some details of how to get a
 (somewhat faster) EC2 image up and running with MSVC to test on :-)

 Thanks! I can probably use the EC2 image by reading your great blog post.
 http://blog.hagander.net/archives/151-Testing-PostgreSQL-patches-on-Windows-using-Amazon-EC2.html

Actually, that one deosn't work anymore, because I managed to break
the image :-)

If you send me your amazon id, I can get you premissions on my private
image. I plan to clean it up and make it public, just haven't gotten
around to it yet...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Fujii Masao
On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander mag...@hagander.net wrote:
 If you send me your amazon id, I can get you premissions on my private
 image. I plan to clean it up and make it public, just haven't gotten
 around to it yet...

Thanks for your concern! I'll send the ID when I complete the preparation.

And, fortunately?, when I set wal_sync_method to open_sync, the problem was
reproduced in the linux, too. The cause is that the data that is written by
walreceiver is not aligned, even if O_DIRECT is used. On win32, O_DIRECT is
used by default. So the problem always happened on win32.

I propose two solution ideas:

1. O_DIRECT is somewhat harmful in the standby since the data written by
   walreceiver is read by the startup process immediately. So, how about
   not making only walreceiver use O_DIRECT?

2. Straightforwardly observe the alignment rule. Since the received WAL
   data might start at the middle of WAL block, walreceiver needs to keep
   the last half-written WAL block for alignment. OTOH since the received
   data might end at the middle of WAL block, walreceiver needs zero-padding.
   As a result, walreceiver writes the set of the last WAL block, received
   data and zero-padding.

Which is better? Or do you have another better idea?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Magnus Hagander
On Wed, Feb 17, 2010 at 06:55, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander mag...@hagander.net wrote:
 If you send me your amazon id, I can get you premissions on my private
 image. I plan to clean it up and make it public, just haven't gotten
 around to it yet...

 Thanks for your concern! I'll send the ID when I complete the preparation.

ok.


 And, fortunately?, when I set wal_sync_method to open_sync, the problem was
 reproduced in the linux, too. The cause is that the data that is written by

Ah, that's good. It always helps if it's a cross-platform issue -
particularly in that it's not one of the funky win32 specific things
we did :)


 walreceiver is not aligned, even if O_DIRECT is used. On win32, O_DIRECT is
 used by default. So the problem always happened on win32.

Ahh. I see.


 I propose two solution ideas:

 1. O_DIRECT is somewhat harmful in the standby since the data written by
   walreceiver is read by the startup process immediately. So, how about
   not making only walreceiver use O_DIRECT?

In that case, O_DIRECT would be counterproductive, no? It maps to
FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
cache. So the read in the startup proc is actually guaranteed to
reuqire a physical read - of something we just wrote, so it'll almost
certainly end up waiting for a rotation, no?

Seems like getting rid of O_DIRECT here is the right thing to do,
regardless of this.


 2. Straightforwardly observe the alignment rule. Since the received WAL
   data might start at the middle of WAL block, walreceiver needs to keep
   the last half-written WAL block for alignment. OTOH since the received
   data might end at the middle of WAL block, walreceiver needs zero-padding.
   As a result, walreceiver writes the set of the last WAL block, received
   data and zero-padding.

May there be other reasons to d this as well?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Wed, Feb 17, 2010 at 06:55, Fujii Masao masao.fu...@gmail.com wrote:
 2. Straightforwardly observe the alignment rule. Since the received WAL
   data might start at the middle of WAL block, walreceiver needs to keep
   the last half-written WAL block for alignment. OTOH since the received
   data might end at the middle of WAL block, walreceiver needs zero-padding.
   As a result, walreceiver writes the set of the last WAL block, received
   data and zero-padding.

 May there be other reasons to d this as well?

Writing misaligned data is certain to be expensive even when it works...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Fujii Masao
On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net wrote:
 In that case, O_DIRECT would be counterproductive, no? It maps to
 FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the
 cache. So the read in the startup proc is actually guaranteed to
 reuqire a physical read - of something we just wrote, so it'll almost
 certainly end up waiting for a rotation, no?

 Seems like getting rid of O_DIRECT here is the right thing to do,
 regardless of this.

Agreed. I'll remove O_DIRECT from walreceiver.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming replication on win32, still broken

2010-02-16 Thread Fujii Masao
On Wed, Feb 17, 2010 at 3:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 On Wed, Feb 17, 2010 at 06:55, Fujii Masao masao.fu...@gmail.com wrote:
 2. Straightforwardly observe the alignment rule. Since the received WAL
   data might start at the middle of WAL block, walreceiver needs to keep
   the last half-written WAL block for alignment. OTOH since the received
   data might end at the middle of WAL block, walreceiver needs zero-padding.
   As a result, walreceiver writes the set of the last WAL block, received
   data and zero-padding.

 May there be other reasons to d this as well?

 Writing misaligned data is certain to be expensive even when it works...

Yeah, right. After I remove O_DIRECT, I'll change walreceiver so as to
do an alignment correctly, and then I'll test the performance.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers