Re: [HACKERS] Streaming replication on win32, still broken
On Fri, Feb 19, 2010 at 7:54 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Heikki Linnakangas wrote: Magnus Hagander wrote: Well, it's going to make the process that reads the WAL cause actual physical I/O... That'll take a chunk out of your total available I/O, which is likely to push you to the limit of your I/O capacity much quicker. Right, doesn't seem sensible, though it would be nice to see a benchmark on that. Here's a patch to disable O_DIRECT when archiving or streaming is enabled. This is pretty hard to test, so any extra eyeballs would be nice.. Committed. Can you check that this fixed the PANIC you saw? Thanks! Yeah, SR works fine in my MinGW environment. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
Heikki Linnakangas wrote: Magnus Hagander wrote: Well, it's going to make the process that reads the WAL cause actual physical I/O... That'll take a chunk out of your total available I/O, which is likely to push you to the limit of your I/O capacity much quicker. Right, doesn't seem sensible, though it would be nice to see a benchmark on that. Here's a patch to disable O_DIRECT when archiving or streaming is enabled. This is pretty hard to test, so any extra eyeballs would be nice.. Committed. Can you check that this fixed the PANIC you saw? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
2010/2/18 Heikki Linnakangas heikki.linnakan...@enterprisedb.com: Fujii Masao wrote: On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: If I'm reading the patch correctly, when wal_sync_method is 'open_sync', walreceiver nevertheless opens the WAL file without the O_DIRECT flag. When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(), issue_xlog_fsync() will do nothing because it assumes the write() synced it already. So the data written isn't being forced to disk at all. When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC flag. So I think that write() flushes the data to disk even if O_DIRECT flag is not given. Am I missing something? Ah, ok, you're right. Yes, I believe the difference is that with O_DIRECT it bypasses the cache completely. Without it, we still sync it out, but it also goes into the cache. O_DIRECT helps us when we're not going to read the file again, because we don't waste cache on it. If we are, which is the case here, it should be really bad for performance, since we actually have to do a physical read. Incidentally, that should also apply to general WAL when archive_mdoe is on. Do we optimize for that? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
Magnus Hagander wrote: O_DIRECT helps us when we're not going to read the file again, because we don't waste cache on it. If we are, which is the case here, it should be really bad for performance, since we actually have to do a physical read. Incidentally, that should also apply to general WAL when archive_mdoe is on. Do we optimize for that? Hmm, no we don't. We do take that into account so that we refrain from issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't disable O_DIRECT. Maybe we should.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Magnus Hagander wrote: O_DIRECT helps us when we're not going to read the file again, because we don't waste cache on it. If we are, which is the case here, it should be really bad for performance, since we actually have to do a physical read. Incidentally, that should also apply to general WAL when archive_mdoe is on. Do we optimize for that? Hmm, no we don't. We do take that into account so that we refrain from issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't disable O_DIRECT. Maybe we should.. Since the performance of WAL write is more important than that of WAL archiving in general, that optimization might offer little benefit. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
2010/2/18 Fujii Masao masao.fu...@gmail.com: On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Magnus Hagander wrote: O_DIRECT helps us when we're not going to read the file again, because we don't waste cache on it. If we are, which is the case here, it should be really bad for performance, since we actually have to do a physical read. Incidentally, that should also apply to general WAL when archive_mdoe is on. Do we optimize for that? Hmm, no we don't. We do take that into account so that we refrain from issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't disable O_DIRECT. Maybe we should.. Since the performance of WAL write is more important than that of WAL archiving in general, that optimization might offer little benefit. Well, it's going to make the process that reads the WAL cause actual physical I/O... That'll take a chunk out of your total available I/O, which is likely to push you to the limit of your I/O capacity much quicker. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
Magnus Hagander wrote: 2010/2/18 Fujii Masao masao.fu...@gmail.com: On Thu, Feb 18, 2010 at 7:04 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Magnus Hagander wrote: O_DIRECT helps us when we're not going to read the file again, because we don't waste cache on it. If we are, which is the case here, it should be really bad for performance, since we actually have to do a physical read. Incidentally, that should also apply to general WAL when archive_mdoe is on. Do we optimize for that? Hmm, no we don't. We do take that into account so that we refrain from issuing posix_fadvice(DONTNEED) if archive_mode is on, but we don't disable O_DIRECT. Maybe we should.. Since the performance of WAL write is more important than that of WAL archiving in general, that optimization might offer little benefit. Well, it's going to make the process that reads the WAL cause actual physical I/O... That'll take a chunk out of your total available I/O, which is likely to push you to the limit of your I/O capacity much quicker. Right, doesn't seem sensible, though it would be nice to see a benchmark on that. Here's a patch to disable O_DIRECT when archiving or streaming is enabled. This is pretty hard to test, so any extra eyeballs would be nice.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ? GNUmakefile ? a.patch ? b ? backend ? config.log ? config.status ? config.status.lineno ? configure.lineno ? gin-splay-1.patch ? gin-splay-2.patch ? gin-splay-3.patch ? include ? libpq ? log_unlogged_op_0115.patch ? md-1.c ? md-1.patch ? replication ? temp-file-resowner-2.patch ? contrib/pgbench/.deps ? contrib/pgbench/fsynctest ? contrib/pgbench/fsynctest.c ? contrib/pgbench/fsynctestfile ? contrib/pgbench/pgbench ? contrib/spi/.deps ? doc/src/sgml/HTML.index ? doc/src/sgml/bookindex.sgml ? doc/src/sgml/features-supported.sgml ? doc/src/sgml/features-unsupported.sgml ? doc/src/sgml/version.sgml ? src/Makefile.global ? src/backend/aaa.patch ? src/backend/postgres ? src/backend/access/common/.deps ? src/backend/access/gin/.deps ? src/backend/access/gist/.deps ? src/backend/access/hash/.deps ? src/backend/access/heap/.deps ? src/backend/access/index/.deps ? src/backend/access/nbtree/.deps ? src/backend/access/transam/.deps ? src/backend/bootstrap/.deps ? src/backend/catalog/.deps ? src/backend/commands/.deps ? src/backend/executor/.deps ? src/backend/foreign/.deps ? src/backend/foreign/dummy/.deps ? src/backend/foreign/postgresql/.deps ? src/backend/lib/.deps ? src/backend/libpq/.deps ? src/backend/main/.deps ? src/backend/nodes/.deps ? src/backend/optimizer/geqo/.deps ? src/backend/optimizer/path/.deps ? src/backend/optimizer/plan/.deps ? src/backend/optimizer/prep/.deps ? src/backend/optimizer/util/.deps ? src/backend/parser/.deps ? src/backend/po/af.mo ? src/backend/po/cs.mo ? src/backend/po/hr.mo ? src/backend/po/hu.mo ? src/backend/po/it.mo ? src/backend/po/ko.mo ? src/backend/po/nb.mo ? src/backend/po/nl.mo ? src/backend/po/pl.mo ? src/backend/po/ro.mo ? src/backend/po/ru.mo ? src/backend/po/sk.mo ? src/backend/po/sl.mo ? src/backend/po/sv.mo ? src/backend/po/zh_CN.mo ? src/backend/po/zh_TW.mo ? src/backend/port/.deps ? src/backend/postmaster/.deps ? src/backend/regex/.deps ? src/backend/replication/.deps ? src/backend/replication/libpqwalreceiver/.deps ? src/backend/rewrite/.deps ? src/backend/snowball/.deps ? src/backend/snowball/snowball_create.sql ? src/backend/storage/buffer/.deps ? src/backend/storage/file/.deps ? src/backend/storage/freespace/.deps ? src/backend/storage/ipc/.deps ? src/backend/storage/large_object/.deps ? src/backend/storage/lmgr/.deps ? src/backend/storage/page/.deps ? src/backend/storage/smgr/.deps ? src/backend/tcop/.deps ? src/backend/tsearch/.deps ? src/backend/utils/.deps ? src/backend/utils/probes.h ? src/backend/utils/adt/.deps ? src/backend/utils/cache/.deps ? src/backend/utils/error/.deps ? src/backend/utils/fmgr/.deps ? src/backend/utils/hash/.deps ? src/backend/utils/init/.deps ? src/backend/utils/mb/.deps ? src/backend/utils/mb/Unicode/BIG5.TXT ? src/backend/utils/mb/Unicode/CP950.TXT ? src/backend/utils/mb/conversion_procs/conversion_create.sql ? src/backend/utils/mb/conversion_procs/ascii_and_mic/.deps ? src/backend/utils/mb/conversion_procs/cyrillic_and_mic/.deps ? src/backend/utils/mb/conversion_procs/euc2004_sjis2004/.deps ? src/backend/utils/mb/conversion_procs/euc_cn_and_mic/.deps ? src/backend/utils/mb/conversion_procs/euc_jis_2004_and_shift_jis_2004/.deps ? src/backend/utils/mb/conversion_procs/euc_jp_and_sjis/.deps ? src/backend/utils/mb/conversion_procs/euc_kr_and_mic/.deps ? src/backend/utils/mb/conversion_procs/euc_tw_and_big5/.deps ? src/backend/utils/mb/conversion_procs/latin2_and_win1250/.deps ? src/backend/utils/mb/conversion_procs/latin_and_mic/.deps ? src/backend/utils/mb/conversion_procs/utf8_and_ascii/.deps ? src/backend/utils/mb/conversion_procs/utf8_and_big5/.deps ?
Re: [HACKERS] Streaming replication on win32, still broken
On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net wrote: In that case, O_DIRECT would be counterproductive, no? It maps to FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the cache. So the read in the startup proc is actually guaranteed to reuqire a physical read - of something we just wrote, so it'll almost certainly end up waiting for a rotation, no? Seems like getting rid of O_DIRECT here is the right thing to do, regardless of this. Agreed. I'll remove O_DIRECT from walreceiver. Here is the patch to do that. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 1627,1633 XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch) /* create/use new log file */ use_existent = true; openLogFile = XLogFileInit(openLogId, openLogSeg, ! use_existent, true); openLogOff = 0; } --- 1627,1633 /* create/use new log file */ use_existent = true; openLogFile = XLogFileInit(openLogId, openLogSeg, ! use_existent, true, true); openLogOff = 0; } *** *** 2184,2189 XLogNeedsFlush(XLogRecPtr record) --- 2184,2195 * place. This should be TRUE except during bootstrap log creation. The * caller must *not* hold the lock at call. * + * allow_direct_io: if TRUE, allow a WAL write to bypass the kernel cache + * by using PG_O_DIRECT for opening a file. Otherwise, PG_O_DIRECT is + * forcibly removed from the sync flag of open(). This should be FALSE + * only when walreceiver process writes WAL data because it's read + * immediately by the startup process. + * * Returns FD of opened file. * * Note: errors here are ERROR not PANIC because we might or might not be *** *** 2193,2199 XLogNeedsFlush(XLogRecPtr record) */ int XLogFileInit(uint32 log, uint32 seg, ! bool *use_existent, bool use_lock) { char path[MAXPGPATH]; char tmppath[MAXPGPATH]; --- 2199,2205 */ int XLogFileInit(uint32 log, uint32 seg, ! bool *use_existent, bool use_lock, bool allow_direct_io) { char path[MAXPGPATH]; char tmppath[MAXPGPATH]; *** *** 2203,2208 XLogFileInit(uint32 log, uint32 seg, --- 2209,2217 int max_advance; int fd; int nbytes; + int sync_bit; + + sync_bit = get_sync_bit(sync_method) (allow_direct_io ? 0 : ~PG_O_DIRECT); XLogFilePath(path, ThisTimeLineID, log, seg); *** *** 2211,2217 XLogFileInit(uint32 log, uint32 seg, */ if (*use_existent) { ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method), S_IRUSR | S_IWUSR); if (fd 0) { --- 2220,2226 */ if (*use_existent) { ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit, S_IRUSR | S_IWUSR); if (fd 0) { *** *** 2237,2243 XLogFileInit(uint32 log, uint32 seg, unlink(tmppath); ! /* do not use get_sync_bit() here --- want to fsync only at end of fill */ fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, S_IRUSR | S_IWUSR); if (fd 0) --- 2246,2252 unlink(tmppath); ! /* do not use sync_bit here --- want to fsync only at end of fill */ fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, S_IRUSR | S_IWUSR); if (fd 0) *** *** 2317,2323 XLogFileInit(uint32 log, uint32 seg, *use_existent = false; /* Now open original target segment (might not be file I just made) */ ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method), S_IRUSR | S_IWUSR); if (fd 0) ereport(ERROR, --- 2326,2332 *use_existent = false; /* Now open original target segment (might not be file I just made) */ ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit, S_IRUSR | S_IWUSR); if (fd 0) ereport(ERROR, *** *** 3121,3127 PreallocXlogFiles(XLogRecPtr endptr) { NextLogSeg(_logId, _logSeg); use_existent = true; ! lf = XLogFileInit(_logId, _logSeg, use_existent, true); close(lf); if (!use_existent) CheckpointStats.ckpt_segs_added++; --- 3130,3136 { NextLogSeg(_logId, _logSeg); use_existent = true; ! lf = XLogFileInit(_logId, _logSeg, use_existent, true, true); close(lf); if (!use_existent) CheckpointStats.ckpt_segs_added++; *** *** 4794,4800 BootStrapXLOG(void) /* Create first XLOG segment file */ use_existent = false; ! openLogFile = XLogFileInit(0, 0, use_existent, false); /* Write the first page with the initial record */ errno = 0; --- 4803,4809 /* Create first XLOG segment file */ use_existent = false;
Re: [HACKERS] Streaming replication on win32, still broken
On Wed, Feb 17, 2010 at 6:00 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net wrote: In that case, O_DIRECT would be counterproductive, no? It maps to FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the cache. So the read in the startup proc is actually guaranteed to reuqire a physical read - of something we just wrote, so it'll almost certainly end up waiting for a rotation, no? Seems like getting rid of O_DIRECT here is the right thing to do, regardless of this. Agreed. I'll remove O_DIRECT from walreceiver. Here is the patch to do that. Ooops! I found the bug in the patch. Here is the updated version. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 1627,1633 XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch) /* create/use new log file */ use_existent = true; openLogFile = XLogFileInit(openLogId, openLogSeg, ! use_existent, true); openLogOff = 0; } --- 1627,1633 /* create/use new log file */ use_existent = true; openLogFile = XLogFileInit(openLogId, openLogSeg, ! use_existent, true, true); openLogOff = 0; } *** *** 2184,2189 XLogNeedsFlush(XLogRecPtr record) --- 2184,2195 * place. This should be TRUE except during bootstrap log creation. The * caller must *not* hold the lock at call. * + * allow_direct_io: if TRUE, allow a WAL write to bypass the kernel cache + * by using PG_O_DIRECT for opening a file. Otherwise, PG_O_DIRECT is + * forcibly removed from the sync flag of open(). This should be FALSE + * only when walreceiver process writes WAL data because it's read + * immediately by the startup process. + * * Returns FD of opened file. * * Note: errors here are ERROR not PANIC because we might or might not be *** *** 2193,2199 XLogNeedsFlush(XLogRecPtr record) */ int XLogFileInit(uint32 log, uint32 seg, ! bool *use_existent, bool use_lock) { char path[MAXPGPATH]; char tmppath[MAXPGPATH]; --- 2199,2205 */ int XLogFileInit(uint32 log, uint32 seg, ! bool *use_existent, bool use_lock, bool allow_direct_io) { char path[MAXPGPATH]; char tmppath[MAXPGPATH]; *** *** 2203,2208 XLogFileInit(uint32 log, uint32 seg, --- 2209,2219 int max_advance; int fd; int nbytes; + int sync_bit; + + sync_bit = get_sync_bit(sync_method); + if (!allow_direct_io) + sync_bit = ~PG_O_DIRECT; XLogFilePath(path, ThisTimeLineID, log, seg); *** *** 2211,2217 XLogFileInit(uint32 log, uint32 seg, */ if (*use_existent) { ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method), S_IRUSR | S_IWUSR); if (fd 0) { --- ,2228 */ if (*use_existent) { ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit, S_IRUSR | S_IWUSR); if (fd 0) { *** *** 2237,2243 XLogFileInit(uint32 log, uint32 seg, unlink(tmppath); ! /* do not use get_sync_bit() here --- want to fsync only at end of fill */ fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, S_IRUSR | S_IWUSR); if (fd 0) --- 2248,2254 unlink(tmppath); ! /* do not use sync_bit here --- want to fsync only at end of fill */ fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, S_IRUSR | S_IWUSR); if (fd 0) *** *** 2317,2323 XLogFileInit(uint32 log, uint32 seg, *use_existent = false; /* Now open original target segment (might not be file I just made) */ ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method), S_IRUSR | S_IWUSR); if (fd 0) ereport(ERROR, --- 2328,2334 *use_existent = false; /* Now open original target segment (might not be file I just made) */ ! fd = BasicOpenFile(path, O_RDWR | PG_BINARY | sync_bit, S_IRUSR | S_IWUSR); if (fd 0) ereport(ERROR, *** *** 3121,3127 PreallocXlogFiles(XLogRecPtr endptr) { NextLogSeg(_logId, _logSeg); use_existent = true; ! lf = XLogFileInit(_logId, _logSeg, use_existent, true); close(lf); if (!use_existent) CheckpointStats.ckpt_segs_added++; --- 3132,3138 { NextLogSeg(_logId, _logSeg); use_existent = true; ! lf = XLogFileInit(_logId, _logSeg, use_existent, true, true); close(lf); if (!use_existent) CheckpointStats.ckpt_segs_added++; *** *** 4794,4800 BootStrapXLOG(void) /* Create first XLOG segment file */ use_existent = false; ! openLogFile = XLogFileInit(0, 0, use_existent, false);
Re: [HACKERS] Streaming replication on win32, still broken
Fujii Masao wrote: On Wed, Feb 17, 2010 at 6:00 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Feb 17, 2010 at 4:07 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net wrote: In that case, O_DIRECT would be counterproductive, no? It maps to FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the cache. So the read in the startup proc is actually guaranteed to reuqire a physical read - of something we just wrote, so it'll almost certainly end up waiting for a rotation, no? Seems like getting rid of O_DIRECT here is the right thing to do, regardless of this. Agreed. I'll remove O_DIRECT from walreceiver. Here is the patch to do that. Ooops! I found the bug in the patch. Here is the updated version. If I'm reading the patch correctly, when wal_sync_method is 'open_sync', walreceiver nevertheless opens the WAL file without the O_DIRECT flag. When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(), issue_xlog_fsync() will do nothing because it assumes the write() synced it already. So the data written isn't being forced to disk at all. How about just forcing sync_method to 'fsync' in walreceiver? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: If I'm reading the patch correctly, when wal_sync_method is 'open_sync', walreceiver nevertheless opens the WAL file without the O_DIRECT flag. When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(), issue_xlog_fsync() will do nothing because it assumes the write() synced it already. So the data written isn't being forced to disk at all. When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC flag. So I think that write() flushes the data to disk even if O_DIRECT flag is not given. Am I missing something? How about just forcing sync_method to 'fsync' in walreceiver? In win32, O_DSYNC seems to be preferred to 'fsync' so far. So I'm not sure if reshuffling of priority is harmless. http://archives.postgresql.org/pgsql-hackers-win32/2005-03/msg00148.php Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
Fujii Masao wrote: On Thu, Feb 18, 2010 at 5:28 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: If I'm reading the patch correctly, when wal_sync_method is 'open_sync', walreceiver nevertheless opens the WAL file without the O_DIRECT flag. When it later flushes it in XLogWalRcvFlush() by issue_xlog_fsync(), issue_xlog_fsync() will do nothing because it assumes the write() synced it already. So the data written isn't being forced to disk at all. When 'open_sync' is chosen, the WAL file is opened with O_SYNC or O_FSYNC flag. So I think that write() flushes the data to disk even if O_DIRECT flag is not given. Am I missing something? Ah, ok, you're right. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net wrote: With the libpq fixes, I get further (more on that fix later, btw), but now I get stuck in this. When I do something on the master that generates WAL, such as insert a record, and then try to query this on the slave, the walreceiver process crashes with: PANIC: XX000: could not write to log file 0, segment 9 at offset 0, length 160: Invalid argument LOCATION: XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487 I'll keep digging at the details, but if somebody has a good idea here.. ;) Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too. Though I've not idenfied the cause yet, I guess that it derives from wrong use of the type of local variables in XLogWalRcvWrite(). I'll continue investigation of it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
2010/2/16 Fujii Masao masao.fu...@gmail.com: On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net wrote: With the libpq fixes, I get further (more on that fix later, btw), but now I get stuck in this. When I do something on the master that generates WAL, such as insert a record, and then try to query this on the slave, the walreceiver process crashes with: PANIC: XX000: could not write to log file 0, segment 9 at offset 0, length 160: Invalid argument LOCATION: XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487 I'll keep digging at the details, but if somebody has a good idea here.. ;) Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too. Though I've not idenfied the cause yet, I guess that it derives from wrong use of the type of local variables in XLogWalRcvWrite(). I'll continue investigation of it. Thanks! I will be somewhat spottily available over the next two days due to on-site work with clients. Let me know if you would be helped by some details of how to get a (somewhat faster) EC2 image up and running with MSVC to test on :-) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Tue, Feb 16, 2010 at 7:20 PM, Magnus Hagander mag...@hagander.net wrote: 2010/2/16 Fujii Masao masao.fu...@gmail.com: On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net wrote: With the libpq fixes, I get further (more on that fix later, btw), but now I get stuck in this. When I do something on the master that generates WAL, such as insert a record, and then try to query this on the slave, the walreceiver process crashes with: PANIC: XX000: could not write to log file 0, segment 9 at offset 0, length 160: Invalid argument LOCATION: XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487 I'll keep digging at the details, but if somebody has a good idea here.. ;) Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too. Though I've not idenfied the cause yet, I guess that it derives from wrong use of the type of local variables in XLogWalRcvWrite(). I'll continue investigation of it. Thanks! I will be somewhat spottily available over the next two days due to on-site work with clients. Let me know if you would be helped by some details of how to get a (somewhat faster) EC2 image up and running with MSVC to test on :-) Thanks! I can probably use the EC2 image by reading your great blog post. http://blog.hagander.net/archives/151-Testing-PostgreSQL-patches-on-Windows-using-Amazon-EC2.html But it might take some time to make my sysadmin open the port for rdesktop for some reasons... Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
2010/2/16 Fujii Masao masao.fu...@gmail.com: On Tue, Feb 16, 2010 at 7:20 PM, Magnus Hagander mag...@hagander.net wrote: 2010/2/16 Fujii Masao masao.fu...@gmail.com: On Tue, Feb 16, 2010 at 12:37 AM, Magnus Hagander mag...@hagander.net wrote: With the libpq fixes, I get further (more on that fix later, btw), but now I get stuck in this. When I do something on the master that generates WAL, such as insert a record, and then try to query this on the slave, the walreceiver process crashes with: PANIC: XX000: could not write to log file 0, segment 9 at offset 0, length 160: Invalid argument LOCATION: XLogWalRcvWrite, .\src\backend\replication\walreceiver.c:487 I'll keep digging at the details, but if somebody has a good idea here.. ;) Yeah, this problem was reproduced in my (very slow :-( ) MinGW environment, too. Though I've not idenfied the cause yet, I guess that it derives from wrong use of the type of local variables in XLogWalRcvWrite(). I'll continue investigation of it. Thanks! I will be somewhat spottily available over the next two days due to on-site work with clients. Let me know if you would be helped by some details of how to get a (somewhat faster) EC2 image up and running with MSVC to test on :-) Thanks! I can probably use the EC2 image by reading your great blog post. http://blog.hagander.net/archives/151-Testing-PostgreSQL-patches-on-Windows-using-Amazon-EC2.html Actually, that one deosn't work anymore, because I managed to break the image :-) If you send me your amazon id, I can get you premissions on my private image. I plan to clean it up and make it public, just haven't gotten around to it yet... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander mag...@hagander.net wrote: If you send me your amazon id, I can get you premissions on my private image. I plan to clean it up and make it public, just haven't gotten around to it yet... Thanks for your concern! I'll send the ID when I complete the preparation. And, fortunately?, when I set wal_sync_method to open_sync, the problem was reproduced in the linux, too. The cause is that the data that is written by walreceiver is not aligned, even if O_DIRECT is used. On win32, O_DIRECT is used by default. So the problem always happened on win32. I propose two solution ideas: 1. O_DIRECT is somewhat harmful in the standby since the data written by walreceiver is read by the startup process immediately. So, how about not making only walreceiver use O_DIRECT? 2. Straightforwardly observe the alignment rule. Since the received WAL data might start at the middle of WAL block, walreceiver needs to keep the last half-written WAL block for alignment. OTOH since the received data might end at the middle of WAL block, walreceiver needs zero-padding. As a result, walreceiver writes the set of the last WAL block, received data and zero-padding. Which is better? Or do you have another better idea? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Wed, Feb 17, 2010 at 06:55, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Feb 17, 2010 at 6:28 AM, Magnus Hagander mag...@hagander.net wrote: If you send me your amazon id, I can get you premissions on my private image. I plan to clean it up and make it public, just haven't gotten around to it yet... Thanks for your concern! I'll send the ID when I complete the preparation. ok. And, fortunately?, when I set wal_sync_method to open_sync, the problem was reproduced in the linux, too. The cause is that the data that is written by Ah, that's good. It always helps if it's a cross-platform issue - particularly in that it's not one of the funky win32 specific things we did :) walreceiver is not aligned, even if O_DIRECT is used. On win32, O_DIRECT is used by default. So the problem always happened on win32. Ahh. I see. I propose two solution ideas: 1. O_DIRECT is somewhat harmful in the standby since the data written by walreceiver is read by the startup process immediately. So, how about not making only walreceiver use O_DIRECT? In that case, O_DIRECT would be counterproductive, no? It maps to FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the cache. So the read in the startup proc is actually guaranteed to reuqire a physical read - of something we just wrote, so it'll almost certainly end up waiting for a rotation, no? Seems like getting rid of O_DIRECT here is the right thing to do, regardless of this. 2. Straightforwardly observe the alignment rule. Since the received WAL data might start at the middle of WAL block, walreceiver needs to keep the last half-written WAL block for alignment. OTOH since the received data might end at the middle of WAL block, walreceiver needs zero-padding. As a result, walreceiver writes the set of the last WAL block, received data and zero-padding. May there be other reasons to d this as well? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
Magnus Hagander mag...@hagander.net writes: On Wed, Feb 17, 2010 at 06:55, Fujii Masao masao.fu...@gmail.com wrote: 2. Straightforwardly observe the alignment rule. Since the received WAL data might start at the middle of WAL block, walreceiver needs to keep the last half-written WAL block for alignment. OTOH since the received data might end at the middle of WAL block, walreceiver needs zero-padding. As a result, walreceiver writes the set of the last WAL block, received data and zero-padding. May there be other reasons to d this as well? Writing misaligned data is certain to be expensive even when it works... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Wed, Feb 17, 2010 at 3:03 PM, Magnus Hagander mag...@hagander.net wrote: In that case, O_DIRECT would be counterproductive, no? It maps to FILE_FLAG_NOI_BUFFERING, which makes sure it doesn't go into the cache. So the read in the startup proc is actually guaranteed to reuqire a physical read - of something we just wrote, so it'll almost certainly end up waiting for a rotation, no? Seems like getting rid of O_DIRECT here is the right thing to do, regardless of this. Agreed. I'll remove O_DIRECT from walreceiver. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Streaming replication on win32, still broken
On Wed, Feb 17, 2010 at 3:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Wed, Feb 17, 2010 at 06:55, Fujii Masao masao.fu...@gmail.com wrote: 2. Straightforwardly observe the alignment rule. Since the received WAL data might start at the middle of WAL block, walreceiver needs to keep the last half-written WAL block for alignment. OTOH since the received data might end at the middle of WAL block, walreceiver needs zero-padding. As a result, walreceiver writes the set of the last WAL block, received data and zero-padding. May there be other reasons to d this as well? Writing misaligned data is certain to be expensive even when it works... Yeah, right. After I remove O_DIRECT, I'll change walreceiver so as to do an alignment correctly, and then I'll test the performance. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers