Re: [HACKERS] Freeze avoidance of very large table.

2016-03-11 Thread Joshua D. Drake

On 03/11/2016 09:48 AM, Masahiko Sawada wrote:



Thank you so much!
What I wanted deal with in thread is almost done. I'm going to more
test the feature for 9.6 releasing.


Nicely done!



Regards,

--
Masahiko Sawada





--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-11 Thread Masahiko Sawada
On Sat, Mar 12, 2016 at 2:37 AM, Robert Haas  wrote:
> On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada  
> wrote:
>>> Thanks.  I adopted some of your suggested, rejected others, fixed a
>>> few minor things that I missed previously, and committed this.  If you
>>> think any of the changes that I rejected still have merit, please
>>> resubmit those changes as separate patches.
>>
>> Thank you for your effort to this feature and committing it.
>> I guess that I couldn't do good work to this feature at final stage,
>> but I really appreciate all your advice and suggestion.
>
> Don't feel bad, you put a lot of work on this, and if you were getting
> a little tired towards the end, that's very understandable.  This
> extremely important feature was largely driven by you, and that's a
> big accomplishment.
>
>> I got your point.
>> Attached latest patch can skip to write the last part of last old page
>> if it's empty.
>> Please review it.
>
> Committed.
>
> Which I think just about brings us to the end of this epic journey,
> except for any cleanup of what's already been committed that needs to
> be done.  Thanks so much for your hard work!
>

Thank you so much!
What I wanted deal with in thread is almost done. I'm going to more
test the feature for 9.6 releasing.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-11 Thread Robert Haas
On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada  wrote:
>> Thanks.  I adopted some of your suggested, rejected others, fixed a
>> few minor things that I missed previously, and committed this.  If you
>> think any of the changes that I rejected still have merit, please
>> resubmit those changes as separate patches.
>
> Thank you for your effort to this feature and committing it.
> I guess that I couldn't do good work to this feature at final stage,
> but I really appreciate all your advice and suggestion.

Don't feel bad, you put a lot of work on this, and if you were getting
a little tired towards the end, that's very understandable.  This
extremely important feature was largely driven by you, and that's a
big accomplishment.

> I got your point.
> Attached latest patch can skip to write the last part of last old page
> if it's empty.
> Please review it.

Committed.

Which I think just about brings us to the end of this epic journey,
except for any cleanup of what's already been committed that needs to
be done.  Thanks so much for your hard work!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-10 Thread Masahiko Sawada
On Fri, Mar 11, 2016 at 6:16 AM, Robert Haas  wrote:
> On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada  
> wrote:
>> On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas  wrote:
>>> This 001 patch looks so little like what I was expecting that I
>>> decided to start over from scratch.  The new version I wrote is
>>> attached here.  I don't understand why your version tinkers with the
>>> logic for setting the all-frozen bit; I thought that what I already
>>> committed dealt with that already, and in any case, your version
>>> doesn't even compile against latest sources.  Your version also leaves
>>> the scan_all terminology intact even though it's not accurate any
>>> more, and I am not very convinced that the updates to the
>>> page-skipping logic are actually correct.  Please have a look over
>>> this version and see what you think.
>>
>> Thank you for your advise.
>> Sorry, optimising logic of previous patch was old by mistake.
>> Attached latest patch incorporated your suggestions with a little revising.
>
> Thanks.  I adopted some of your suggested, rejected others, fixed a
> few minor things that I missed previously, and committed this.  If you
> think any of the changes that I rejected still have merit, please
> resubmit those changes as separate patches.
>

Thank you for your effort to this feature and committing it.
I guess that I couldn't do good work to this feature at final stage,
but I really appreciate all your advice and suggestion.

> I think what I really want is some logic so that if we have a 1-page
> visibility map in the old cluster and the second half of that page is
> all zeroes, we only create a 1-page visibility map in the new cluster
> rather than a 2-page visibility map.
>
> Or more generally, if the old VM is N pages, but the last half of the
> last page is empty, then let the output VM be 2*N-1 pages instead of
> 2*N pages.
>

I got your point.
Attached latest patch can skip to write the last part of last old page
if it's empty.
Please review it.

Regards,

--
Masahiko Sawada
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..7783b8a 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,10 +9,16 @@
 
 #include "postgres_fe.h"
 
+#include "access/visibilitymap.h"
 #include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
 
+#include 
 #include 
 
+#define BITS_PER_HEAPBLOCK_OLD 1
 
 
 #ifndef WIN32
@@ -138,6 +144,156 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
 #endif
 
 
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table.  So, instead, we rewrite the old
+ * visibility maps in the new format.  That way, the all-visible bit
+ * remains set for the pages for which it was set previously.  The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+	int			src_fd = 0;
+	int			dst_fd = 0;
+	char		buffer[BLCKSZ];
+	ssize_t		bytesRead;
+	ssize_t		src_filesize;
+	int			rewriteVmBytesPerPage;
+	BlockNumber new_blkno = 0;
+	struct stat	statbuf;
+
+	/* Compute we need how many old page bytes to rewrite a new page */
+	rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+	if ((fromfile == NULL) || (tofile == NULL))
+		return "Invalid old file or new file";
+
+	if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+		return getErrorText();
+
+	if (fstat(src_fd, ) != 0)
+	{
+		close(src_fd);
+		return getErrorText();
+	}
+
+	if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+	{
+		close(src_fd);
+		return getErrorText();
+	}
+
+	/* Save old file size */
+	src_filesize = statbuf.st_size;
+
+	/*
+	 * Turn each visibility map page into 2 pages one by one. Each new page
+	 * has the same page header as the old one.  If last section of last page
+	 * is empty, we skip to write it. That is, more generally the old visibility
+	 * map is N pages, but the last part of the last page is empty, this routine
+	 * ouputs (BITS_PER_HEAPBLOCK / BITS_PER_HEAPBLOCK_OLD) * N - 1 pages instead
+	 * of (BITS_PER_HEAPBLOCK / BITS_PER_HEAPBLOCK_OLD) * N pages.
+	 */
+	while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+	{
+		char		*old_cur;
+		char		*old_break;
+		char		*old_blkend;
+		PageHeaderData pageheader;
+		bool		old_lastblk = ((BLCKSZ * (new_blkno + 1)) == src_filesize);
+
+		/* Save the page header data */
+		memcpy(, buffer, 

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-10 Thread Robert Haas
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada  wrote:
> On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas  wrote:
>> This 001 patch looks so little like what I was expecting that I
>> decided to start over from scratch.  The new version I wrote is
>> attached here.  I don't understand why your version tinkers with the
>> logic for setting the all-frozen bit; I thought that what I already
>> committed dealt with that already, and in any case, your version
>> doesn't even compile against latest sources.  Your version also leaves
>> the scan_all terminology intact even though it's not accurate any
>> more, and I am not very convinced that the updates to the
>> page-skipping logic are actually correct.  Please have a look over
>> this version and see what you think.
>
> Thank you for your advise.
> Sorry, optimising logic of previous patch was old by mistake.
> Attached latest patch incorporated your suggestions with a little revising.

Thanks.  I adopted some of your suggested, rejected others, fixed a
few minor things that I missed previously, and committed this.  If you
think any of the changes that I rejected still have merit, please
resubmit those changes as separate patches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-10 Thread Robert Haas
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada  wrote:
> On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas  wrote:
>> This 001 patch looks so little like what I was expecting that I
>> decided to start over from scratch.  The new version I wrote is
>> attached here.  I don't understand why your version tinkers with the
>> logic for setting the all-frozen bit; I thought that what I already
>> committed dealt with that already, and in any case, your version
>> doesn't even compile against latest sources.  Your version also leaves
>> the scan_all terminology intact even though it's not accurate any
>> more, and I am not very convinced that the updates to the
>> page-skipping logic are actually correct.  Please have a look over
>> this version and see what you think.
>
> Thank you for your advise.
> Sorry, optimising logic of previous patch was old by mistake.
> Attached latest patch incorporated your suggestions with a little revising.

OK, I'll have a look.  Thanks.

>> I think that's kind of pointless.  We need to test that this
>> conversion code works, but once it does, I don't think we should make
>> everybody pay the overhead of retesting that.  Anyway, the test code
>> could have bugs, too.
>>
>> Here's an updated version of your patch with that code removed and
>> some cosmetic cleanups like fixing typos and stuff like that.  I think
>> this is mostly ready to commit, but I noticed one problem: your
>> conversion code always produces two output pages for each input page
>> even if one of them would be empty.  In particular, if you have a
>> large number of small relations and run pg_upgrade, all of their
>> visibility maps will go from 8kB to 16kB.  That isn't the end of the
>> world, maybe, but I think you should see if you can't fix it
>> somehow
>
> Thank you for updating patch.
> To deal with this problem, I've changed it so that pg_upgrade checks
> file size before conversion.
> And if fork file does not exist or size is 0 (empty), ignore.
> Attached latest patch.

I think what I really want is some logic so that if we have a 1-page
visibility map in the old cluster and the second half of that page is
all zeroes, we only create a 1-page visibility map in the new cluster
rather than a 2-page visibility map.

Or more generally, if the old VM is N pages, but the last half of the
last page is empty, then let the output VM be 2*N-1 pages instead of
2*N pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-10 Thread Masahiko Sawada
On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas  wrote:
> This 001 patch looks so little like what I was expecting that I
> decided to start over from scratch.  The new version I wrote is
> attached here.  I don't understand why your version tinkers with the
> logic for setting the all-frozen bit; I thought that what I already
> committed dealt with that already, and in any case, your version
> doesn't even compile against latest sources.  Your version also leaves
> the scan_all terminology intact even though it's not accurate any
> more, and I am not very convinced that the updates to the
> page-skipping logic are actually correct.  Please have a look over
> this version and see what you think.

Thank you for your advise.
Sorry, optimising logic of previous patch was old by mistake.
Attached latest patch incorporated your suggestions with a little revising.

>
> I think that's kind of pointless.  We need to test that this
> conversion code works, but once it does, I don't think we should make
> everybody pay the overhead of retesting that.  Anyway, the test code
> could have bugs, too.
>
> Here's an updated version of your patch with that code removed and
> some cosmetic cleanups like fixing typos and stuff like that.  I think
> this is mostly ready to commit, but I noticed one problem: your
> conversion code always produces two output pages for each input page
> even if one of them would be empty.  In particular, if you have a
> large number of small relations and run pg_upgrade, all of their
> visibility maps will go from 8kB to 16kB.  That isn't the end of the
> world, maybe, but I think you should see if you can't fix it
> somehow

Thank you for updating patch.
To deal with this problem, I've changed it so that pg_upgrade checks
file size before conversion.
And if fork file does not exist or size is 0 (empty), ignore.
Attached latest patch.

Regards,

--
Masahiko Sawada


001_optimize_vacuum_by_frozen_bit_v40.patch
Description: Binary data


000_pgupgrade_rewrite_vm_v42.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-10 Thread Robert Haas
On Thu, Mar 10, 2016 at 8:51 AM, Masahiko Sawada  wrote:
> After some further thought, I thought that it's better to add check
> logic for result of rewriting visibility map to upgrading logic rather
> than regression test in order to ensure that rewriting visibility map
> has been successfully done.
> As a draft, attached patch checks the result of rewriting visibility
> map after rewrote for each relation as a routine of pg_upgrade.
> The disadvantage point of this is that we need to scan each visibility
> map page for 2 times.
> But since visibility map size would not be so large, it would not bad.
> Thoughts?

I think that's kind of pointless.  We need to test that this
conversion code works, but once it does, I don't think we should make
everybody pay the overhead of retesting that.  Anyway, the test code
could have bugs, too.

Here's an updated version of your patch with that code removed and
some cosmetic cleanups like fixing typos and stuff like that.  I think
this is mostly ready to commit, but I noticed one problem: your
conversion code always produces two output pages for each input page
even if one of them would be empty.  In particular, if you have a
large number of small relations and run pg_upgrade, all of their
visibility maps will go from 8kB to 16kB.  That isn't the end of the
world, maybe, but I think you should see if you can't fix it
somehow

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..34e1451 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,10 +9,15 @@
 
 #include "postgres_fe.h"
 
+#include "access/visibilitymap.h"
 #include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
 
 #include 
 
+#define BITS_PER_HEAPBLOCK_OLD 1
 
 
 #ifndef WIN32
@@ -138,6 +143,130 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
 #endif
 
 
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table.  So, instead, we rewrite the old
+ * visibility maps in the new format.  That way, the all-visible bit
+ * remains set for the pages for which it was set previously.  The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+	int			src_fd = 0;
+	int			dst_fd = 0;
+	char		buffer[BLCKSZ];
+	ssize_t		bytesRead;
+	int			rewriteVmBytesPerPage;
+	BlockNumber blkno = 0;
+
+	/* Compute we need how many old page bytes to rewrite a new page */
+	rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+	if ((fromfile == NULL) || (tofile == NULL))
+		return "Invalid old file or new file";
+
+	if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+		return getErrorText();
+
+	if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+	{
+		close(src_fd);
+		return getErrorText();
+	}
+
+	/*
+	 * Turn each visibility map page into 2 pages one by one. Each new page
+	 * has the same page header as the old one.
+	 */
+	while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+	{
+		char	   *old_cur,
+   *old_break,
+   *old_blkend;
+		PageHeaderData pageheader;
+
+		/* Save the page header data */
+		memcpy(, buffer, SizeOfPageHeaderData);
+
+		/*
+		 * These old_* variables point to old visibility map page. old_cur
+		 * points to current position on old page. old_blkend points to end of
+		 * old block. old_break points to old page break position for rewriting
+		 * a new page. After wrote a new page, old_break proceeds
+		 * rewriteVmBytesPerPage bytes.
+		 */
+		old_cur = buffer + SizeOfPageHeaderData;
+		old_blkend = buffer + bytesRead;
+		old_break = old_cur + rewriteVmBytesPerPage;
+
+		while (old_blkend >= old_break)
+		{
+			char		vmbuf[BLCKSZ];
+			char	   *new_cur = vmbuf;
+
+			/* Copy page header in advance */
+			memcpy(vmbuf, , SizeOfPageHeaderData);
+
+			new_cur += SizeOfPageHeaderData;
+
+			/*
+			 * Process old page bytes one by one, and turn it into new page.
+			 */
+			while (old_break > old_cur)
+			{
+uint16		new_vmbits = 0;
+int			i;
+
+/* Generate new format bits while keeping old information */
+for (i = 0; i < BITS_PER_BYTE; i++)
+{
+	uint8	byte = * (uint8 *) old_cur;
+
+	if (((byte & (1 << (BITS_PER_HEAPBLOCK_OLD * i != 0)
+		new_vmbits |= 1 << (BITS_PER_HEAPBLOCK * i);
+}
+
+/* Copy new visibility map bit to new format page */
+

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-10 Thread Masahiko Sawada
On Thu, Mar 10, 2016 at 3:27 PM, Masahiko Sawada  wrote:
> Thank you for reviewing!
> Attached updated patch.
>
>
> On Thu, Mar 10, 2016 at 3:37 AM, Robert Haas  wrote:
>> On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
>>  wrote: Attached latest 2 patches.
>>> * 000 patch : Incorporated the review comments and made rewriting
>>> logic more clearly.
>>
>> That's better, thanks.  But your comments don't survive pgindent.
>> After running pgindent, I get this:
>>
>> +   /*
>> +* These old_* variables point to old visibility map page.
>> +*
>> +* cur_old: Points to current position on old
>> page. blkend_old :
>> +* Points to end of old block. break_old  : Points to
>> old page break
>> +* position for rewriting a new page. After wrote a
>> new page, old_end
>> +* proceeds rewriteVmBytesPerPgae bytes.
>> +*/
>>
>> You need to either surround this sort of thing with dashes to make
>> pgindent ignore it, or, probably better, rewrite it using complete
>> sentences that together form a paragraph.
>
> Fixed.
>
>>
>> +   Oid pg_database_oid;/* OID of
>> pg_database relation */
>>
>> Not used anywhere?
>
> Fixed.
>
>> Instead of vm_need_rewrite, how about vm_must_add_frozenbit?
>
> Fixed.
>
>> Can you explain the changes to test.sh?
>
> Current regression test scenario is,
> 1. Do 'make check' on pre-upgrade cluster
> 2. Dump relallvisible values of all relation in pre-upgrade cluster to
> vm_test1.txt
> 3. Do pg_upgrade
> 4. Do analyze (not vacuum), dump relallvisibile values of all relation
> in post-upgrade cluster to vm_test2.txt
> 5. Compare between vm_test1.txt and vm_test2.txt
>
> That is, regression test compares between relallvisible values in
> pre-upgrade cluster and post-upgrade cluster.
> But because test.sh always uses pre/post clusters with same catalog
> version, I realized that we cannot ensure that visibility map
> rewriting is processed successfully on test.sh framework.
> Rewriting visibility map never be executed.
> We might need to have another framework for rewriting visibility map page..
>

After some further thought, I thought that it's better to add check
logic for result of rewriting visibility map to upgrading logic rather
than regression test in order to ensure that rewriting visibility map
has been successfully done.
As a draft, attached patch checks the result of rewriting visibility
map after rewrote for each relation as a routine of pg_upgrade.
The disadvantage point of this is that we need to scan each visibility
map page for 2 times.
But since visibility map size would not be so large, it would not bad.
Thoughts?

Regards,


-- 
Regards,

--
Masahiko Sawada
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..6fd1460 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,10 +9,15 @@
 
 #include "postgres_fe.h"
 
+#include "access/visibilitymap.h"
 #include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
 
 #include 
 
+#define BITS_PER_HEAPBLOCK_OLD 1
 
 
 #ifndef WIN32
@@ -21,6 +26,7 @@ static int	copy_file(const char *fromfile, const char *tofile, bool force);
 static int	win32_pghardlink(const char *src, const char *dst);
 #endif
 
+static bool checkRewriteVisibilityMap(const char *oldfile, const char *newfile);
 
 /*
  * copyFile()
@@ -138,6 +144,235 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
 #endif
 
 
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table.  So, instead, we rewrite the old
+ * visibility maps in the new format.  That way, the all-visible bit
+ * remains set for the pages for which it was set previously.  The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+	int			src_fd = 0;
+	int			dst_fd = 0;
+	char		buffer[BLCKSZ];
+	ssize_t 	bytesRead;
+	int			rewriteVmBytesPerPage;
+	BlockNumber	blkno = 0;
+
+	/* Compute we need how many old page bytes to rewrite a new page */
+	rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+	if ((fromfile == NULL) || (tofile == NULL))
+		return "Invalid old file or new file";
+
+	if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+		return getErrorText();
+
+	if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-10 Thread Robert Haas
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada  wrote:
> * 001 patch : Incorporated the documentation suggestions and updated
> logic a little.

This 001 patch looks so little like what I was expecting that I
decided to start over from scratch.  The new version I wrote is
attached here.  I don't understand why your version tinkers with the
logic for setting the all-frozen bit; I thought that what I already
committed dealt with that already, and in any case, your version
doesn't even compile against latest sources.  Your version also leaves
the scan_all terminology intact even though it's not accurate any
more, and I am not very convinced that the updates to the
page-skipping logic are actually correct.  Please have a look over
this version and see what you think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..2f72633 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,12 +5984,15 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an aggressive scan if the table's
 pg_class.relfrozenxid field has reached
-the age specified by this setting.  The default is 150 million
-transactions.  Although users can set this value anywhere from zero to
-two billions, VACUUM will silently limit the effective value
-to 95% of , so that a
+the age specified by this setting.  An aggressive scan differs from
+a regular VACUUM in that it visits every page that might
+contain unfrozen XIDs or MXIDs, not just those that might contain dead
+tuples.  The default is 150 million transactions.  Although users can
+set this value anywhere from zero to two billions, VACUUM
+will silently limit the effective value to 95% of
+, so that a
 periodical manual VACUUM has a chance to run before an
 anti-wraparound autovacuum is launched for the table. For more
 information see
@@ -6028,9 +6031,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an aggressive scan if the table's
 pg_class.relminmxid field has reached
-the age specified by this setting.  The default is 150 million multixacts.
+the age specified by this setting.  An aggressive scan differs from
+a regular VACUUM in that it visits every page that might
+contain unfrozen XIDs or MXIDs, not just those that might contain dead
+tuples.  The default is 150 million multixacts.
 Although users can set this value anywhere from zero to two billions,
 VACUUM will silently limit the effective value to 95% of
 , so that a
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..d742ec9 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -438,22 +438,27 @@

 

-VACUUM normally skips pages that don't have any dead row
-versions, but those pages might still have row versions with old XID
-values.  To ensure all old row versions have been frozen, a
-scan of the whole table is needed.
- controls when
-VACUUM does that: a whole table sweep is forced if
-the table hasn't been fully scanned for vacuum_freeze_table_age
-minus vacuum_freeze_min_age transactions. Setting it to 0
-forces VACUUM to always scan all pages, effectively ignoring
-the visibility map.
+VACUUM uses the visibility map
+to determine which pages of a relation must be scanned.  Normally, it
+will skips pages that don't have any dead row versions even if those pages
+might still have row versions with old XID values.  Therefore, normal
+scans won't succeed in freezing every row version in the table.
+Periodically, VACUUM will perform an aggressive
+vacuum, skipping only those pages which contain neither dead rows nor
+any unfrozen XID or MXID values.
+
+controls when VACUUM does that: all-visible but not all-frozen
+pages are scanned if the number of transactions that have passed since the
+last such scan is greater than vacuum_freeze_table_age minus
+vacuum_freeze_min_age. Setting
+vacuum_freeze_table_age to 0 forces VACUUM to
+use this more aggressive strategy for all scans.

 

 The maximum time that a table can go unvacuumed is two billion
 transactions minus the vacuum_freeze_min_age value at
-the time VACUUM last scanned the whole table.  If it were to go
+the time of the last aggressive vacuum. If it were to go
 unvacuumed for longer than
 that, data loss could result.  To ensure that this does not happen,
 autovacuum is 

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-09 Thread Masahiko Sawada
Thank you for reviewing!
Attached updated patch.


On Thu, Mar 10, 2016 at 3:37 AM, Robert Haas  wrote:
> On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
>  wrote: Attached latest 2 patches.
>> * 000 patch : Incorporated the review comments and made rewriting
>> logic more clearly.
>
> That's better, thanks.  But your comments don't survive pgindent.
> After running pgindent, I get this:
>
> +   /*
> +* These old_* variables point to old visibility map page.
> +*
> +* cur_old: Points to current position on old
> page. blkend_old :
> +* Points to end of old block. break_old  : Points to
> old page break
> +* position for rewriting a new page. After wrote a
> new page, old_end
> +* proceeds rewriteVmBytesPerPgae bytes.
> +*/
>
> You need to either surround this sort of thing with dashes to make
> pgindent ignore it, or, probably better, rewrite it using complete
> sentences that together form a paragraph.

Fixed.

>
> +   Oid pg_database_oid;/* OID of
> pg_database relation */
>
> Not used anywhere?

Fixed.

> Instead of vm_need_rewrite, how about vm_must_add_frozenbit?

Fixed.

> Can you explain the changes to test.sh?

Current regression test scenario is,
1. Do 'make check' on pre-upgrade cluster
2. Dump relallvisible values of all relation in pre-upgrade cluster to
vm_test1.txt
3. Do pg_upgrade
4. Do analyze (not vacuum), dump relallvisibile values of all relation
in post-upgrade cluster to vm_test2.txt
5. Compare between vm_test1.txt and vm_test2.txt

That is, regression test compares between relallvisible values in
pre-upgrade cluster and post-upgrade cluster.
But because test.sh always uses pre/post clusters with same catalog
version, I realized that we cannot ensure that visibility map
rewriting is processed successfully on test.sh framework.
Rewriting visibility map never be executed.
We might need to have another framework for rewriting visibility map page..

Regards,

--
Masahiko Sawada
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..7c5bfa6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
 
 #include "postgres_fe.h"
 
+#include "access/visibilitymap.h"
 #include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
 
 #include 
 
@@ -138,6 +142,130 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
 #endif
 
 
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table.  So, instead, we rewrite the old
+ * visibility maps in the new format.  That way, the all-visible bit
+ * remains set for the pages for which it was set previously.  The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+#define BITS_PER_HEAPBLOCK_OLD 1
+
+	int			src_fd = 0;
+	int			dst_fd = 0;
+	char		buffer[BLCKSZ];
+	ssize_t 	bytesRead;
+	int			rewriteVmBytesPerPage;
+	BlockNumber	blkno = 0;
+
+	/* Compute we need how many old page bytes to rewrite a new page */
+	rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+	/* Reset errno */
+	errno = 0;
+
+	if ((fromfile == NULL) || (tofile == NULL))
+		return getErrorText();
+
+	if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+		goto err;
+
+	if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+		goto err;
+
+	/*
+	 * Turn each visibility map page into 2 pages one by one.
+	 * Rewritten 2 pages have same page header as old page had.
+	 */
+	while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+	{
+		char	*cur_old, *break_old, *blkend_old;
+		PageHeaderData	pageheader;
+
+		/* Save the page header data */
+		memcpy(, buffer, SizeOfPageHeaderData);
+
+		/*
+		 * These old_* variables point to old visibility map page.
+		 * cur_old points to current potision on old page. blend_old
+		 * points to end of old block. break_old points to old page
+		 * break position for rewritin a new page. After wrote a new
+		 * page, break_old proceeds rewriteVmBytesPerPage bytes.
+		 */
+		cur_old = buffer + SizeOfPageHeaderData;
+		blkend_old = buffer + bytesRead;
+		break_old = cur_old + rewriteVmBytesPerPage;
+
+		while (blkend_old >= break_old)
+		{
+			char	vmbuf[BLCKSZ];
+			char	*cur_new = vmbuf;
+
+			/* Copy page header in advance */
+			memcpy(vmbuf, , SizeOfPageHeaderData);
+
+			

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-09 Thread Robert Haas
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
 wrote: Attached latest 2 patches.
> * 000 patch : Incorporated the review comments and made rewriting
> logic more clearly.

That's better, thanks.  But your comments don't survive pgindent.
After running pgindent, I get this:

+   /*
+* These old_* variables point to old visibility map page.
+*
+* cur_old: Points to current position on old
page. blkend_old :
+* Points to end of old block. break_old  : Points to
old page break
+* position for rewriting a new page. After wrote a
new page, old_end
+* proceeds rewriteVmBytesPerPgae bytes.
+*/

You need to either surround this sort of thing with dashes to make
pgindent ignore it, or, probably better, rewrite it using complete
sentences that together form a paragraph.

+   Oid pg_database_oid;/* OID of
pg_database relation */

Not used anywhere?

Instead of vm_need_rewrite, how about vm_must_add_frozenbit?

Can you explain the changes to test.sh?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-09 Thread Masahiko Sawada
On Wed, Mar 9, 2016 at 3:38 AM, Robert Haas  wrote:
> On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada  
> wrote:
>>> How about instead changing things so that we specifically reject
>>> indexes?  And maybe some kind of a check that will reject anything
>>> that lacks a relfilnode?  That seems like it would be more on point.
>>
>> I agree, I don't have strong opinion about this.
>> It would be good to add condition for rejecting only indexes.
>> Attached patches are,
>>  - Change heap2 rmgr description
>>  - Add condition to pg_visibility
>>  - Fix typo in pgvisibility.sgml
>> (Sorry for the late notice..)
>
> OK, committed the first and last of those.  I think the other one
> needs some work yet; the error message doesn't seem like it is quite
> our usual style, and if we're going to do something here we should
> probably also insert a check to throw a better error when there is no
> relfilenode.
>

Thank you for your advising and suggestion!

Attached latest 2 patches.
* 000 patch : Incorporated the review comments and made rewriting
logic more clearly.
* 001 patch : Incorporated the documentation suggestions and updated
logic a little.

Please review them.

Regards,

--
Masahiko Sawada


000_pgupgrade_rewrites_vm_v38.patch
Description: Binary data


001_optimize_vacuum_by_frozen_bit_v38.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Robert Haas
On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada  wrote:
>> How about instead changing things so that we specifically reject
>> indexes?  And maybe some kind of a check that will reject anything
>> that lacks a relfilnode?  That seems like it would be more on point.
>
> I agree, I don't have strong opinion about this.
> It would be good to add condition for rejecting only indexes.
> Attached patches are,
>  - Change heap2 rmgr description
>  - Add condition to pg_visibility
>  - Fix typo in pgvisibility.sgml
> (Sorry for the late notice..)

OK, committed the first and last of those.  I think the other one
needs some work yet; the error message doesn't seem like it is quite
our usual style, and if we're going to do something here we should
probably also insert a check to throw a better error when there is no
relfilenode.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Robert Haas
On Tue, Mar 8, 2016 at 12:49 PM, Tom Lane  wrote:
>> However, after some further thought, I think we might actually be OK.
>> If a page goes from all-frozen to not-all-frozen while VACUUM is
>> running, any new XID added to the page must be newer than the
>> oldestXmin value computed by vacuum_set_xid_limits(), so it won't
>> affect the value to which we can safely set relfrozenxid.  Similarly,
>> any MXID added to the page will be newer than GetOldestMultiXactId(),
>> so setting relminmxid is still safe for similar reasons.
>
> Yeah, I agree with this, as long as the issue is only that the visibility
> map result is slightly stale and not that it's, say, not crash-safe.

If the visibility map isn't crash safe, we've got big problems even
without this patch, but we dealt with that when index-only scans went
in.  Maybe this patch introduces more stringent requirements in this
area, but I can't think of any reason why that should be true.  If
anything occurs to you (or anyone else), it would be good to mention
that before I go further destroy the world.

> We can reasonably assume that any newly-added XID must be one that was
> in progress while VACUUM was running, and hence will be after the xmin
> horizon we computed earlier.  This requires the existence of a read
> barrier somewhere between computing xmin horizon and inspecting the
> visibility map, but I find it hard to believe there aren't plenty.

I'll check that, but I agree that it should be OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Masahiko Sawada
On Wed, Mar 9, 2016 at 1:23 AM, Jeff Janes  wrote:
> On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas  wrote:

> I left out the relkind check from the final commit because, for one
> thing, the check you added isn't actually right: toast relations can
> also have a visibility map.  And also, I'm sort of wondering what the
> point of that check is.  What does it protect us from?  It doesn't
> seem very future-proof ... what if we add a new relkind in the future?
>  Do we really want to have to update this?
>
> How about instead changing things so that we specifically reject
> indexes?  And maybe some kind of a check that will reject anything
> that lacks a relfilnode?  That seems like it would be more on point.
>

I agree, I don't have strong opinion about this.
It would be good to add condition for rejecting only indexes.
Attached patches are,
 - Change heap2 rmgr description
 - Add condition to pg_visibility
 - Fix typo in pgvisibility.sgml
(Sorry for the late notice..)

Regards,

--
Masahiko Sawada
diff --git a/doc/src/sgml/pgvisibility.sgml b/doc/src/sgml/pgvisibility.sgml
index 6a98b55..cdd6a6f 100644
--- a/doc/src/sgml/pgvisibility.sgml
+++ b/doc/src/sgml/pgvisibility.sgml
@@ -21,7 +21,7 @@
   until such time as a tuple is inserted, updated, deleted, or locked on
   that page.  The page-level PD_ALL_VISIBLE bit has the
   same meaning as the all-visible bit in the visibility map, but is stored
-  within the data page itself rather than a separate data tructure.  These
+  within the data page itself rather than a separate data structure.  These
   will normally agree, but the page-level bit can sometimes be set while the
   visibility map bit is clear after a crash recovery; or they can disagree
   because of a change which occurs after pg_visibility examines
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5e5c7cc..c916d0d 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -56,6 +56,13 @@ pg_visibility_map(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("invalid block number")));
 
+	/* Check for relation type */
+	if (rel->rd_rel->relkind == RELKIND_INDEX)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+		RelationGetRelationName(rel;
+
 	tupdesc = pg_visibility_tupdesc(false, false);
 	MemSet(nulls, 0, sizeof(nulls));
 
@@ -95,6 +102,13 @@ pg_visibility(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("invalid block number")));
 
+	/* Check for relation type */
+	if (rel->rd_rel->relkind == RELKIND_INDEX)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+		RelationGetRelationName(rel;
+
 	tupdesc = pg_visibility_tupdesc(false, true);
 	MemSet(nulls, 0, sizeof(nulls));
 
@@ -226,6 +240,13 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
 	rel = relation_open(relid, AccessShareLock);
 	nblocks = RelationGetNumberOfBlocks(rel);
 
+	/* Check for relation type */
+	if (rel->rd_rel->relkind == RELKIND_INDEX)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+		RelationGetRelationName(rel;
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -300,6 +321,13 @@ collect_visibility_data(Oid relid, bool include_pd)
 
 	rel = relation_open(relid, AccessShareLock);
 
+	/* Check for relation type */
+	if (rel->rd_rel->relkind == RELKIND_INDEX)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+		RelationGetRelationName(rel;
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 	info = palloc0(offsetof(vbits, bits) + nblocks);
 	info->next = 0;
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index a63162c..2b31ea4 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -125,7 +125,8 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
 
-		appendStringInfo(buf, "cutoff xid %u", xlrec->cutoff_xid);
+		appendStringInfo(buf, "cutoff xid %u flags %d",
+		 xlrec->cutoff_xid, xlrec->flags);
 	}
 	else if (info == XLOG_HEAP2_MULTI_INSERT)
 	{

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Tom Lane
Robert Haas  writes:
> The patch makes some attempt to update the comment mechanically, but
> that's not nearly enough.  That comment is explaining that you *can't*
> rely on the visibility map to tell you *for sure* that a page does not
> require vacuuming.  For current uses, that's OK, because if we miss a
> page we'll pick it up later.  But now we want to skip vacuuming pages
> for relfrozenxid/relminmxid advancement, that rationale doesn't apply.
> Missing pages that need to be frozen and advancing relfrozenxid anyway
> would be _bad_.

Check.

> However, after some further thought, I think we might actually be OK.
> If a page goes from all-frozen to not-all-frozen while VACUUM is
> running, any new XID added to the page must be newer than the
> oldestXmin value computed by vacuum_set_xid_limits(), so it won't
> affect the value to which we can safely set relfrozenxid.  Similarly,
> any MXID added to the page will be newer than GetOldestMultiXactId(),
> so setting relminmxid is still safe for similar reasons.

Yeah, I agree with this, as long as the issue is only that the visibility
map result is slightly stale and not that it's, say, not crash-safe.
We can reasonably assume that any newly-added XID must be one that was
in progress while VACUUM was running, and hence will be after the xmin
horizon we computed earlier.  This requires the existence of a read
barrier somewhere between computing xmin horizon and inspecting the
visibility map, but I find it hard to believe there aren't plenty.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Robert Haas
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada  wrote:
> Attached latest version optimisation patch.
> I'm still consider regarding pg_upgrade regression test code, so I
> will submit that patch later.

I just spent some time looking at this and I'm a bit worried about the
following (existing) comment in vacuumlazy.c:

 * Note: The value returned by visibilitymap_get_status could be slightly
 * out-of-date, since we make this test before reading the corresponding
 * heap page or locking the buffer.  This is OK.  If we mistakenly think
 * that the page is all-visible when in fact the flag's just been cleared,
 * we might fail to vacuum the page.  But it's OK to skip pages when
 * scan_all is not set, so no great harm done; the next vacuum will find
 * them.  If we make the reverse mistake and vacuum a page unnecessarily,
 * it'll just be a no-op.

The patch makes some attempt to update the comment mechanically, but
that's not nearly enough.  That comment is explaining that you *can't*
rely on the visibility map to tell you *for sure* that a page does not
require vacuuming.  For current uses, that's OK, because if we miss a
page we'll pick it up later.  But now we want to skip vacuuming pages
for relfrozenxid/relminmxid advancement, that rationale doesn't apply.
Missing pages that need to be frozen and advancing relfrozenxid anyway
would be _bad_.

However, after some further thought, I think we might actually be OK.
If a page goes from all-frozen to not-all-frozen while VACUUM is
running, any new XID added to the page must be newer than the
oldestXmin value computed by vacuum_set_xid_limits(), so it won't
affect the value to which we can safely set relfrozenxid.  Similarly,
any MXID added to the page will be newer than GetOldestMultiXactId(),
so setting relminmxid is still safe for similar reasons.

I'd appreciate it if any other senior hackers could review that chain
of reasoning.  It would be really bad to get this wrong.

On another note, I didn't really like the way you updated the
documentation.  "eager freezing" doesn't seem like a great term to me,
and I think your changes were a little too localized.  Here's a draft
alternative where I used the term "aggressive vacuum" to describe
freezing all of the pages except for those already known to be
all-frozen.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..2f72633 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,12 +5984,15 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an aggressive scan if the table's
 pg_class.relfrozenxid field has reached
-the age specified by this setting.  The default is 150 million
-transactions.  Although users can set this value anywhere from zero to
-two billions, VACUUM will silently limit the effective value
-to 95% of , so that a
+the age specified by this setting.  An aggressive scan differs from
+a regular VACUUM in that it visits every page that might
+contain unfrozen XIDs or MXIDs, not just those that might contain dead
+tuples.  The default is 150 million transactions.  Although users can
+set this value anywhere from zero to two billions, VACUUM
+will silently limit the effective value to 95% of
+, so that a
 periodical manual VACUUM has a chance to run before an
 anti-wraparound autovacuum is launched for the table. For more
 information see
@@ -6028,9 +6031,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an aggressive scan if the table's
 pg_class.relminmxid field has reached
-the age specified by this setting.  The default is 150 million multixacts.
+the age specified by this setting.  An aggressive scan differs from
+a regular VACUUM in that it visits every page that might
+contain unfrozen XIDs or MXIDs, not just those that might contain dead
+tuples.  The default is 150 million multixacts.
 Although users can set this value anywhere from zero to two billions,
 VACUUM will silently limit the effective value to 95% of
 , so that a
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..d742ec9 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -438,22 +438,27 @@

 

-VACUUM normally skips pages that don't have any dead row
-versions, but those pages might still have row versions with old XID
-values.  To ensure all old row versions have been frozen, a
-scan of the whole 

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Jeff Janes
On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas  wrote:
> On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada  wrote:
>> Regarding pg_visibility module, I'd like to share some bugs and
>> propose to add a relation type condition to each functions.
>
> OK, thanks.
>
>> Including it, I've attached remaining  2 patches; one is removing page
>> conversion code from pg_upgarde, and another is supporting pg_upgrade
>> for frozen bit.
>
> Committed 001 with minor tweaks.
>
> I find rewrite_vm_table to be pretty opaque.  There's not even a
> comment explaining what it is supposed to do.  And I wonder why we
> really need to be this efficient about it anyway.  Like, would it be
> too expensive to just do this:
>
> for (i = 0; i < BITS_PER_BYTE; ++i)
> if ((old & (1 << i)) != 0)
> new |= 1 << (2 * i);
>
> And how about adding some more comments explaining why we are doing
> this rewriting, like this:
>
> In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
> visibility map included one bit per heap page; it now includes two.
> When upgrading a cluster from before that time to a current PostgreSQL
> version, we could refuse to copy visibility maps from the old cluster
> to the new cluster; the next VACUUM would recreate them, but at the
> price of scanning the entire table.  So, instead, we rewrite the old
> visibility maps in the new format.  That way, the all-visible bit
> remains set for the pages for which it was set previously.  The
> all-frozen bit is never set by this conversion; we leave that to
> VACUUM.
>
> Also, I'm slightly perplexed by the fact that I can't see how this
> code succeeds in turning each page into two pages, which is something
> that it seems like it would need to do.  Wouldn't we need to write out
> the old page header twice, one for the first of the two new pages and
> again for the second?  I probably need more caffeine here, so please
> tell me what I'm missing.

I think that this loop:

   while (blkend >= end)

Executes exactly twice for each iteration of the outer loop.  I'd
rather see it written as a loop which explicitly executes twice,
rather looking like it might execute a dynamic number of times.  I
can't imagine that this code needs to be future-proof.  If we change
the format again in the future, surely we can't just change this code,
we would have to write new code for the new format.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Robert Haas
On Tue, Mar 8, 2016 at 8:30 AM, Robert Haas  wrote:
> On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada  wrote:
>> Regarding pg_visibility module, I'd like to share some bugs and
>> propose to add a relation type condition to each functions.
>
> OK, thanks.

I left out the relkind check from the final commit because, for one
thing, the check you added isn't actually right: toast relations can
also have a visibility map.  And also, I'm sort of wondering what the
point of that check is.  What does it protect us from?  It doesn't
seem very future-proof ... what if we add a new relkind in the future?
 Do we really want to have to update this?

How about instead changing things so that we specifically reject
indexes?  And maybe some kind of a check that will reject anything
that lacks a relfilnode?  That seems like it would be more on point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Robert Haas
On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada  wrote:
> Regarding pg_visibility module, I'd like to share some bugs and
> propose to add a relation type condition to each functions.

OK, thanks.

> Including it, I've attached remaining  2 patches; one is removing page
> conversion code from pg_upgarde, and another is supporting pg_upgrade
> for frozen bit.

Committed 001 with minor tweaks.

I find rewrite_vm_table to be pretty opaque.  There's not even a
comment explaining what it is supposed to do.  And I wonder why we
really need to be this efficient about it anyway.  Like, would it be
too expensive to just do this:

for (i = 0; i < BITS_PER_BYTE; ++i)
if ((old & (1 << i)) != 0)
new |= 1 << (2 * i);

And how about adding some more comments explaining why we are doing
this rewriting, like this:

In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
visibility map included one bit per heap page; it now includes two.
When upgrading a cluster from before that time to a current PostgreSQL
version, we could refuse to copy visibility maps from the old cluster
to the new cluster; the next VACUUM would recreate them, but at the
price of scanning the entire table.  So, instead, we rewrite the old
visibility maps in the new format.  That way, the all-visible bit
remains set for the pages for which it was set previously.  The
all-frozen bit is never set by this conversion; we leave that to
VACUUM.

Also, I'm slightly perplexed by the fact that I can't see how this
code succeeds in turning each page into two pages, which is something
that it seems like it would need to do.  Wouldn't we need to write out
the old page header twice, one for the first of the two new pages and
again for the second?  I probably need more caffeine here, so please
tell me what I'm missing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-08 Thread Masahiko Sawada
On Tue, Mar 8, 2016 at 1:20 PM, Kyotaro HORIGUCHI
 wrote:
> Hello, thank you for updating this tool.
>
> At Mon, 7 Mar 2016 14:03:08 -0500, Robert Haas  wrote 
> in 
>> On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada  
>> wrote:
>> > Attached latest version optimisation patch.
>> > I'm still consider regarding pg_upgrade regression test code, so I
>> > will submit that patch later.
>>
> I was thinking more about this today and I think that we don't
> actually need the PD_ALL_FROZEN page-level bit for anything.  It's
> enough that the bit is present in the visibility map.  The only point
> of PD_ALL_VISIBLE is that it tells us that we need to clear the
> visibility map bit, but that bit is enough to tell us to clear both
> visibility map bits.  So I propose the attached cleanup patch.

Thank you for updating tool and proposing it.
I agree with you, and the patch you attached looks good to me except
for Horiguchi-san's comment.

Regarding pg_visibility module, I'd like to share some bugs and
propose to add a relation type condition to each functions.
Including it, I've attached remaining  2 patches; one is removing page
conversion code from pg_upgarde, and another is supporting pg_upgrade
for frozen bit.

Please have a look at them.

Regards,

--
Masahiko Sawada
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
index 9616e1f..da511e5 100644
--- a/contrib/pg_visibility/pg_visibility--1.0.sql
+++ b/contrib/pg_visibility/pg_visibility--1.0.sql
@@ -12,7 +12,7 @@ AS 'MODULE_PATHNAME', 'pg_visibility_map'
 LANGUAGE C STRICT;
 
 -- Show visibility map and page-level visibility information.
-CREATE FUNCTION pg_visibility(regclass, blkno, bigint,
+CREATE FUNCTION pg_visibility(regclass, blkno bigint,
 			  all_visible OUT boolean,
 			  all_frozen OUT boolean,
 			  pd_all_visible OUT boolean)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index d4336ce..2993bcb 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -14,6 +14,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "utils/rel.h"
 
 PG_MODULE_MAGIC;
 
@@ -55,6 +56,14 @@ pg_visibility_map(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("invalid block number")));
 
+	/* Check for relation type */
+	if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+		  rel->rd_rel->relkind == RELKIND_MATVIEW))
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or materialized view",
+		RelationGetRelationName(rel;
+
 	tupdesc = pg_visibility_tupdesc(false, false);
 	MemSet(nulls, 0, sizeof(nulls));
 
@@ -94,6 +103,14 @@ pg_visibility(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("invalid block number")));
 
+	/* Check for relation type */
+	if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+		  rel->rd_rel->relkind == RELKIND_MATVIEW))
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or materialized view",
+		RelationGetRelationName(rel;
+
 	tupdesc = pg_visibility_tupdesc(false, true);
 	MemSet(nulls, 0, sizeof(nulls));
 
@@ -147,9 +164,10 @@ pg_visibility_map_rel(PG_FUNCTION_ARGS)
 		HeapTuple	tuple;
 
 		MemSet(nulls, 0, sizeof(nulls));
-		values[0] = Int64GetDatum(info->next++);
+		values[0] = Int64GetDatum(info->next);
 		values[1] = BoolGetDatum((info->bits[info->next] & (1 << 0)) != 0);
 		values[2] = BoolGetDatum((info->bits[info->next] & (1 << 1)) != 0);
+		info->next++;
 
 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
 		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
@@ -190,10 +208,11 @@ pg_visibility_rel(PG_FUNCTION_ARGS)
 		HeapTuple	tuple;
 
 		MemSet(nulls, 0, sizeof(nulls));
-		values[0] = Int64GetDatum(info->next++);
+		values[0] = Int64GetDatum(info->next);
 		values[1] = BoolGetDatum((info->bits[info->next] & (1 << 0)) != 0);
 		values[2] = BoolGetDatum((info->bits[info->next] & (1 << 1)) != 0);
 		values[3] = BoolGetDatum((info->bits[info->next] & (1 << 2)) != 0);
+		info->next++;
 
 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
 		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
@@ -223,6 +242,14 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
 	rel = relation_open(relid, AccessShareLock);
 	nblocks = RelationGetNumberOfBlocks(rel);
 
+	/* Check for relation type */
+	if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+		  rel->rd_rel->relkind == RELKIND_MATVIEW))
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or materialized view",
+		RelationGetRelationName(rel;
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -296,6 +323,15 @@ 

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-07 Thread Kyotaro HORIGUCHI
Hello, thank you for updating this tool.

At Mon, 7 Mar 2016 14:03:08 -0500, Robert Haas  wrote in 

> On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada  
> wrote:
> > Attached latest version optimisation patch.
> > I'm still consider regarding pg_upgrade regression test code, so I
> > will submit that patch later.
> 
> I was thinking more about this today and I think that we don't
> actually need the PD_ALL_FROZEN page-level bit for anything.  It's
> enough that the bit is present in the visibility map.  The only point
> of PD_ALL_VISIBLE is that it tells us that we need to clear the
> visibility map bit, but that bit is enough to tell us to clear both
> visibility map bits.  So I propose the attached cleanup patch.

It seems reasonable to me.  Although I haven't played it (or even
it didn't apply for me for now), but at a glance,
PD_VALID_FLAG_BITS seems should be changed to 0x0007 since
PD_ALL_FROZEN has been removed.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-07 Thread Peter Geoghegan
On Mon, Mar 7, 2016 at 4:50 PM, Robert Haas  wrote:
> Here's an updated patch with an API that I think is much more
> reasonable to expose to users, and documentation!  It assumes that the
> patch I posted a few hours ago to remove PD_ALL_FROZEN will be
> accepted; if that falls apart for some reason, I'll update this.  I
> plan to push this RSN if nobody objects.

Thanks for making the effort to make the tool generally available.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-07 Thread Robert Haas
On Sat, Mar 5, 2016 at 9:25 AM, Masahiko Sawada  wrote:
>> I actually think end-users might well want to use it.  Also, I created
>> it by hacking up pg_freespacemap, so it may make sense to have it in
>> the same place.
>> I would also be tempted to add an additional C functions that scan the
>> entire visibility map and return counts of the total number of bits of
>> each type that are set, and similarly for the page level bits.
>> Presumably that would be much faster than
>
> +1.
>
>> I am also tempted to change the API to be a bit more friendly,
>> although I am not sure exactly how.  This was a quick and dirty hack
>> so that I could test, but the hardest thing about making it not a
>> quick and dirty hack is probably deciding on a good UI.
>
> Does it mean visibility map API in visibilitymap.c?

Here's an updated patch with an API that I think is much more
reasonable to expose to users, and documentation!  It assumes that the
patch I posted a few hours ago to remove PD_ALL_FROZEN will be
accepted; if that falls apart for some reason, I'll update this.  I
plan to push this RSN if nobody objects.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/Makefile b/contrib/Makefile
index bd251f6..d12dd63 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -37,6 +37,7 @@ SUBDIRS = \
 		pgcrypto	\
 		pgrowlocks	\
 		pgstattuple	\
+		pg_visibility	\
 		postgres_fdw	\
 		seg		\
 		spi		\
diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile
new file mode 100644
index 000..fbbaa2e
--- /dev/null
+++ b/contrib/pg_visibility/Makefile
@@ -0,0 +1,19 @@
+# contrib/pg_visibility/Makefile
+
+MODULE_big = pg_visibility
+OBJS = pg_visibility.o $(WIN32RES)
+
+EXTENSION = pg_visibility
+DATA = pg_visibility--1.0.sql
+PGFILEDESC = "pg_visibility - page visibility information"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_visibility
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
new file mode 100644
index 000..9616e1f
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.0.sql
@@ -0,0 +1,52 @@
+/* contrib/pg_visibility/pg_visibility--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
+
+-- Show visibility map information.
+CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
+  all_visible OUT boolean,
+  all_frozen OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information.
+CREATE FUNCTION pg_visibility(regclass, blkno, bigint,
+			  all_visible OUT boolean,
+			  all_frozen OUT boolean,
+			  pd_all_visible OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility'
+LANGUAGE C STRICT;
+
+-- Show visibility map information for each block in a relation.
+CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
+  all_visible OUT boolean,
+  all_frozen OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information for each block.
+CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
+			  all_visible OUT boolean,
+			  all_frozen OUT boolean,
+			  pd_all_visible OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_rel'
+LANGUAGE C STRICT;
+
+-- Show summary of visibility map bits for a relation.
+CREATE FUNCTION pg_visibility_map_summary(regclass,
+OUT all_visible bigint, OUT all_frozen bigint)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
+LANGUAGE C STRICT;
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
new file mode 100644
index 000..d4336ce
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -0,0 +1,346 @@
+/*-
+ *
+ * pg_visibility.c
+ *	  display visibility map information and page-level visibility bits
+ *
+ *	  contrib/pg_visibility/pg_visibility.c
+ *-
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include 

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-07 Thread Robert Haas
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada  wrote:
> Attached latest version optimisation patch.
> I'm still consider regarding pg_upgrade regression test code, so I
> will submit that patch later.

I was thinking more about this today and I think that we don't
actually need the PD_ALL_FROZEN page-level bit for anything.  It's
enough that the bit is present in the visibility map.  The only point
of PD_ALL_VISIBLE is that it tells us that we need to clear the
visibility map bit, but that bit is enough to tell us to clear both
visibility map bits.  So I propose the attached cleanup patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8a64321..34ba385 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7855,10 +7855,7 @@ heap_xlog_visible(XLogReaderState *record)
 		 */
 		page = BufferGetPage(buffer);
 
-		if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
-			PageSetAllVisible(page);
-		if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
-			PageSetAllFrozen(page);
+		PageSetAllVisible(page);
 
 		MarkBufferDirty(buffer);
 	}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 2e64fc3..eaab4be 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -39,15 +39,15 @@
  *
  * When we *set* a visibility map during VACUUM, we must write WAL.  This may
  * seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is all-visible or
- * all-frozen we just don't know that for certain.  The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE
- * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility
- * map bit.  If a crash occurs after the visibility map page makes it to disk
- * and before the updated heap page makes it to disk, redo must set the bit on
- * the heap page.  Otherwise, the next insert, update, or delete on the heap
- * page will fail to realize that the visibility map bit must be cleared,
- * possibly causing index-only scans to return wrong answers.
+ * it may still be the case that every tuple on the page is visible to all
+ * transactions; we just don't know that for certain.  The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
+ * on the page itself, and the visibility map bit.  If a crash occurs after the
+ * visibility map page makes it to disk and before the updated heap page makes
+ * it to disk, redo must set the bit on the heap page.  Otherwise, the next
+ * insert, update, or delete on the heap page will fail to realize that the
+ * visibility map bit must be cleared, possibly causing index-only scans to
+ * return wrong answers.
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
@@ -251,11 +251,10 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
  * to InvalidTransactionId when a page that is already all-visible is being
  * marked all-frozen.
  *
- * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
- * bit before calling this function. Except in recovery, caller should also
- * pass the heap buffer and flags which indicates what flag we want to set.
- * When checksums are enabled and we're not in recovery, we must add the heap
- * buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
+ * this function. Except in recovery, caller should also pass the heap
+ * buffer. When checksums are enabled and we're not in recovery, we must add
+ * the heap buffer to the WAL chain to protect it from being torn.
  *
  * You must pass a buffer containing the correct map page to this function.
  * Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -315,10 +314,8 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 {
 	Page		heapPage = BufferGetPage(heapBuf);
 
-	/* Caller is expected to set page-level bits first. */
-	Assert((flags & VISIBILITYMAP_ALL_VISIBLE) == 0 || PageIsAllVisible(heapPage));
-	Assert((flags & VISIBILITYMAP_ALL_FROZEN) == 0 || PageIsAllFrozen(heapPage));
-
+	/* caller is expected to set PD_ALL_VISIBLE first */
+	Assert(PageIsAllVisible(heapPage));
 	PageSetLSN(heapPage, recptr);
 }
 			}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8f7b248..363b2d0 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -766,7 +766,6 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	log_newpage_buffer(buf, true);
 
 

Re: [HACKERS] Freeze avoidance of very large table.

2016-03-07 Thread Masahiko Sawada
On Sat, Mar 5, 2016 at 11:25 PM, Masahiko Sawada  wrote:
> On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas  wrote:
>> On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane  wrote:
>>> Jim Nasby  writes:
 On 3/2/16 4:21 PM, Peter Geoghegan wrote:
> I think you should commit this. The chances of anyone other than you
> and Masahiko recalling that you developed this tool in 3 years is
> essentially nil. I think that the cost of committing a developer-level
> debugging tool like this is very low. Modules like pg_freespacemap
> currently already have no chance of being of use to ordinary users.
> All you need to do is restrict the functions to throw an error when
> called by non-superusers, out of caution.
>
> It's a problem that modules like pg_stat_statements and
> pg_freespacemap are currently lumped together in the documentation,
> but we all know that.
>>>
 +1.
>>>
>>> Would it make any sense to stick it under src/test/modules/ instead of
>>> contrib/ ?  That would help make it clear that it's a debugging tool
>>> and not something we expect end users to use.
>>
>> I actually think end-users might well want to use it.  Also, I created
>> it by hacking up pg_freespacemap, so it may make sense to have it in
>> the same place.
>> I would also be tempted to add an additional C functions that scan the
>> entire visibility map and return counts of the total number of bits of
>> each type that are set, and similarly for the page level bits.
>> Presumably that would be much faster than
>
> +1.
>
>>
>> I am also tempted to change the API to be a bit more friendly,
>> although I am not sure exactly how.  This was a quick and dirty hack
>> so that I could test, but the hardest thing about making it not a
>> quick and dirty hack is probably deciding on a good UI.
>>
>
> Does it mean visibility map API in visibilitymap.c?
>

Attached latest version optimisation patch.
I'm still consider regarding pg_upgrade regression test code, so I
will submit that patch later.

Regards,

--
Masahiko Sawada


000_optimize_vacuum_using_freezemap_v37.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-05 Thread Masahiko Sawada
On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas  wrote:
> On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane  wrote:
>> Jim Nasby  writes:
>>> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
 I think you should commit this. The chances of anyone other than you
 and Masahiko recalling that you developed this tool in 3 years is
 essentially nil. I think that the cost of committing a developer-level
 debugging tool like this is very low. Modules like pg_freespacemap
 currently already have no chance of being of use to ordinary users.
 All you need to do is restrict the functions to throw an error when
 called by non-superusers, out of caution.

 It's a problem that modules like pg_stat_statements and
 pg_freespacemap are currently lumped together in the documentation,
 but we all know that.
>>
>>> +1.
>>
>> Would it make any sense to stick it under src/test/modules/ instead of
>> contrib/ ?  That would help make it clear that it's a debugging tool
>> and not something we expect end users to use.
>
> I actually think end-users might well want to use it.  Also, I created
> it by hacking up pg_freespacemap, so it may make sense to have it in
> the same place.
> I would also be tempted to add an additional C functions that scan the
> entire visibility map and return counts of the total number of bits of
> each type that are set, and similarly for the page level bits.
> Presumably that would be much faster than

+1.

>
> I am also tempted to change the API to be a bit more friendly,
> although I am not sure exactly how.  This was a quick and dirty hack
> so that I could test, but the hardest thing about making it not a
> quick and dirty hack is probably deciding on a good UI.
>

Does it mean visibility map API in visibilitymap.c?

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-04 Thread Robert Haas
On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane  wrote:
> Jim Nasby  writes:
>> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
>>> I think you should commit this. The chances of anyone other than you
>>> and Masahiko recalling that you developed this tool in 3 years is
>>> essentially nil. I think that the cost of committing a developer-level
>>> debugging tool like this is very low. Modules like pg_freespacemap
>>> currently already have no chance of being of use to ordinary users.
>>> All you need to do is restrict the functions to throw an error when
>>> called by non-superusers, out of caution.
>>>
>>> It's a problem that modules like pg_stat_statements and
>>> pg_freespacemap are currently lumped together in the documentation,
>>> but we all know that.
>
>> +1.
>
> Would it make any sense to stick it under src/test/modules/ instead of
> contrib/ ?  That would help make it clear that it's a debugging tool
> and not something we expect end users to use.

I actually think end-users might well want to use it.  Also, I created
it by hacking up pg_freespacemap, so it may make sense to have it in
the same place.

I would also be tempted to add an additional C functions that scan the
entire visibility map and return counts of the total number of bits of
each type that are set, and similarly for the page level bits.
Presumably that would be much faster than

I am also tempted to change the API to be a bit more friendly,
although I am not sure exactly how.  This was a quick and dirty hack
so that I could test, but the hardest thing about making it not a
quick and dirty hack is probably deciding on a good UI.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-02 Thread Kyotaro HORIGUCHI
At Wed, 2 Mar 2016 17:57:27 -0600, Jim Nasby  wrote 
in <56d77de7.7080...@bluetreble.com>
> On 3/2/16 5:41 PM, Tom Lane wrote:
> > Jim Nasby  writes:
> >> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
> >>> I think you should commit this. The chances of anyone other than you
> >>> and Masahiko recalling that you developed this tool in 3 years is
> >>> essentially nil. I think that the cost of committing a developer-level
> >>> debugging tool like this is very low. Modules like pg_freespacemap
> >>> currently already have no chance of being of use to ordinary users.
> >>> All you need to do is restrict the functions to throw an error when
> >>> called by non-superusers, out of caution.
> >>>
> >>> It's a problem that modules like pg_stat_statements and
> >>> pg_freespacemap are currently lumped together in the documentation,
> >>> but we all know that.
> >
> >> +1.
> >
> > Would it make any sense to stick it under src/test/modules/ instead of
> > contrib/ ?  That would help make it clear that it's a debugging tool
> > and not something we expect end users to use.
> 
> I haven't looked at it in detail; is there something inherently
> dangerous about it?

I don't see any danger but the interface doesn't seem to fit use
by DBAs.

> When I'm forced to wear a DBA hat, I'd really love to be able to find
> out what VM status for a large table is. If it's in contrib they'll
> know the tool is there; if it's under src then there's about 0 chance
> of that. I'd think SU-only and any appropriate warnings would be
> enough heads-up for DBAs to be careful with it.

It looks to expose nothing about table contents. At lesast
anybody who can see the name of a table are safely allowable to
use this on it.

A possible usage (for me) of this would be directly couting
(un)vacuumed or (un)freezed pages in a relation. It would be
convenient that the 'freezed' and 'vacuumed' bits are in separate
integers. It would be usable even stats values for these bits are
shown in statistics views.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-02 Thread Jim Nasby

On 3/2/16 5:41 PM, Tom Lane wrote:

Jim Nasby  writes:

On 3/2/16 4:21 PM, Peter Geoghegan wrote:

I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.

It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.



+1.


Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ?  That would help make it clear that it's a debugging tool
and not something we expect end users to use.


I haven't looked at it in detail; is there something inherently 
dangerous about it?


When I'm forced to wear a DBA hat, I'd really love to be able to find 
out what VM status for a large table is. If it's in contrib they'll know 
the tool is there; if it's under src then there's about 0 chance of 
that. I'd think SU-only and any appropriate warnings would be enough 
heads-up for DBAs to be careful with it.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-02 Thread Tom Lane
Jim Nasby  writes:
> On 3/2/16 4:21 PM, Peter Geoghegan wrote:
>> I think you should commit this. The chances of anyone other than you
>> and Masahiko recalling that you developed this tool in 3 years is
>> essentially nil. I think that the cost of committing a developer-level
>> debugging tool like this is very low. Modules like pg_freespacemap
>> currently already have no chance of being of use to ordinary users.
>> All you need to do is restrict the functions to throw an error when
>> called by non-superusers, out of caution.
>> 
>> It's a problem that modules like pg_stat_statements and
>> pg_freespacemap are currently lumped together in the documentation,
>> but we all know that.

> +1.

Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ?  That would help make it clear that it's a debugging tool
and not something we expect end users to use.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-02 Thread Jim Nasby

On 3/2/16 4:21 PM, Peter Geoghegan wrote:

I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.

It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.


+1.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-02 Thread Peter Geoghegan
On Tue, Mar 1, 2016 at 6:51 PM, Robert Haas  wrote:
> I removed the pgstat stuff.  I'm not sure we want that stuff in that
> form; it doesn't seem to fit with the rest of what's in that view, and
> it wasn't reliable in my testing.  I did however throw together a
> little contrib module for testing, which I attach here.  I'm not sure
> we want to commit this, and at the least someone would need to write
> documentation.  But it's certainly handy for checking whether this
> works.

I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.

It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-02 Thread Kyotaro HORIGUCHI
Thank you for revising and commiting this.

At Tue, 1 Mar 2016 21:51:55 -0500, Robert Haas  wrote in 

> On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada  
> wrote:
> > Attached updated 5 patches.
> > I would like to explain these patch shortly again here to make
> > reviewing more easier.
> >
> > We can divided these patches into 2 purposes.
> >
> > 1. Freeze map
> > 000_ patch adds additional frozen bit into visibility map, but doesn't
> > include the logic for improve freezing performance.
> > 001_ patch gets rid of page-conversion code from pg_upgrade. (This
> > patch doesn't related to this feature essentially, but is required by
> > 002_ patch.)
> > 002_ patch adds upgrading mechanism from 9.6- to 9.6+ and its regression 
> > test.
> >
> > 2. Improve freezing logic
> > 003_ patch changes the VACUUM to optimize scans based on freeze map
> > (i.g., 000_ patch), and its regression test.
> > 004_ patch enhances debug messages in 
> > src/backend/access/heap/visibilitymap.c
> >
> > Please review them.
> 
> I have pushed 000 and part of 003, with substantial revisions to the
> 003 part and minor revisions to the 000 part.  This gets the basic
> infrastructure in place, but the vacuum optimization and pg_upgrade
> fixes still need to be done.
> 
> I discovered that make check-world failed with 000 applied, because
> the Assert()s added to visibilitymap_set were using | rather than & to
> test for a set bit.  I fixed that.

It looks reasonable as far as I can see.  Thank you for your
labor for the additional part.

> I revised the code in vacuumlazy.c that updates the new map bits
> rather heavily.  I hope I didn't break anything; please have a look
> and see if you spot any problems.  One big problem was that it's
> inadequate to judge whether a tuple needs freezing just by looking at
> xmin; xmax might need to be cleared, for example.

The new function heap_tuple_needs_eventual_freeze looks
reasonable for me in comparizon with heap_tuple_needs_freeze.

Looking the additional diff for lazy_vacuum_page, I noticed that
visibilitymap_set have a potential performance problem. (Though
it doesn't seem to occur for now.)

visibilitymap_set decides to modify vm bits by the following
code.

|   if (flags = (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
|   {
| START_CRIT_SECTION();
| 
| map[mapByte] |= (flags << mapBit);

This is effectively right and no problem but it runs the critical
section for the case of (vmbit = 11, flags = 01), which does not
need to do so. Please apply this if this looks reasonable.

==
diff --git a/src/backend/access/heap/visibilitymap.c 
b/src/backend/access/heap/visibilitymap.c
index 2e64fc3..87b7fc6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -292,7 +292,8 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer 
heapBuf,
map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
 
-   if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
+   /* modify vm bits only if any bit is necessary to be set  */
+   if (~flags & (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
==

> I removed the pgstat stuff.  I'm not sure we want that stuff in that
> form; it doesn't seem to fit with the rest of what's in that view, and
> it wasn't reliable in my testing.  I did however throw together a
> little contrib module for testing, which I attach here.  I'm not sure
> we want to commit this, and at the least someone would need to write
> documentation.  But it's certainly handy for checking whether this
> works.

I hanven't considered about the reliability but the
n_frozen_pages in the proposed patch surelly seems alien to the
columns around it.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-03-01 Thread Robert Haas
On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada  wrote:
> Attached updated 5 patches.
> I would like to explain these patch shortly again here to make
> reviewing more easier.
>
> We can divided these patches into 2 purposes.
>
> 1. Freeze map
> 000_ patch adds additional frozen bit into visibility map, but doesn't
> include the logic for improve freezing performance.
> 001_ patch gets rid of page-conversion code from pg_upgrade. (This
> patch doesn't related to this feature essentially, but is required by
> 002_ patch.)
> 002_ patch adds upgrading mechanism from 9.6- to 9.6+ and its regression test.
>
> 2. Improve freezing logic
> 003_ patch changes the VACUUM to optimize scans based on freeze map
> (i.g., 000_ patch), and its regression test.
> 004_ patch enhances debug messages in src/backend/access/heap/visibilitymap.c
>
> Please review them.

I have pushed 000 and part of 003, with substantial revisions to the
003 part and minor revisions to the 000 part.  This gets the basic
infrastructure in place, but the vacuum optimization and pg_upgrade
fixes still need to be done.

I discovered that make check-world failed with 000 applied, because
the Assert()s added to visibilitymap_set were using | rather than & to
test for a set bit.  I fixed that.

I revised the code in vacuumlazy.c that updates the new map bits
rather heavily.  I hope I didn't break anything; please have a look
and see if you spot any problems.  One big problem was that it's
inadequate to judge whether a tuple needs freezing just by looking at
xmin; xmax might need to be cleared, for example.

I removed the pgstat stuff.  I'm not sure we want that stuff in that
form; it doesn't seem to fit with the rest of what's in that view, and
it wasn't reliable in my testing.  I did however throw together a
little contrib module for testing, which I attach here.  I'm not sure
we want to commit this, and at the least someone would need to write
documentation.  But it's certainly handy for checking whether this
works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pg_visibilitymap-v1.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-16 Thread Masahiko Sawada
On Wed, Feb 17, 2016 at 4:29 PM, Masahiko Sawada  wrote:
> On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian  wrote:
>> On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
>>> Masahiko Sawada wrote:
>>> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian  wrote:
>>> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>>> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>>> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, 
>>> > >> > and I
>>> > >> > think they expected their closed-source fork of Postgres might need a
>>> > >> > custom page converter someday, but it never needed one, and at this
>>> > >> > point I think having the code in there is just making things more
>>> > >> > complex.  I see _no_ reason for community Postgres to use a plugin
>>> > >> > converter because we are going to need that code for every upgrade 
>>> > >> > from
>>> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need. 
>>> > >> >  We
>>> > >> > can remove it once 9.5 is end-of-life.
>>> > >> >
>>> > >>
>>> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and
>>> > >> page.c at 9.6?
>>> > >
>>> > > Yes.  I can do it if you wish.
>>> >
>>> > I see. I understand that page-converter code would be useful for some
>>> > future cases, but makes thing more complex.
>>>
>>> If we're not going to use it, let's get rid of it right away.  There's
>>> no point in having a feature that adds complexity just because we might
>>> find some hypothetical use of it in a not-yet-imagined future.
>>
>> Agreed.  We can always add it later if we need it.
>>
>
> Attached patch gets rid of page conversion code.
>

Sorry, previous patch is incorrect..
Fixed version patch attached.

Regards,

--
Masahiko Sawada


Remove_page_conversion_from_pg_upgrade_v2.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-16 Thread Masahiko Sawada
On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian  wrote:
> On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
>> Masahiko Sawada wrote:
>> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian  wrote:
>> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and 
>> > >> > I
>> > >> > think they expected their closed-source fork of Postgres might need a
>> > >> > custom page converter someday, but it never needed one, and at this
>> > >> > point I think having the code in there is just making things more
>> > >> > complex.  I see _no_ reason for community Postgres to use a plugin
>> > >> > converter because we are going to need that code for every upgrade 
>> > >> > from
>> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  
>> > >> > We
>> > >> > can remove it once 9.5 is end-of-life.
>> > >> >
>> > >>
>> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and
>> > >> page.c at 9.6?
>> > >
>> > > Yes.  I can do it if you wish.
>> >
>> > I see. I understand that page-converter code would be useful for some
>> > future cases, but makes thing more complex.
>>
>> If we're not going to use it, let's get rid of it right away.  There's
>> no point in having a feature that adds complexity just because we might
>> find some hypothetical use of it in a not-yet-imagined future.
>
> Agreed.  We can always add it later if we need it.
>

Attached patch gets rid of page conversion code.

Regards,

--
Masahiko Sawada


Remove_page_conversion_from_pg_upgrade.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-16 Thread Bruce Momjian
On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
> Masahiko Sawada wrote:
> > On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian  wrote:
> > > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
> > >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
> > >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> > >> > think they expected their closed-source fork of Postgres might need a
> > >> > custom page converter someday, but it never needed one, and at this
> > >> > point I think having the code in there is just making things more
> > >> > complex.  I see _no_ reason for community Postgres to use a plugin
> > >> > converter because we are going to need that code for every upgrade from
> > >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  
> > >> > We
> > >> > can remove it once 9.5 is end-of-life.
> > >> >
> > >>
> > >> Hm, we should rather remove the source code around PAGE_CONVERSION and
> > >> page.c at 9.6?
> > >
> > > Yes.  I can do it if you wish.
> > 
> > I see. I understand that page-converter code would be useful for some
> > future cases, but makes thing more complex.
> 
> If we're not going to use it, let's get rid of it right away.  There's
> no point in having a feature that adds complexity just because we might
> find some hypothetical use of it in a not-yet-imagined future.

Agreed.  We can always add it later if we need it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-16 Thread Alvaro Herrera
Masahiko Sawada wrote:
> On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian  wrote:
> > On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
> >> > I agreed on ripping out the converter plugin ability of pg_upgrade.
> >> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> >> > think they expected their closed-source fork of Postgres might need a
> >> > custom page converter someday, but it never needed one, and at this
> >> > point I think having the code in there is just making things more
> >> > complex.  I see _no_ reason for community Postgres to use a plugin
> >> > converter because we are going to need that code for every upgrade from
> >> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
> >> > can remove it once 9.5 is end-of-life.
> >> >
> >>
> >> Hm, we should rather remove the source code around PAGE_CONVERSION and
> >> page.c at 9.6?
> >
> > Yes.  I can do it if you wish.
> 
> I see. I understand that page-converter code would be useful for some
> future cases, but makes thing more complex.

If we're not going to use it, let's get rid of it right away.  There's
no point in having a feature that adds complexity just because we might
find some hypothetical use of it in a not-yet-imagined future.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-16 Thread Masahiko Sawada
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian  wrote:
> On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
>> > I agreed on ripping out the converter plugin ability of pg_upgrade.
>> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
>> > think they expected their closed-source fork of Postgres might need a
>> > custom page converter someday, but it never needed one, and at this
>> > point I think having the code in there is just making things more
>> > complex.  I see _no_ reason for community Postgres to use a plugin
>> > converter because we are going to need that code for every upgrade from
>> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
>> > can remove it once 9.5 is end-of-life.
>> >
>>
>> Hm, we should rather remove the source code around PAGE_CONVERSION and
>> page.c at 9.6?
>
> Yes.  I can do it if you wish.

I see. I understand that page-converter code would be useful for some
future cases, but makes thing more complex.
So I will post the patch without page-converter If no objection from
other hackers.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-16 Thread Bruce Momjian
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
> > I agreed on ripping out the converter plugin ability of pg_upgrade.
> > Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> > think they expected their closed-source fork of Postgres might need a
> > custom page converter someday, but it never needed one, and at this
> > point I think having the code in there is just making things more
> > complex.  I see _no_ reason for community Postgres to use a plugin
> > converter because we are going to need that code for every upgrade from
> > pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
> > can remove it once 9.5 is end-of-life.
> >
> 
> Hm, we should rather remove the source code around PAGE_CONVERSION and
> page.c at 9.6?

Yes.  I can do it if you wish.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-16 Thread Masahiko Sawada
On Tue, Feb 16, 2016 at 6:13 AM, Bruce Momjian  wrote:
> On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote:
>> > I still agree with this plugin approach, but I felt it's still
>> > complicated a bit, and I'm concerned that patch size has been
>> > increased.
>> > Please give me feedbacks.
>>
>> Yeah, I feel the same. What make it worse, the plugin mechanism
>> will get further complex if we make it more flexible for possible
>> usage as I proposed above. It is apparently too complicated for
>> deciding whether to load *just one*, for now, converter
>> function. And no additional converter is in sight.
>>
>> I incline to pull out all the plugin stuff of pg_upgrade. We are
>> so prudent to make changes of file formats so this kind of events
>> will happen with several-years intervals. The plugin mechanism
>> would be valuable if we are encouraged to change file formats
>> more frequently and freely by providing it, but such situation
>> absolutely introduces more untoward things..
>
> I agreed on ripping out the converter plugin ability of pg_upgrade.
> Remember pg_upgrade was originally written by EnterpriseDB staff, and I
> think they expected their closed-source fork of Postgres might need a
> custom page converter someday, but it never needed one, and at this
> point I think having the code in there is just making things more
> complex.  I see _no_ reason for community Postgres to use a plugin
> converter because we are going to need that code for every upgrade from
> pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
> can remove it once 9.5 is end-of-life.
>

Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-15 Thread Bruce Momjian
On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote:
> > I still agree with this plugin approach, but I felt it's still
> > complicated a bit, and I'm concerned that patch size has been
> > increased.
> > Please give me feedbacks.
> 
> Yeah, I feel the same. What make it worse, the plugin mechanism
> will get further complex if we make it more flexible for possible
> usage as I proposed above. It is apparently too complicated for
> deciding whether to load *just one*, for now, converter
> function. And no additional converter is in sight.
> 
> I incline to pull out all the plugin stuff of pg_upgrade. We are
> so prudent to make changes of file formats so this kind of events
> will happen with several-years intervals. The plugin mechanism
> would be valuable if we are encouraged to change file formats
> more frequently and freely by providing it, but such situation
> absolutely introduces more untoward things..

I agreed on ripping out the converter plugin ability of pg_upgrade. 
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex.  I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need.  We
can remove it once 9.5 is end-of-life.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-14 Thread Robert Haas
On Sun, Feb 14, 2016 at 12:19 AM, Masahiko Sawada  wrote:
> Thank you for reviewing this patch.
> I've divided 000 patch into two patches, and attached latest 4 patches in 
> total.

Thank you!  I'll go through this again as soon as I have a free moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-11 Thread Masahiko Sawada
Thank you for reviewing this patch!

On Wed, Feb 10, 2016 at 4:39 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada  
> wrote in 
>> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera  
>> wrote:
>> > Masahiko Sawada wrote:
>> >> I think we have two options.
>> >>
>> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> >> it and then converts only VM files.
>> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
>> >>
>> >> I think #2 is better. Thought?
>> >
>> > My vote is for #2 as well.  Maybe we just didn't have forks when this
>> > functionality was invented; maybe the author just didn't think hard
>> > enough about what would be the right interface to do it.
>>
>> I've almost wrote up the very rough patch. (it can pass regression test)
>> Windows supporting is not yet, and Makefile is not correct.
>>
>> I've divided the main patch into two patches; add frozen bit patch and
>> pg_upgrade support patch.
>> 000 patch is almost  same as previous code. (includes small fix)
>> 001 patch provides rewriting visibility map as a pageConverter routine.
>> 002 patch is for enhancement debug message in visibilitymap.c
>
> Thanks, it becomes easy to read.
>
>> In order to support pageConvert plugin, I made the following changes.
>> * Main changes
>> - Remove PAGE_CONVERSION
>> - pg_upgrade plugin is located to 'src/bin/pg_upgrade/plugins' directory.
>> - Move directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.
>
> These seem fair.
>
>> - Add new page-converter plugin function for visibility map.
>> - Current code doesn't allow us to use link mode (-k) in the case
>> where page-converter is required.
>>
>>   But I changed it so that if page-converter for fork file is
>> specified, we convert it actually even when link mode.
>>
>> * Interface designe
>> convertFile() and convertPage() are plugin function for main relation
>> file, and these functions are dynamically loaded by
>> loadConvertPlugin().
>
> Though I haven't looked this so closer, loadConverterPlugin looks
> to continue deciding what plugin to load using old and new page
> layout versions. Currently the acutually possible versions are 4
> and if we increment it now, 5.
>
> On the other hand, _vm came at the *catalog version* 201107031
> (9.1 release) and _fsm came at 8.4 release. Both of them are of
> page layout version 4. Are we allowed to increment page layout
> version for this reason? And is this framework under
> reconstruction is flexible enough for this kind of changes in
> future? I don't think so.

Yeah, I also think that page layout version should not be increased by
this layout change of vm.
This patch checks catalog version at first, and then decides what
plugin to load.
In this case, only the format of VM has been changed, so pg_upgrade
loads a plugin for VM and convert them.
pg_upgrade doesn't load for other plugin file, and other files are just copied.

>
>
> We have added _vm and _fsm so far so we must use a version number
> that can determine when _vm, _fsm and _vfm are introduced. I'm
> afraid that it is out of purpose, catalog version seems to be
> most usable, it is already used to know when the crash safe VM
> has been introduced.
>
> Using the catalog version, the plugin we provide first would be
> converteLayout_201105231_201602071.so which has only a converter
> from _vm to _vfm. This plugin is loaded for the combination of
> the source cluster with the catalog version of 201105231(when VM
> has been introduced) or later and the destination cluster with
> that *before* 201602071 (this version).
>
> If we change the format of fsm (vm no longer exists), we would
> have a new plugin maybe named
> converteLayout_200904091_2017x.so which has an, maybe,
> inplace file converter for fsm. It will be loaded when a source
> database is of the catalog version of 200904091(when FSM has been
> introduced) or later and a destination before 2017x(the
> version). Catalog version seems to work fine.

I think that it's not good idea to use catalog version for plugin name.
Because, even if catalog version is used for plugin file name as you
suggested, pg_upgrade still needs to decide what plugin name to load
by itself.
Also, the plugin file having catalog version would not easy to
understand what plugin does actually. It's not developer friendly.
The advanteage of using page layout version as plugin name is that
pg_upgrade can decide automatically which plugin should be loaded.

>
> So far, I assumed that we can safely assume that the name of
> files to be converted is [fork_name] so the possible types
> of conversions would be the following.
>
>  - per-page conversion
>  - per-file conversion between the files with the same fork name.

Re: [HACKERS] Freeze avoidance of very large table.

2016-02-11 Thread Robert Haas
On Wed, Feb 3, 2016 at 12:32 PM, Masahiko Sawada  wrote:
> I've divided the main patch into two patches; add frozen bit patch and
> pg_upgrade support patch.
> 000 patch is almost  same as previous code. (includes small fix)
> 001 patch provides rewriting visibility map as a pageConverter routine.
> 002 patch is for enhancement debug message in visibilitymap.c

I'd like to suggest splitting 000 into two patches.  The first one
would change the format of the visibility map, and the second one
would change VACUUM to optimize scans based on the new format.  I
think that would make it easier to get this reviewed and committed.

I think this patch churns a bunch of things that don't really need to
be churned.  For example, consider this hunk:

 /*
  * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O.  That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O.  That's a bit unfortunate, but hopefully shouldn't happen often.
  */

Since the page can't become all-frozen without also becoming
all-visible, the original text is still 100% accurate, and the change
doesn't seem to add any useful clarity.  Let's think about which
things really need to be changed and not just mechanically change
everything.

-Assert(PageIsAllVisible(heapPage));
+/*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) &&
PageIsAllVisible(heapPage)) ||
+   ((flags | VISIBILITYMAP_ALL_FROZEN) &&
PageIsAllFrozen(heapPage)));

I think this would be more clear as two separate assertions.

Your 000 patch has a little bit of whitespace damage:

[rhaas pgsql]$ git diff --check
src/backend/commands/vacuumlazy.c:1951: indent with spaces.
+bool *all_visible, bool
*all_frozen)
src/include/access/heapam_xlog.h:393: indent with spaces.
+Buffer vm_buffer, TransactionId
cutoff_xid, uint8 flags);

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-09 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada  
wrote in 
> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera  
> wrote:
> > Masahiko Sawada wrote:
> >> I think we have two options.
> >>
> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> >> it and then converts only VM files.
> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
> >>
> >> I think #2 is better. Thought?
> >
> > My vote is for #2 as well.  Maybe we just didn't have forks when this
> > functionality was invented; maybe the author just didn't think hard
> > enough about what would be the right interface to do it.
> 
> I've almost wrote up the very rough patch. (it can pass regression test)
> Windows supporting is not yet, and Makefile is not correct.
> 
> I've divided the main patch into two patches; add frozen bit patch and
> pg_upgrade support patch.
> 000 patch is almost  same as previous code. (includes small fix)
> 001 patch provides rewriting visibility map as a pageConverter routine.
> 002 patch is for enhancement debug message in visibilitymap.c

Thanks, it becomes easy to read.

> In order to support pageConvert plugin, I made the following changes.
> * Main changes
> - Remove PAGE_CONVERSION
> - pg_upgrade plugin is located to 'src/bin/pg_upgrade/plugins' directory.
> - Move directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.

These seem fair.

> - Add new page-converter plugin function for visibility map.
> - Current code doesn't allow us to use link mode (-k) in the case
> where page-converter is required.
> 
>   But I changed it so that if page-converter for fork file is
> specified, we convert it actually even when link mode.
> 
> * Interface designe
> convertFile() and convertPage() are plugin function for main relation
> file, and these functions are dynamically loaded by
> loadConvertPlugin().

Though I haven't looked this so closer, loadConverterPlugin looks
to continue deciding what plugin to load using old and new page
layout versions. Currently the acutually possible versions are 4
and if we increment it now, 5.

On the other hand, _vm came at the *catalog version* 201107031
(9.1 release) and _fsm came at 8.4 release. Both of them are of
page layout version 4. Are we allowed to increment page layout
version for this reason? And is this framework under
reconstruction is flexible enough for this kind of changes in
future? I don't think so.


We have added _vm and _fsm so far so we must use a version number
that can determine when _vm, _fsm and _vfm are introduced. I'm
afraid that it is out of purpose, catalog version seems to be
most usable, it is already used to know when the crash safe VM
has been introduced.

Using the catalog version, the plugin we provide first would be
converteLayout_201105231_201602071.so which has only a converter
from _vm to _vfm. This plugin is loaded for the combination of
the source cluster with the catalog version of 201105231(when VM
has been introduced) or later and the destination cluster with
that *before* 201602071 (this version).

If we change the format of fsm (vm no longer exists), we would
have a new plugin maybe named
converteLayout_200904091_2017x.so which has an, maybe,
inplace file converter for fsm. It will be loaded when a source
database is of the catalog version of 200904091(when FSM has been
introduced) or later and a destination before 2017x(the
version). Catalog version seems to work fine.


So far, I assumed that we can safely assume that the name of
files to be converted is [fork_name] so the possible types
of conversions would be the following.

 - per-page conversion
 - per-file conversion between the files with the same fork name.
 - per-file conversion between the files with different fork names.

Since the plugin filename doesn't tell such things, they should
be told by the plugin itself. So a plugin is to provide the
following interface,

typedef struct ConverterTable
{
  char *src_fork_name;
  char *dst_fork_name;
  FileConverterFunc file_conveter;
  PageConverterFunc page_conveter;
} ConverterTable[];

Following such name convention of plugins, we may load multiple
plugins at once, so we collect all entries of the table of all
loaded plugins and check if any src_fork_name from them don't
duplicate.

Here, we have sufficient information to choose what conveter to
invoke and execute conversion like this.

  for (fork_name in all_fork_names_including_"" )
  {
find a converter comparing fork_name with src_fork_name.
check dst_fork_name and rename the target file if needed.
invoke the converter.
  }

If we need to convert clogs or similars and need to prepare for
such events, the ConverterTable might have an additional member
and change the meaning of some of existing members.

typedef struct ConverterTable

Re: [HACKERS] Freeze avoidance of very large table.

2016-02-03 Thread Masahiko Sawada
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera  wrote:
> Masahiko Sawada wrote:
>
>> I misunderstood. Sorry for noise.
>> I agree with adding conversion method as a pageConverter routine.
>
> \o/
>
>> This patch doesn't change page layout actually, but pageConverter
>> routine checks only the page layout.
>> And we have to plugin named convertLayout_X_to_Y.
>>
>> I think we have two options.
>>
>> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> it and then converts only VM files.
>> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> conversion plugins (e.g., convertLayout_vm_to_vfm)
>>
>> I think #2 is better. Thought?
>
> My vote is for #2 as well.  Maybe we just didn't have forks when this
> functionality was invented; maybe the author just didn't think hard
> enough about what would be the right interface to do it.

I've almost wrote up the very rough patch. (it can pass regression test)
Windows supporting is not yet, and Makefile is not correct.

I've divided the main patch into two patches; add frozen bit patch and
pg_upgrade support patch.
000 patch is almost  same as previous code. (includes small fix)
001 patch provides rewriting visibility map as a pageConverter routine.
002 patch is for enhancement debug message in visibilitymap.c

In order to support pageConvert plugin, I made the following changes.
* Main changes
- Remove PAGE_CONVERSION
- pg_upgrade plugin is located to 'src/bin/pg_upgrade/plugins' directory.
- Move directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.
- Add new page-converter plugin function for visibility map.
- Current code doesn't allow us to use link mode (-k) in the case
where page-converter is required.
  But I changed it so that if page-converter for fork file is
specified, we convert it actually even when link mode.

* Interface designe
convertFile() and convertPage() are plugin function for main relation
file, and these functions are dynamically loaded by
loadConvertPlugin().
I added a new pageConvert plugin function converVMFile() for
visibility map (fork file).
If layout of CLOG, FSM or etc will be changed in the future, we could
expand some new pageConvert plugin functions like convertCLOGFile() or
convertFSMFile(), and these functions are dynamically loaded by
loadAdditionalConvertPlugin().
It means that main file and other fork file conversion are executed
independently, and conversion for fork file are executed even if link
mode is specified.
Each conversion plugin is loaded and used only when it's required.

I still agree with this plugin approach, but I felt it's still
complicated a bit, and I'm concerned that patch size has been
increased.
Please give me feedbacks.
If there are not objections about this, I'm going to spend time to improve it.

Regards,

--
Masahiko Sawada
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 001988b..5d08c73 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
 		 * If the page has only visible tuples, then we can find out the free
 		 * space from the FSM and move on.
 		 */
-		if (visibilitymap_test(rel, blkno, ))
+		if (VM_ALL_VISIBLE(rel, blkno, ))
 		{
 			freespace = GetRecordedFreeSpace(rel, blkno);
 			stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an eager freezing if the table's
 pg_class.relfrozenxid field has reached
 the age specified by this setting.  The default is 150 million
 transactions.  Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an eager freezing if the table's
 pg_class.relminmxid field has reached
 the age specified by this setting.  The default is 150 million multixacts.
 Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
 Vacuum maintains a visibility map for each
 table to keep track of which pages contain only tuples that are known to be
 visible to all active transactions (and all future transactions, until the
-page is again modified).  This has two purposes.  First, vacuum
-itself can skip such pages on the next run, since there is nothing to
-clean 

Re: [HACKERS] Freeze avoidance of very large table.

2016-02-02 Thread Masahiko Sawada
On Tue, Feb 2, 2016 at 11:42 AM, Masahiko Sawada  wrote:
> On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby  wrote:
>> On 2/1/16 4:59 PM, Alvaro Herrera wrote:
>>>
>>> Masahiko Sawada wrote:
>>>
 Attached updated version patch.
 Please review it.
>>>
>>>
>>> In pg_upgrade, the "page convert" functionality is there to abstract
>>> rewrites of pages being copied; your patch is circumventing it and
>>> AFAICS it makes the interface more complicated for no good reason.  I
>>> think the real way to do that is to write your rewriteVisibilityMap as a
>>> pageConverter routine.  That should reduce some duplication there.
>>
>
> This means that user always have to set pageConverter plug-in when upgrading?
> I was thinking that this conversion is required for all user who wants
> to upgrade to 9.6, so we should support it in core, not as a plug-in.

I misunderstood. Sorry for noise.
I agree with adding conversion method as a pageConverter routine.

This patch doesn't change page layout actually, but pageConverter
routine checks only the page layout.
And we have to plugin named convertLayout_X_to_Y.

I think we have two options.

1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name
conversion plugins (e.g., convertLayout_vm_to_vfm)

I think #2 is better. Thought?

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-02 Thread Alvaro Herrera
Masahiko Sawada wrote:

> I misunderstood. Sorry for noise.
> I agree with adding conversion method as a pageConverter routine.

\o/

> This patch doesn't change page layout actually, but pageConverter
> routine checks only the page layout.
> And we have to plugin named convertLayout_X_to_Y.
> 
> I think we have two options.
> 
> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> it and then converts only VM files.
> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> conversion plugins (e.g., convertLayout_vm_to_vfm)
> 
> I think #2 is better. Thought?

My vote is for #2 as well.  Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-02 Thread Masahiko Sawada
On Tue, Feb 2, 2016 at 10:05 PM, Alvaro Herrera
 wrote:
> This patch has gotten its fair share of feedback in this fest.  I moved
> it to the next commitfest.  Please do keep working on it and reviewers
> that have additional comments are welcome.

Thanks!

On Tue, Feb 2, 2016 at 8:59 PM, Kyotaro HORIGUCHI
 wrote:
> Since the destination version is fixed, the advantage of the
> plugin mechanism for pg_upgade would be capability to choose a
> plugin to load according to some characteristics of the source
> database. What do you think the trigger characteristics for
> convertLayoutVM_add_frozenbit.so or similars? If it is hard-coded
> like what transfer_single_new_db is doing for fsm and vm, I
> suppose the module is not necessary to be a plugin.

Sorry, I couldn't get it.
You meant that we should use rewriteVisibilityMap as a function (not
dynamically load)?
The destination version is not fixed, it depends on new cluster version.
I'm planning that convertLayoutVM_add_frozenbit.so is dynamically
loaded and used only when rewriting of VM is required.
If layout of VM will be changed again in the future, we could add
other libraries for convert

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-02 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 2 Feb 2016 20:25:23 +0900, Masahiko Sawada  
wrote in 

Re: [HACKERS] Freeze avoidance of very large table.

2016-02-02 Thread Masahiko Sawada
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera  wrote:
> Masahiko Sawada wrote:
>
>> I misunderstood. Sorry for noise.
>> I agree with adding conversion method as a pageConverter routine.
>
> \o/
>
>> This patch doesn't change page layout actually, but pageConverter
>> routine checks only the page layout.
>> And we have to plugin named convertLayout_X_to_Y.
>>
>> I think we have two options.
>>
>> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
>> it and then converts only VM files.
>> 2. Change pg_upgrade plugin mechanism so that it can handle other name
>> conversion plugins (e.g., convertLayout_vm_to_vfm)
>>
>> I think #2 is better. Thought?
>
> My vote is for #2 as well.  Maybe we just didn't have forks when this
> functionality was invented; maybe the author just didn't think hard
> enough about what would be the right interface to do it.

Thanks.

I'm planning to change as follows.
- pageCnvCtx struct has new function pointer convertVMFile().
  If the layout of other relation such as FSM, CLOG in the future, we
could add convertFSMFile() and convertCLOGFile().
- Create new library convertLayoutVM_add_frozenbit.c that has
convertVMFile() function which converts only visibilitymap.
  When rewriting of VM is required, convertLayoutVM_add_frozenbit.so
is dynamically loaded.
  convertLayout_X_to_Y converts other relation files.
  That is, converting VM and converting other relations are independently done.
- Current plugin mechanism puts conversion library (*.so) into
${bin}/plugins (i.g., new plugin directory is required), but I'm
thinking to puts it into ${libdir}.

Please give me feedbacks.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-02 Thread Alvaro Herrera
This patch has gotten its fair share of feedback in this fest.  I moved
it to the next commitfest.  Please do keep working on it and reviewers
that have additional comments are welcome.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-01 Thread Alvaro Herrera
Masahiko Sawada wrote:

> Attached updated version patch.
> Please review it.

In pg_upgrade, the "page convert" functionality is there to abstract
rewrites of pages being copied; your patch is circumventing it and
AFAICS it makes the interface more complicated for no good reason.  I
think the real way to do that is to write your rewriteVisibilityMap as a
pageConverter routine.  That should reduce some duplication there.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-01 Thread Masahiko Sawada
On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby  wrote:
> On 2/1/16 4:59 PM, Alvaro Herrera wrote:
>>
>> Masahiko Sawada wrote:
>>
>>> Attached updated version patch.
>>> Please review it.
>>
>>
>> In pg_upgrade, the "page convert" functionality is there to abstract
>> rewrites of pages being copied; your patch is circumventing it and
>> AFAICS it makes the interface more complicated for no good reason.  I
>> think the real way to do that is to write your rewriteVisibilityMap as a
>> pageConverter routine.  That should reduce some duplication there.
>

This means that user always have to set pageConverter plug-in when upgrading?
I was thinking that this conversion is required for all user who wants
to upgrade to 9.6, so we should support it in core, not as a plug-in.

>
> IIRC this is about the third problem that's been found with pg_upgrade in
> this patch. That concerns me given the potential for disaster if freeze bits
> are set incorrectly.

Yeah, I tend to have diagnostic tool for visibilitymap, and to exactly
compare VM between old one and new one after upgrading postgres
server.
I've implemented a such tool that is in my github repository[1].
I'm thinking to add this tool into core(e.g., pg_upgrade directory,
not contrib module) as testing function.

[1] https://github.com/MasahikoSawada/pg_visibilitymap

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-02-01 Thread Jim Nasby

On 2/1/16 4:59 PM, Alvaro Herrera wrote:

Masahiko Sawada wrote:


Attached updated version patch.
Please review it.


In pg_upgrade, the "page convert" functionality is there to abstract
rewrites of pages being copied; your patch is circumventing it and
AFAICS it makes the interface more complicated for no good reason.  I
think the real way to do that is to write your rewriteVisibilityMap as a
pageConverter routine.  That should reduce some duplication there.


IIRC this is about the third problem that's been found with pg_upgrade 
in this patch. That concerns me given the potential for disaster if 
freeze bits are set incorrectly.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-01-17 Thread Masahiko Sawada
On Wed, Jan 13, 2016 at 12:16 AM, Masahiko Sawada  wrote:
> On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada  
> wrote:
>> On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas  wrote:
>>> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
>>>  wrote:
 Hello,

 At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas  
 wrote in 
 
> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>  wrote:
> > I am not really getting the meaning of this sentence. Shouldn't this
> > be reworded something like:
> > "Freezing occurs on the whole table once all pages of this relation 
> > require it."
>
> That statement isn't remotely true, and I don't think this patch
> changes that.  Freezing occurs on the whole table once relfrozenxid is
> old enough that we think there might be at least one page in the table
> that requires it.

 I doubt I can explain this accurately, but I took the original
 phrase as that if and only if all pages of the table are marked
 as "requires freezing" by accident, all pages are frozen. It's
 quite obvious but it is what I think "happen to require freezing"
 means. Does this make sense?

 The phrase might not be necessary if this is correct.
>>>
>>> Maybe you are trying to say something like "only those pages which
>>> require freezing are frozen?".
>>>
>>
>> I was thinking the same as what Horiguchi-san said.
>> That is, even if relfrozenxid is old enough, freezing on the whole
>> table is not required if the table are marked as "not requires
>> freezing".
>> In other word, only those pages which are marked as "not frozen" are frozen.
>>
>
> The recently changes to HEAD conflicts with freeze map patch, so I've
> updated and attached latest freeze map patch.
> The another patch that enhances the debug log message of visibilitymap
> is attached to previous mail.
> .
>
> Please review it.
>

Attached updated version patch.
Please review it.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v33.patch
Description: binary/octet-stream


001_enhance_visibilitymap_debug_messages_v1.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2016-01-12 Thread Masahiko Sawada
On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada  wrote:
> On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas  wrote:
>> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
>>  wrote:
>>> Hello,
>>>
>>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas  
>>> wrote in 
>>> 
 On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
  wrote:
 > I am not really getting the meaning of this sentence. Shouldn't this
 > be reworded something like:
 > "Freezing occurs on the whole table once all pages of this relation 
 > require it."

 That statement isn't remotely true, and I don't think this patch
 changes that.  Freezing occurs on the whole table once relfrozenxid is
 old enough that we think there might be at least one page in the table
 that requires it.
>>>
>>> I doubt I can explain this accurately, but I took the original
>>> phrase as that if and only if all pages of the table are marked
>>> as "requires freezing" by accident, all pages are frozen. It's
>>> quite obvious but it is what I think "happen to require freezing"
>>> means. Does this make sense?
>>>
>>> The phrase might not be necessary if this is correct.
>>
>> Maybe you are trying to say something like "only those pages which
>> require freezing are frozen?".
>>
>
> I was thinking the same as what Horiguchi-san said.
> That is, even if relfrozenxid is old enough, freezing on the whole
> table is not required if the table are marked as "not requires
> freezing".
> In other word, only those pages which are marked as "not frozen" are frozen.
>

The recently changes to HEAD conflicts with freeze map patch, so I've
updated and attached latest freeze map patch.
The another patch that enhances the debug log message of visibilitymap
is attached to previous mail.
.

Please review it.

Regards,

--
Masahiko Sawada
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 001988b..5d08c73 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
 		 * If the page has only visible tuples, then we can find out the free
 		 * space from the FSM and move on.
 		 */
-		if (visibilitymap_test(rel, blkno, ))
+		if (VM_ALL_VISIBLE(rel, blkno, ))
 		{
 			freespace = GetRecordedFreeSpace(rel, blkno);
 			stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an eager freezing if the table's
 pg_class.relfrozenxid field has reached
 the age specified by this setting.  The default is 150 million
 transactions.  Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an eager freezing if the table's
 pg_class.relminmxid field has reached
 the age specified by this setting.  The default is 150 million multixacts.
 Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
 Vacuum maintains a visibility map for each
 table to keep track of which pages contain only tuples that are known to be
 visible to all active transactions (and all future transactions, until the
-page is again modified).  This has two purposes.  First, vacuum
-itself can skip such pages on the next run, since there is nothing to
-clean up.
+page is again modified), and pages contain only frozen tuples.
+This has two purposes.  First, vacuum itself can skip such pages
+on the next run, since there is nothing to clean up.

 

@@ -438,28 +438,25 @@

 

-VACUUM normally skips pages that don't have any dead row
-versions, but those pages might still have row versions with old XID
-values.  To ensure all old row versions have been frozen, a
-scan of the whole table is needed.
+VACUUM skips pages that don't have any dead row
+versions, and pages that have only frozen rows. To ensure all old
+row versions have been frozen, a scan of all unfrozen pages is needed.
  controls when
-VACUUM does that: a 

Re: [HACKERS] Freeze avoidance of very large table.

2015-12-28 Thread Masahiko Sawada
On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas  wrote:
> On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
>  wrote:
>> Hello,
>>
>> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas  
>> wrote in 
>>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>>  wrote:
>>> > I am not really getting the meaning of this sentence. Shouldn't this
>>> > be reworded something like:
>>> > "Freezing occurs on the whole table once all pages of this relation 
>>> > require it."
>>>
>>> That statement isn't remotely true, and I don't think this patch
>>> changes that.  Freezing occurs on the whole table once relfrozenxid is
>>> old enough that we think there might be at least one page in the table
>>> that requires it.
>>
>> I doubt I can explain this accurately, but I took the original
>> phrase as that if and only if all pages of the table are marked
>> as "requires freezing" by accident, all pages are frozen. It's
>> quite obvious but it is what I think "happen to require freezing"
>> means. Does this make sense?
>>
>> The phrase might not be necessary if this is correct.
>
> Maybe you are trying to say something like "only those pages which
> require freezing are frozen?".
>

I was thinking the same as what Horiguchi-san said.
That is, even if relfrozenxid is old enough, freezing on the whole
table is not required if the table are marked as "not requires
freezing".
In other word, only those pages which are marked as "not frozen" are frozen.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-22 Thread Bruce Momjian
On Thu, Dec 17, 2015 at 06:44:46AM +, Simon Riggs wrote:
> >> Thank you for having a look.
> >
> > I would not bother mentioning this detail in the pg_upgrade manual page:
> >
> > +   Since the format of visibility map has been changed in version 9.6,
> > +   pg_upgrade creates and rewrite new '_vm' literal>
> > +   file even if upgrading from 9.5 or before to 9.6 or later with link
> mode (-k).
> 
> Really?  I know we don't always document things like this, but it
> seems like a good idea to me that we do so.
> 
> 
> Agreed.
> 
> For me, rewriting the visibility map is a new data loss bug waiting to happen.
> I am worried that the group is not taking seriously the potential for
> catastrophe here. I think we can do it, but I think it needs these things
> 
> * Clear notice that it is happening unconditionally and unavoidably
> * Log files showing it has happened, action by action
> * Very clear mechanism for resolving an incomplete or interrupted upgrade
> process. Which VMs got upgraded? Which didn't?
> * Ability to undo an upgrade attempt, somehow, ideally automatically by 
> default
> * Ability to restart a failed upgrade attempt without doing a "double 
> upgrade",
> i.e. ensure transformation is immutable

Have you forgotten how pg_upgrade works?  This new vm file is only
created on the new cluster, which is not usable if pg_upgrade doesn't
complete successfully.  pg_upgrade never modifies the old cluster.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-21 Thread Kyotaro HORIGUCHI
Hello,

At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas  wrote 
in 
> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>  wrote:
> > I am not really getting the meaning of this sentence. Shouldn't this
> > be reworded something like:
> > "Freezing occurs on the whole table once all pages of this relation require 
> > it."
> 
> That statement isn't remotely true, and I don't think this patch
> changes that.  Freezing occurs on the whole table once relfrozenxid is
> old enough that we think there might be at least one page in the table
> that requires it.

I doubt I can explain this accurately, but I took the original
phrase as that if and only if all pages of the table are marked
as "requires freezing" by accident, all pages are frozen. It's
quite obvious but it is what I think "happen to require freezing"
means. Does this make sense?

The phrase might not be necessary if this is correct.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-21 Thread Robert Haas
On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas  wrote 
> in 
>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>  wrote:
>> > I am not really getting the meaning of this sentence. Shouldn't this
>> > be reworded something like:
>> > "Freezing occurs on the whole table once all pages of this relation 
>> > require it."
>>
>> That statement isn't remotely true, and I don't think this patch
>> changes that.  Freezing occurs on the whole table once relfrozenxid is
>> old enough that we think there might be at least one page in the table
>> that requires it.
>
> I doubt I can explain this accurately, but I took the original
> phrase as that if and only if all pages of the table are marked
> as "requires freezing" by accident, all pages are frozen. It's
> quite obvious but it is what I think "happen to require freezing"
> means. Does this make sense?
>
> The phrase might not be necessary if this is correct.

Maybe you are trying to say something like "only those pages which
require freezing are frozen?".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-18 Thread Robert Haas
On Thu, Dec 17, 2015 at 2:26 AM, Andres Freund  wrote:
> On 2015-12-17 16:22:24 +0900, Michael Paquier wrote:
>> On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund  wrote:
>> > On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
>> >> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs  
>> >> wrote:
>> >> > For me, rewriting the visibility map is a new data loss bug waiting to
>> >> > happen. I am worried that the group is not taking seriously the 
>> >> > potential
>> >> > for catastrophe here.
>> >>
>> >> FWIW, I'm following this line and merging the vm file into a single
>> >> unit looks like a ticking bomb.
>> >
>> > And what are those risks?
>>
>> Incorrect vm file rewrite after a pg_upgrade run.
>
> If we can't manage to rewrite a file, replacing a binary b1 with a b10,
> then we shouldn't be working on a database. And if we screw up, recovery
> i is an rm *_vm away. I can't imagine that this is going to be the
> actually complicated part of this feature.

Yeah.  If that part of this feature isn't right, the chances that the
rest of the patch are robust enough to commit seem extremely low.
That is, as Andres says, not the hard part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-18 Thread Robert Haas
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
 wrote:
> I am not really getting the meaning of this sentence. Shouldn't this
> be reworded something like:
> "Freezing occurs on the whole table once all pages of this relation require 
> it."

That statement isn't remotely true, and I don't think this patch
changes that.  Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-17 Thread Masahiko Sawada
On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier
 wrote:
> On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas  wrote:
>> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian  wrote:
>>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
 On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
 > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
 > wrote:
 >>
 >> Yeah, we need to consider to compute checksum if enabled.
 >> I've changed the patch, and attached.
 >> Please review it.
 >
 > Thanks for the update.  This now conflicts with the updates doesn to
 > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
 > conflict in order to do some testing, but I'd like to get an updated
 > patch from the author in case I did it wrong.  I don't want to find
 > bugs that I just introduced myself.
 >

 Thank you for having a look.
>>>
>>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>>
>>> +   Since the format of visibility map has been changed in version 9.6,
>>> +   pg_upgrade creates and rewrite new 
>>> '_vm'
>>> +   file even if upgrading from 9.5 or before to 9.6 or later with link 
>>> mode (-k).
>>
>> Really?  I know we don't always document things like this, but it
>> seems like a good idea to me that we do so.
>
> Just going though v30...
>
> +frozen. The whole-table freezing is occuerred only when all pages happen 
> to
> +require freezing to freeze rows. In other cases such as where
>
> I am not really getting the meaning of this sentence. Shouldn't this
> be reworded something like:
> "Freezing occurs on the whole table once all pages of this relation require 
> it."
>
> +relfrozenxid is more than
> vacuum_freeze_table_age
> +transcations old, where VACUUM's FREEZE
> option is used,
> +VACUUM can skip the pages that all tuples on the page
> itself are
> +marked as frozen.
> +When all pages of table are eventually marked as frozen by
> VACUUM,
> +after it's finished age(relfrozenxid) should be a little more
> +than the vacuum_freeze_min_age setting that was used (more by
> +the number of transcations started since the VACUUM started).
> +If the advancing of relfrozenxid is not happend until
> +autovacuum_freeze_max_age is reached, an autovacuum will soon
> +be forced for the table.
>
> s/transcations/transactions.
>
> + n_frozen_page
> + integer
> + Number of frozen pages
> n_frozen_pages?
>
> make check with pg_upgrade is failing for me:
> Visibility map rewriting test failed
> make: *** [check] Error 1

make check with pg_upgrade is done successfully on my environment.
Could you give me more information about this?

> -GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
> +GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
> This looks like an unrelated change.
>
>   * Clearing a visibility map bit is not separately WAL-logged.  The callers
>   * must make sure that whenever a bit is cleared, the bit is cleared on WAL
> - * replay of the updating operation as well.
> + * replay of the updating operation as well.  And all-frozen bit must be
> + * cleared with all-visible at the same time.
> This could be reformulated. This is just an addition on top of the
> existing paragraph.
>
> + * The visibility map has the all-frozen bit which indicates all tuples on
> + * corresponding page has been completely frozen, so the visibility map is 
> also
> "have been completely frozen".
>
> -/* Mapping from heap block number to the right bit in the visibility map */
> -#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
> -#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) /
> HEAPBLOCKS_PER_BYTE)
> Those two declarations are just noise in the patch: those definitions
> are unchanged.
>
> -   elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
> +   elog(DEBUG1, "vm_clear %s block %d",
> RelationGetRelationName(rel), heapBlk);
> This may be better as a separate patch.

I've attached 001 patch for this separately.

>
> -visibilitymap_count(Relation rel)
> +visibilitymap_count(Relation rel, BlockNumber *all_frozen)
> I think that this routine would gain in clarity if reworked as follows:
> visibilitymap_count(Relation rel, BlockNumber *all_visible,
> BlockNumber *all_frozen)
>
> +   /*
> +* Report ANALYZE to the stats collector, too.
> However, if doing
> +* inherited stats we shouldn't report, because the
> stats collector only
> +* tracks per-table stats.
> +*/
> +   if (!inh)
> +   pgstat_report_analyze(onerel, totalrows,
> totaldeadrows, relallfrozen);
> Here we already know that this is working in the non-inherited code
> path. As a whole, 

Re: [HACKERS] Freeze avoidance of very large table.

2015-12-17 Thread Michael Paquier
On Fri, Dec 18, 2015 at 3:17 AM, Masahiko Sawada  wrote:
> On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier  
> wrote:
>> make check with pg_upgrade is failing for me:
>> Visibility map rewriting test failed
>> make: *** [check] Error 1
>
> make check with pg_upgrade is done successfully on my environment.
> Could you give me more information about this?

Oh, well I see now after digging into your code. You are missing -X
when running psql, and until recently psql -c implied -X all the time.
The reason why it failed for me is that I have \timing enabled in
psqlrc.

Actually test.sh needs to be fixed as well, see the attached, this is
a separate bug. Could a kind committer look at that if this is
acceptable?

>> Sawada-san, are you planning to continue working on that? At this
>> stage it seems that this patch is not in commitable state and should
>> at best be moved to next CF, or at worst returned with feedback.
>
> Yes, of course.
> This patch should be marked as "Move to next CF".

OK, done so.
-- 
Michael


pgupgrade-fix.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-16 Thread Michael Paquier
On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas  wrote:
> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian  wrote:
>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
>>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
>>> > wrote:
>>> >>
>>> >> Yeah, we need to consider to compute checksum if enabled.
>>> >> I've changed the patch, and attached.
>>> >> Please review it.
>>> >
>>> > Thanks for the update.  This now conflicts with the updates doesn to
>>> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>>> > conflict in order to do some testing, but I'd like to get an updated
>>> > patch from the author in case I did it wrong.  I don't want to find
>>> > bugs that I just introduced myself.
>>> >
>>>
>>> Thank you for having a look.
>>
>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>
>> +   Since the format of visibility map has been changed in version 9.6,
>> +   pg_upgrade creates and rewrite new 
>> '_vm'
>> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode 
>> (-k).
>
> Really?  I know we don't always document things like this, but it
> seems like a good idea to me that we do so.

Just going though v30...

+frozen. The whole-table freezing is occuerred only when all pages happen to
+require freezing to freeze rows. In other cases such as where

I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."

+relfrozenxid is more than
vacuum_freeze_table_age
+transcations old, where VACUUM's FREEZE
option is used,
+VACUUM can skip the pages that all tuples on the page
itself are
+marked as frozen.
+When all pages of table are eventually marked as frozen by
VACUUM,
+after it's finished age(relfrozenxid) should be a little more
+than the vacuum_freeze_min_age setting that was used (more by
+the number of transcations started since the VACUUM started).
+If the advancing of relfrozenxid is not happend until
+autovacuum_freeze_max_age is reached, an autovacuum will soon
+be forced for the table.

s/transcations/transactions.

+ n_frozen_page
+ integer
+ Number of frozen pages
n_frozen_pages?

make check with pg_upgrade is failing for me:
Visibility map rewriting test failed
make: *** [check] Error 1

-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
This looks like an unrelated change.

  * Clearing a visibility map bit is not separately WAL-logged.  The callers
  * must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well.  And all-frozen bit must be
+ * cleared with all-visible at the same time.
This could be reformulated. This is just an addition on top of the
existing paragraph.

+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page has been completely frozen, so the visibility map is also
"have been completely frozen".

-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) /
HEAPBLOCKS_PER_BYTE)
Those two declarations are just noise in the patch: those definitions
are unchanged.

-   elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+   elog(DEBUG1, "vm_clear %s block %d",
RelationGetRelationName(rel), heapBlk);
This may be better as a separate patch.

-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
I think that this routine would gain in clarity if reworked as follows:
visibilitymap_count(Relation rel, BlockNumber *all_visible,
BlockNumber *all_frozen)

+   /*
+* Report ANALYZE to the stats collector, too.
However, if doing
+* inherited stats we shouldn't report, because the
stats collector only
+* tracks per-table stats.
+*/
+   if (!inh)
+   pgstat_report_analyze(onerel, totalrows,
totaldeadrows, relallfrozen);
Here we already know that this is working in the non-inherited code
path. As a whole, why that? Why isn't relallfrozen passed as an
argument of vac_update_relstats and then inserted in pg_class? Maybe I
am missing something..

+* mxid full-table scan limit. During full scan, we could skip some pags
+* according to all-frozen bit of visibility map.
s/pags/pages

+* Also, skipping even a single page accorinding to all-visible bit of
s/accorinding/according.

So, lazy_scan_heap is the central and 

Re: [HACKERS] Freeze avoidance of very large table.

2015-12-16 Thread Michael Paquier
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs  wrote:
> For me, rewriting the visibility map is a new data loss bug waiting to
> happen. I am worried that the group is not taking seriously the potential
> for catastrophe here.

FWIW, I'm following this line and merging the vm file into a single
unit looks like a ticking bomb. We may really want a separate _vm
file, like _vmf to track this new bit flag but this has already been
mentioned largely upthread...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-16 Thread Simon Riggs
On 9 December 2015 at 18:31, Robert Haas  wrote:

> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian  wrote:
> > On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
> >> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes 
> wrote:
> >> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <
> sawada.m...@gmail.com> wrote:
> >> >>
> >> >> Yeah, we need to consider to compute checksum if enabled.
> >> >> I've changed the patch, and attached.
> >> >> Please review it.
> >> >
> >> > Thanks for the update.  This now conflicts with the updates doesn to
> >> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
> >> > conflict in order to do some testing, but I'd like to get an updated
> >> > patch from the author in case I did it wrong.  I don't want to find
> >> > bugs that I just introduced myself.
> >> >
> >>
> >> Thank you for having a look.
> >
> > I would not bother mentioning this detail in the pg_upgrade manual page:
> >
> > +   Since the format of visibility map has been changed in version 9.6,
> > +   pg_upgrade creates and rewrite new
> '_vm'
> > +   file even if upgrading from 9.5 or before to 9.6 or later with link
> mode (-k).
>
> Really?  I know we don't always document things like this, but it
> seems like a good idea to me that we do so.
>

Agreed.

For me, rewriting the visibility map is a new data loss bug waiting to
happen. I am worried that the group is not taking seriously the potential
for catastrophe here. I think we can do it, but I think it needs these
things

* Clear notice that it is happening unconditionally and unavoidably
* Log files showing it has happened, action by action
* Very clear mechanism for resolving an incomplete or interrupted upgrade
process. Which VMs got upgraded? Which didn't?
* Ability to undo an upgrade attempt, somehow, ideally automatically by
default
* Ability to restart a failed upgrade attempt without doing a "double
upgrade", i.e. ensure transformation is immutable

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-16 Thread Andres Freund
On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs  wrote:
> > For me, rewriting the visibility map is a new data loss bug waiting to
> > happen. I am worried that the group is not taking seriously the potential
> > for catastrophe here.
> 
> FWIW, I'm following this line and merging the vm file into a single
> unit looks like a ticking bomb.

And what are those risks?

> We may really want a separate _vm file, like _vmf to track this new
> bit flag but this has already been mentioned largely upthread...

That'd double the overhead when those bits get unset.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-16 Thread Michael Paquier
On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund  wrote:
> On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
>> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs  wrote:
>> > For me, rewriting the visibility map is a new data loss bug waiting to
>> > happen. I am worried that the group is not taking seriously the potential
>> > for catastrophe here.
>>
>> FWIW, I'm following this line and merging the vm file into a single
>> unit looks like a ticking bomb.
>
> And what are those risks?

Incorrect vm file rewrite after a pg_upgrade run.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-16 Thread Andres Freund
On 2015-12-17 16:22:24 +0900, Michael Paquier wrote:
> On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund  wrote:
> > On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
> >> On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs  wrote:
> >> > For me, rewriting the visibility map is a new data loss bug waiting to
> >> > happen. I am worried that the group is not taking seriously the potential
> >> > for catastrophe here.
> >>
> >> FWIW, I'm following this line and merging the vm file into a single
> >> unit looks like a ticking bomb.
> >
> > And what are those risks?
> 
> Incorrect vm file rewrite after a pg_upgrade run.

If we can't manage to rewrite a file, replacing a binary b1 with a b10,
then we shouldn't be working on a database. And if we screw up, recovery
i is an rm *_vm away. I can't imagine that this is going to be the
actually complicated part of this feature.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-09 Thread Robert Haas
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian  wrote:
> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
>> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
>> > wrote:
>> >>
>> >> Yeah, we need to consider to compute checksum if enabled.
>> >> I've changed the patch, and attached.
>> >> Please review it.
>> >
>> > Thanks for the update.  This now conflicts with the updates doesn to
>> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>> > conflict in order to do some testing, but I'd like to get an updated
>> > patch from the author in case I did it wrong.  I don't want to find
>> > bugs that I just introduced myself.
>> >
>>
>> Thank you for having a look.
>
> I would not bother mentioning this detail in the pg_upgrade manual page:
>
> +   Since the format of visibility map has been changed in version 9.6,
> +   pg_upgrade creates and rewrite new 
> '_vm'
> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode 
> (-k).

Really?  I know we don't always document things like this, but it
seems like a good idea to me that we do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-04 Thread Masahiko Sawada
On Fri, Dec 4, 2015 at 9:51 PM, Jeff Janes  wrote:
> On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada  
> wrote:
>> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes  wrote:
>>> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada  
>>> wrote:
 On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
> wrote:
>>
>> Yeah, we need to consider to compute checksum if enabled.
>> I've changed the patch, and attached.
>> Please review it.
>
> Thanks for the update.  This now conflicts with the updates doesn to
> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
> conflict in order to do some testing, but I'd like to get an updated
> patch from the author in case I did it wrong.  I don't want to find
> bugs that I just introduced myself.
>

 Thank you for having a look.

 Attached updated v28 patch.
 Please review it.

 Regards,
>>>
>>> After running pg_upgrade, if I manually vacuum a table a start getting 
>>> warnings:
>>>
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32756
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32756
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32757
>>> WARNING:  page is not marked all-visible (and all-frozen) but
>>> visibility map bit(s) is set in relation "foo" page 32757
>>>
>>> The warnings are right where the blocks would start using the 2nd page
>>> of the _vm, so I think the problem is there.  And looking at the code,
>>> I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
>>> be correct.  We can't skip a header in the current (old) block each
>>> time we reach the end of the new block.  The thing we are skipping in
>>> the current block is half the time not a header, but the data at the
>>> halfway point through the block.
>>>
>>
>> Thank you for reviewing.
>>
>> You're right, it's not necessary.
>> Attached latest v29 patch which removes the mention in pg_upgrade 
>> documentation.
>
> I could successfully upgrade with this patch, with the link option and
> without.  After the update the tables seemed to have their correct
> visibility status, and after a VACUUM FREEZE then had the correct
> freeze status as well.

Thank you for tesing!

> Then I manually corrupted the vm file, just to make sure a corrupted
> one would get detected.  And much to my surprise, I didn't get any
> errors or warning when starting it back up and running vacuum freeze
> (unless I had page checksums turned on, then I got warnings and zeroed
> out pages).  But I guess this is not considered a warnable condition
> for bits to be off when they should be on, only the opposite.

How did you break the vm file?
The inconsistent flags state (set all-frozen but not set all-visible)
will be detected in visibility map code.But the vm file has
consecutive bits simply after its page header, so detecting its
corruption would be difficult unless whole page is corrupted.

> Consecutive VACUUM FREEZE operations with no DML activity between were
> not sped up by as much as I thought they would be, because it still
> had to walk all the indexes even though it didn't touch the table at
> all.  In real-world usage there would almost always be some dead
> tuples that would require an index scan anyway for a normal vacuum.

The another reason why consecutive VACUUM FREEZE were not sped up is
the many pages of that table were on disk cache, right?
In case of very large database, vacuuming large table would engage the
total vacuum time, so it would be more effective.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-04 Thread Jeff Janes
On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada  wrote:
> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes  wrote:
>> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada  
>> wrote:
>>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
 On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
 wrote:
>
> Yeah, we need to consider to compute checksum if enabled.
> I've changed the patch, and attached.
> Please review it.

 Thanks for the update.  This now conflicts with the updates doesn to
 fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
 conflict in order to do some testing, but I'd like to get an updated
 patch from the author in case I did it wrong.  I don't want to find
 bugs that I just introduced myself.

>>>
>>> Thank you for having a look.
>>>
>>> Attached updated v28 patch.
>>> Please review it.
>>>
>>> Regards,
>>
>> After running pg_upgrade, if I manually vacuum a table a start getting 
>> warnings:
>>
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32756
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32756
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32757
>> WARNING:  page is not marked all-visible (and all-frozen) but
>> visibility map bit(s) is set in relation "foo" page 32757
>>
>> The warnings are right where the blocks would start using the 2nd page
>> of the _vm, so I think the problem is there.  And looking at the code,
>> I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
>> be correct.  We can't skip a header in the current (old) block each
>> time we reach the end of the new block.  The thing we are skipping in
>> the current block is half the time not a header, but the data at the
>> halfway point through the block.
>>
>
> Thank you for reviewing.
>
> You're right, it's not necessary.
> Attached latest v29 patch which removes the mention in pg_upgrade 
> documentation.

I could successfully upgrade with this patch, with the link option and
without.  After the update the tables seemed to have their correct
visibility status, and after a VACUUM FREEZE then had the correct
freeze status as well.

Then I manually corrupted the vm file, just to make sure a corrupted
one would get detected.  And much to my surprise, I didn't get any
errors or warning when starting it back up and running vacuum freeze
(unless I had page checksums turned on, then I got warnings and zeroed
out pages).  But I guess this is not considered a warnable condition
for bits to be off when they should be on, only the opposite.

Consecutive VACUUM FREEZE operations with no DML activity between were
not sped up by as much as I thought they would be, because it still
had to walk all the indexes even though it didn't touch the table at
all.  In real-world usage there would almost always be some dead
tuples that would require an index scan anyway for a normal vacuum.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-12-02 Thread Masahiko Sawada
On Wed, Dec 2, 2015 at 9:30 AM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
>> You're right, it's not necessary.
>> Attached latest v29 patch which removes the mention in pg_upgrade 
>> documentation.
>
> The changes looks to be correct but I haven't tested.
> And I have some additional random comments.
>

Thank you for revewing!
Fixed these following points, and attached latest patch.

> visibilitymap.c:
>
>   In visibilitymap_set, the followint lines.
>
> map = PageGetContents(page);
> ...
> if (flags != (map[mapByte] & (flags << mapBit)))
>
>   map is (char*), PageGetContents returns (char*) but flags is
>   uint8. I think that defining map as (uint8*) would be safer.

I agree with you.
Fixed.

>
>   In visibilitymap_set, the following lines does something
>   different from what to do.  Only right side of the inequality
>   gets shifted and what should be used in right side is not flags
>   but VISIBILITYMAP_VALID_BITS.
>
>   - if (!(map[mapByte] & (1 << mapBit)))
>   + if (flags != (map[mapByte] & (flags << mapBit)))
>
>   Something like the following will do the right thing.
>
>   + if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))
>

You're right.
Fixed.

> analyze.c:
>
>  In do_analyze_rel, the successive if (!inh) in the following
>  steps looks a bit odd. This would be emphasized by the first if
>  block you added:) These blocks should be enclosed by if (!inh)
>  {} block.
>
>
>  >   /* Calculate the number of all-visible and all-frozen bit */
>  >   if (!inh)
>  >   relallvisible = visibilitymap_count(onerel, );
>  >   if (!inh)
>  >   vac_update_relstats(onerel,
>  >   if (!inh && !(options & VACOPT_VACUUM))
>  >   {
>  >   for (ind = 0; ind < nindexes; ind++)
>  ...
>  >   }
>  >   if (!inh)
>  >   pgstat_report_analyze(onerel, totalrows, totaldeadrows, 
> relallfrozen);

Fixed.

>
> vacuum.c:
>
>   >- * relpages and relallvisible, we try to maintain certain lazily-updated
>   >- * DDL flags such as relhasindex, by clearing them if no longer correct.
>   >- * It's safe to do this in VACUUM, which can't run in parallel with
>   >- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
>   >- * However, it's *not* safe to do it in an ANALYZE that's within an
>
>   >+ * relpages, relallvisible, we try to maintain certain lazily-updated
>
> Why did you just drop the 'and' after relpages? And this seems
> the only change of this file except the additinally missing
> letter just below:p
>
>   >+ * DDL flags such as relhasindex, by clearing them if no onger correct.
>   >+ * It's safe to do this in VACUUM, which can't run in
>   >+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a 
> transaction
>   >+ * block.  However, it's *not* safe to do it in an ANALYZE that's within 
> an

Fixed.

>
> nodeIndexonlyscan.c:
>
>  A duplicate letters.  And the line exceeds right margin.
>
>  > - * Note on Memory Ordering Effects: visibilitymap_test does not lock
> -> + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not 
> lock
> + * Note on Memory Ordering Effects: visibilitymap_get_status does not lock

Fixed.

>
>  The edited line exceeds right margin by removing a newline.
>
> - if (!visibilitymap_test(scandesc->heapRelation,
> - ItemPointerGetBlockNumber(tid),
> - >ioss_VMBuffer))
> + if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
> + >ioss_VMBuffer))
>

Fixed.

> costsize.c:
>
>  Duplicate words and it is the only change.
>
>  > - * pages for which the visibility map shows all tuples are visible.
> -> + * pages for which the visibility map map shows all tuples are visible.
> + * pages for which the visibility map shows all tuples are visible.

Fixed.

> pgstat.c:
>
>  The new parameter frozenpages of pgstat_report_vacuum() is
>  defined as int32, but its callers give BlockNumber(=uint32).  I
>  recommend to define the frozenpages as BlockNumber.
>  PgStat_MsgVacuum has a corresponding member defined as int32.

I agree with you.
Fixed.

> pg_upgrade.c:
>
>  BITS_PER_HEAPBLOCK is defined in two c files with the same
>  definition. This might be better to be merged into some header
>  file.

Fixed.
I moved these definition to visibilitymap.h.

>
> heapam_xlog.h, hio.h, execnodes.h:
>
>  Have we decided to rename vm to pim? Anyway it is inconsistent
>  with that of corresponding definition of the function body
>  remains as 'vm_buffer'. (I'm not confident on that, though.)
>
>  >-   Buffer vm_buffer, TransactionId cutoff_xid);
>  >+   Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
>

Fixed.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v30.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] Freeze avoidance of very large table.

2015-12-01 Thread Kyotaro HORIGUCHI
Hello,

> You're right, it's not necessary.
> Attached latest v29 patch which removes the mention in pg_upgrade 
> documentation.

The changes looks to be correct but I haven't tested.
And I have some additional random comments.


visibilitymap.c:

  In visibilitymap_set, the followint lines.

map = PageGetContents(page);
...
if (flags != (map[mapByte] & (flags << mapBit)))

  map is (char*), PageGetContents returns (char*) but flags is
  uint8. I think that defining map as (uint8*) would be safer.


  In visibilitymap_set, the following lines does something
  different from what to do.  Only right side of the inequality
  gets shifted and what should be used in right side is not flags
  but VISIBILITYMAP_VALID_BITS.

  - if (!(map[mapByte] & (1 << mapBit)))
  + if (flags != (map[mapByte] & (flags << mapBit)))

  Something like the following will do the right thing.

  + if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))


analyze.c:

 In do_analyze_rel, the successive if (!inh) in the following
 steps looks a bit odd. This would be emphasized by the first if
 block you added:) These blocks should be enclosed by if (!inh)
 {} block.


 >   /* Calculate the number of all-visible and all-frozen bit */
 >   if (!inh)
 >   relallvisible = visibilitymap_count(onerel, );
 >   if (!inh)
 >   vac_update_relstats(onerel,
 >   if (!inh && !(options & VACOPT_VACUUM))
 >   {
 >   for (ind = 0; ind < nindexes; ind++)
 ...
 >   }
 >   if (!inh)
 >   pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
  

vacuum.c:

  >- * relpages and relallvisible, we try to maintain certain lazily-updated
  >- * DDL flags such as relhasindex, by clearing them if no longer correct.
  >- * It's safe to do this in VACUUM, which can't run in parallel with
  >- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
  >- * However, it's *not* safe to do it in an ANALYZE that's within an
  
  >+ * relpages, relallvisible, we try to maintain certain lazily-updated
  
Why did you just drop the 'and' after relpages? And this seems
the only change of this file except the additinally missing
letter just below:p
  
  >+ * DDL flags such as relhasindex, by clearing them if no onger correct.
  >+ * It's safe to do this in VACUUM, which can't run in
  >+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a 
transaction
  >+ * block.  However, it's *not* safe to do it in an ANALYZE that's within an


nodeIndexonlyscan.c:

 A duplicate letters.  And the line exceeds right margin.

 > - * Note on Memory Ordering Effects: visibilitymap_test does not lock
-> + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock


 The edited line exceeds right margin by removing a newline.

- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- >ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ >ioss_VMBuffer))


costsize.c:

 Duplicate words and it is the only change.

 > - * pages for which the visibility map shows all tuples are visible.
-> + * pages for which the visibility map map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.

pgstat.c:

 The new parameter frozenpages of pgstat_report_vacuum() is
 defined as int32, but its callers give BlockNumber(=uint32).  I
 recommend to define the frozenpages as BlockNumber.
 PgStat_MsgVacuum has a corresponding member defined as int32.

pg_upgrade.c:

 BITS_PER_HEAPBLOCK is defined in two c files with the same
 definition. This might be better to be merged into some header
 file.


heapam_xlog.h, hio.h, execnodes.h:

 Have we decided to rename vm to pim? Anyway it is inconsistent
 with that of corresponding definition of the function body
 remains as 'vm_buffer'. (I'm not confident on that, though.)

 >-   Buffer vm_buffer, TransactionId cutoff_xid);
 >+   Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);

regards,


At Wed, 2 Dec 2015 00:10:09 +0530, Masahiko Sawada  
wrote in 
> On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes  wrote:
> > On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada  
> > wrote:
> > After running pg_upgrade, if I manually vacuum a table a start getting 
> > warnings:
> >
> > WARNING:  page is not marked all-visible (and all-frozen) but
> > visibility map bit(s) is set in relation "foo" page 32756
> > WARNING:  page is not marked all-visible (and all-frozen) but
> > visibility map bit(s) is set in relation "foo" page 32756
...
> > The warnings are right where the blocks would start using the 2nd page
> > of the _vm, so I think the problem is there.  

Re: [HACKERS] Freeze avoidance of very large table.

2015-12-01 Thread Masahiko Sawada
On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes  wrote:
> On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada  
> wrote:
>> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
>>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
>>> wrote:

 Yeah, we need to consider to compute checksum if enabled.
 I've changed the patch, and attached.
 Please review it.
>>>
>>> Thanks for the update.  This now conflicts with the updates doesn to
>>> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>>> conflict in order to do some testing, but I'd like to get an updated
>>> patch from the author in case I did it wrong.  I don't want to find
>>> bugs that I just introduced myself.
>>>
>>
>> Thank you for having a look.
>>
>> Attached updated v28 patch.
>> Please review it.
>>
>> Regards,
>
> After running pg_upgrade, if I manually vacuum a table a start getting 
> warnings:
>
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32756
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32756
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32757
> WARNING:  page is not marked all-visible (and all-frozen) but
> visibility map bit(s) is set in relation "foo" page 32757
>
> The warnings are right where the blocks would start using the 2nd page
> of the _vm, so I think the problem is there.  And looking at the code,
> I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
> be correct.  We can't skip a header in the current (old) block each
> time we reach the end of the new block.  The thing we are skipping in
> the current block is half the time not a header, but the data at the
> halfway point through the block.
>

Thank you for reviewing.

You're right, it's not necessary.
Attached latest v29 patch which removes the mention in pg_upgrade documentation.


Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v29.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-30 Thread Jeff Janes
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada  wrote:
> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
>> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
>> wrote:
>>>
>>> Yeah, we need to consider to compute checksum if enabled.
>>> I've changed the patch, and attached.
>>> Please review it.
>>
>> Thanks for the update.  This now conflicts with the updates doesn to
>> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
>> conflict in order to do some testing, but I'd like to get an updated
>> patch from the author in case I did it wrong.  I don't want to find
>> bugs that I just introduced myself.
>>
>
> Thank you for having a look.
>
> Attached updated v28 patch.
> Please review it.
>
> Regards,

After running pg_upgrade, if I manually vacuum a table a start getting warnings:

WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
WARNING:  page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757

The warnings are right where the blocks would start using the 2nd page
of the _vm, so I think the problem is there.  And looking at the code,
I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
be correct.  We can't skip a header in the current (old) block each
time we reach the end of the new block.  The thing we are skipping in
the current block is half the time not a header, but the data at the
halfway point through the block.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-30 Thread Masahiko Sawada
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
> On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
> wrote:
>>
>> Yeah, we need to consider to compute checksum if enabled.
>> I've changed the patch, and attached.
>> Please review it.
>
> Thanks for the update.  This now conflicts with the updates doesn to
> fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
> conflict in order to do some testing, but I'd like to get an updated
> patch from the author in case I did it wrong.  I don't want to find
> bugs that I just introduced myself.
>

Thank you for having a look.

Attached updated v28 patch.
Please review it.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v28.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-30 Thread Bruce Momjian
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
> > wrote:
> >>
> >> Yeah, we need to consider to compute checksum if enabled.
> >> I've changed the patch, and attached.
> >> Please review it.
> >
> > Thanks for the update.  This now conflicts with the updates doesn to
> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
> > conflict in order to do some testing, but I'd like to get an updated
> > patch from the author in case I did it wrong.  I don't want to find
> > bugs that I just introduced myself.
> >
> 
> Thank you for having a look.

I would not bother mentioning this detail in the pg_upgrade manual page:

+   Since the format of visibility map has been changed in version 9.6,
+   pg_upgrade creates and rewrite new '_vm'
+   file even if upgrading from 9.5 or before to 9.6 or later with link mode 
(-k).

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-30 Thread Andres Freund
On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote:
> I would not bother mentioning this detail in the pg_upgrade manual page:
> 
> +   Since the format of visibility map has been changed in version 9.6,
> +   pg_upgrade creates and rewrite new 
> '_vm'
> +   file even if upgrading from 9.5 or before to 9.6 or later with link mode 
> (-k).

Might be worthwhile to keep as that influences the runtime for link mode
when migrating <9.6 -> 9.6.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-30 Thread Bruce Momjian
On Mon, Nov 30, 2015 at 07:05:21PM +0100, Andres Freund wrote:
> On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote:
> > I would not bother mentioning this detail in the pg_upgrade manual page:
> > 
> > +   Since the format of visibility map has been changed in version 9.6,
> > +   pg_upgrade creates and rewrite new 
> > '_vm'
> > +   file even if upgrading from 9.5 or before to 9.6 or later with link 
> > mode (-k).
> 
> Might be worthwhile to keep as that influences the runtime for link mode
> when migrating <9.6 -> 9.6.

It is hard to see that it would have a measurable duration.  The
pg_upgrade docs are already very long and this detail doesn't seems
significant.  Can someone test the overhead?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-28 Thread Jeff Janes
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  wrote:
>
> Yeah, we need to consider to compute checksum if enabled.
> I've changed the patch, and attached.
> Please review it.

Thanks for the update.  This now conflicts with the updates doesn to
fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong.  I don't want to find
bugs that I just introduced myself.

Thanks,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-24 Thread Masahiko Sawada
On Mon, Nov 23, 2015 at 6:27 AM, Jeff Janes  wrote:
> On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada  
> wrote:
>
>> Thank you for taking the time to review this patch!
>> The updated version patch is attached.
>
> I am skeptical about just copying the old page header to be two new
> page headers.  I don't know what the implications for this will be on
> pd_lsn.  Since pg_upgrade can only run on a cluster that was cleanly
> shutdown, I think that just copying it from the old page to both new
> pages it turns into might be fine.  But pd_checksum will certainly be
> wrong, breaking pg_upgrade for cases where checksums are turned on in.
> It needs to be recomputed on both new pages.  It looks like there is
> no precedence for doing that in pg_upgrade so this will be breaking
> new ground.
>

Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v27.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-22 Thread Masahiko Sawada
On Sat, Nov 21, 2015 at 6:50 AM, Jeff Janes  wrote:
> On Thu, Nov 19, 2015 at 6:44 AM, Masahiko Sawada  
> wrote:
>> On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes  wrote:
>>> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes  wrote:

 I get an error when running pg_upgrade from 9.4 to 9.6-this

 error while copying relation "mediawiki.archive"
 ("/tmp/data/base/16414/21043_vm" to
 "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>>>
>>> OK, so the problem seems to be that rewriteVisibilitymap can get
>>> called with errno already set to a nonzero value.
>>>
>>> It never clears it, and then fails at the end despite that no error
>>> has actually occurred.
>>>
>>> Just setting it to 0 at the top of the function seems to be correct
>>> thing to do.  Or does it need to save the old value and restore it?
>>
>> Thank you for testing!
>> I think that the former is better, so attached latest patch.
>>
>>> But now when I want to do the upgrade faster, I run into this:
>>>
>>> "This utility cannot upgrade from PostgreSQL version from 9.5 or
>>> before to 9.6 or later with link mode."
>>>
>>> Is this really an acceptable a tradeoff?  Surely we can arrange to
>>> link everything else and rewrite just the _vm, which is a tiny portion
>>> of the data directory.  I don't think that -k promises to link
>>> everything it possibly can.
>>
>> I agree.
>> I've changed the patch so that.
>> pg_upgarde creates new _vm file and rewrites it even if upgrading to
>> 9.6 with link mode.
>
>
> The rewrite code thinks that only the first page of a vm has a header
> of size SizeOfPageHeaderData, and the rest of the pages have a zero
> size header.  So the resulting _vm is corrupt.
>
> After pg_upgrade, doing a vacuum freeze verbose gives:
>
>
> WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
> zeroing out page
> WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
> zeroing out page
>

Thank you for taking the time to review this patch!
The updated version patch is attached.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v26.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-22 Thread Jeff Janes
On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada  wrote:

> Thank you for taking the time to review this patch!
> The updated version patch is attached.

I am skeptical about just copying the old page header to be two new
page headers.  I don't know what the implications for this will be on
pd_lsn.  Since pg_upgrade can only run on a cluster that was cleanly
shutdown, I think that just copying it from the old page to both new
pages it turns into might be fine.  But pd_checksum will certainly be
wrong, breaking pg_upgrade for cases where checksums are turned on in.
It needs to be recomputed on both new pages.  It looks like there is
no precedence for doing that in pg_upgrade so this will be breaking
new ground.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-20 Thread Jeff Janes
On Thu, Nov 19, 2015 at 6:44 AM, Masahiko Sawada  wrote:
> On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes  wrote:
>> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes  wrote:
>>>
>>> I get an error when running pg_upgrade from 9.4 to 9.6-this
>>>
>>> error while copying relation "mediawiki.archive"
>>> ("/tmp/data/base/16414/21043_vm" to
>>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>>
>> OK, so the problem seems to be that rewriteVisibilitymap can get
>> called with errno already set to a nonzero value.
>>
>> It never clears it, and then fails at the end despite that no error
>> has actually occurred.
>>
>> Just setting it to 0 at the top of the function seems to be correct
>> thing to do.  Or does it need to save the old value and restore it?
>
> Thank you for testing!
> I think that the former is better, so attached latest patch.
>
>> But now when I want to do the upgrade faster, I run into this:
>>
>> "This utility cannot upgrade from PostgreSQL version from 9.5 or
>> before to 9.6 or later with link mode."
>>
>> Is this really an acceptable a tradeoff?  Surely we can arrange to
>> link everything else and rewrite just the _vm, which is a tiny portion
>> of the data directory.  I don't think that -k promises to link
>> everything it possibly can.
>
> I agree.
> I've changed the patch so that.
> pg_upgarde creates new _vm file and rewrites it even if upgrading to
> 9.6 with link mode.


The rewrite code thinks that only the first page of a vm has a header
of size SizeOfPageHeaderData, and the rest of the pages have a zero
size header.  So the resulting _vm is corrupt.

After pg_upgrade, doing a vacuum freeze verbose gives:


WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 1 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 2 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 3 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 4 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 5 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 6 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 7 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
zeroing out page
WARNING:  invalid page in block 8 of relation base/16402/22430_vm;
zeroing out page

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-19 Thread Masahiko Sawada
On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes  wrote:
> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes  wrote:
>>
>> I get an error when running pg_upgrade from 9.4 to 9.6-this
>>
>> error while copying relation "mediawiki.archive"
>> ("/tmp/data/base/16414/21043_vm" to
>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>
> OK, so the problem seems to be that rewriteVisibilitymap can get
> called with errno already set to a nonzero value.
>
> It never clears it, and then fails at the end despite that no error
> has actually occurred.
>
> Just setting it to 0 at the top of the function seems to be correct
> thing to do.  Or does it need to save the old value and restore it?

Thank you for testing!
I think that the former is better, so attached latest patch.

> But now when I want to do the upgrade faster, I run into this:
>
> "This utility cannot upgrade from PostgreSQL version from 9.5 or
> before to 9.6 or later with link mode."
>
> Is this really an acceptable a tradeoff?  Surely we can arrange to
> link everything else and rewrite just the _vm, which is a tiny portion
> of the data directory.  I don't think that -k promises to link
> everything it possibly can.

I agree.
I've changed the patch so that.
pg_upgarde creates new _vm file and rewrites it even if upgrading to
9.6 with link mode.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v25.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-18 Thread Jeff Janes
On Tue, Nov 17, 2015 at 10:32 PM, Masahiko Sawada  wrote:
>
> Attached latest v24 patch.
> I've changed patch so that just adding frozen bit into visibility map.
> So the size of patch is almost half of previous one.
>

Should there be an Assert in visibilitymap_get_status (or elsewhere)
against the impossible state of being all frozen but not all visible?

I get an error when running pg_upgrade from 9.4 to 9.6-this

error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory


Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-18 Thread Jeff Janes
On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes  wrote:
>
> I get an error when running pg_upgrade from 9.4 to 9.6-this
>
> error while copying relation "mediawiki.archive"
> ("/tmp/data/base/16414/21043_vm" to
> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory

OK, so the problem seems to be that rewriteVisibilitymap can get
called with errno already set to a nonzero value.

It never clears it, and then fails at the end despite that no error
has actually occurred.

Just setting it to 0 at the top of the function seems to be correct
thing to do.  Or does it need to save the old value and restore it?

But now when I want to do the upgrade faster, I run into this:

"This utility cannot upgrade from PostgreSQL version from 9.5 or
before to 9.6 or later with link mode."

Is this really an acceptable a tradeoff?  Surely we can arrange to
link everything else and rewrite just the _vm, which is a tiny portion
of the data directory.  I don't think that -k promises to link
everything it possibly can.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Thom Brown
On 17 November 2015 at 10:29, Masahiko Sawada  wrote:
>
>
> On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas  wrote:
>> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila 
>> wrote:
>>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund 
>>> wrote:
 On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
 > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs 
 > wrote:
 > >
 > > On 1 October 2015 at 23:30, Josh Berkus  wrote:
 > >>
 > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
 > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
 > >> > 
 > wrote:
 > >> >> I wonder how much it's worth renaming only the file extension
 > >> >> while
 > >> >> there are many places where "visibility map" and "vm" are used,
 > >> >> for example, log messages, function names, variables, etc.
 > >> >
 > >> > I'd be inclined to keep calling it the visibility map (vm) even
 > >> > if
 > >> > it
 > >> > also contains freeze information.
 > >> >
 >
 > What is your main worry about changing the name of this map, is it
 > about more code churn or is it about that we might introduce new
 > issues
 > or is it about that people are already accustomed to call this map as
 > visibility map?

 Several:
 * Visibility map is rather descriptive, none of the replacement terms
   imo come close. Few people will know what a 'freeze' map is.
 * It increases the size of the patch considerably
 * It forces tooling that knows about the layout of the database
   directory to change their tools

>>>
>>> All these points are legitimate.
>>>
 On the benfit side the only argument I've heard so far is that it allows
 to disambiguate the format. But, uh, a look at the major version does
 that just as well, for far less trouble.

 > It seems to me quite logical for understanding purpose as well.  Any
 > new
 > person who wants to work in this area or is looking into it will
 > always
 > wonder why this map is named as visibility map even though it contains
 > information about visibility of page as well as frozen state of page.

 Being frozen is about visibility as well.

>>>
>>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>>> is
>>> related to visibility, but still it is a separate state of page and used
>>> for
>>> different
>>> purpose.  I think this is a subjective point and we could go either way,
>>> it
>>> is
>>> just a matter in which way more people are comfortable.
>>
>> I'm stickin' with what I said before, and what I think Andres is
>> saying too: renaming the map is a horrible idea.  It produces lots of
>> code churn for no real benefit.  We usually avoid such changes, and I
>> think we should do so here, too.
>
> I understood.
> I'm going to turn the patch back to visibility map, and just add the logic
> of enhancement of VACUUM FREEZE.
> If we want to add the other status not related to visibility into visibility
> map in the future, it would be worth to consider.

Could someone post a TL;DR summary of what the current plan looks
like?  I can see there is a huge amount of discussion to trawl back
through.  I can see it's something to do with the visibility map.  And
does it avoid freezing very large tables like the title originally
sought?

Thanks

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Thom Brown
On 17 November 2015 at 15:43, Jim Nasby  wrote:
> On 11/17/15 4:41 AM, Thom Brown wrote:
>>
>> Could someone post a TL;DR summary of what the current plan looks
>> like?  I can see there is a huge amount of discussion to trawl back
>> through.  I can see it's something to do with the visibility map.  And
>> does it avoid freezing very large tables like the title originally
>> sought?
>
>
> Basically, it follows the same pattern that all-visible bits do, except
> instead of indicating a page is all-visible, the bit shows that all tuples
> on the page are frozen. That allows a scan_all vacuum to skip those pages.

So the visibility map is being repurposed?  And if a row on a frozen
page is modified, what happens to the visibility of all other rows on
that page, as the bit will be set back to 0?  I think I'm missing a
critical part of this functionality.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Jim Nasby

On 11/17/15 4:41 AM, Thom Brown wrote:

Could someone post a TL;DR summary of what the current plan looks
like?  I can see there is a huge amount of discussion to trawl back
through.  I can see it's something to do with the visibility map.  And
does it avoid freezing very large tables like the title originally
sought?


Basically, it follows the same pattern that all-visible bits do, except 
instead of indicating a page is all-visible, the bit shows that all 
tuples on the page are frozen. That allows a scan_all vacuum to skip 
those pages.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Masahiko Sawada
On Wed, Nov 18, 2015 at 12:56 AM, Thom Brown  wrote:
> On 17 November 2015 at 15:43, Jim Nasby  wrote:
>> On 11/17/15 4:41 AM, Thom Brown wrote:
>>>
>>> Could someone post a TL;DR summary of what the current plan looks
>>> like?  I can see there is a huge amount of discussion to trawl back
>>> through.  I can see it's something to do with the visibility map.  And
>>> does it avoid freezing very large tables like the title originally
>>> sought?
>>
>>
>> Basically, it follows the same pattern that all-visible bits do, except
>> instead of indicating a page is all-visible, the bit shows that all tuples
>> on the page are frozen. That allows a scan_all vacuum to skip those pages.
>
> So the visibility map is being repurposed?

My proposal is to add additional one bit that indicates all tuples on
page are completely frozen, into visibility map.
That is, the visibility map will become a bitmap with two bits
(all-visible, all-frozen) per page.

> And if a row on a frozen
> page is modified, what happens to the visibility of all other rows on
> that page, as the bit will be set back to 0?

In this case, the corresponding VM both bits are cleared.
Such behaviour is almost same as what postgresql is doing today.


Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Masahiko Sawada
On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas > wrote:
> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila > wrote:
>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund > wrote:
>>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs >
>>> > wrote:
>>> > >
>>> > > On 1 October 2015 at 23:30, Josh Berkus > wrote:
>>> > >>
>>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <
masao.fu...@gmail.com >
>>> > wrote:
>>> > >> >> I wonder how much it's worth renaming only the file extension
>>> > >> >> while
>>> > >> >> there are many places where "visibility map" and "vm" are used,
>>> > >> >> for example, log messages, function names, variables, etc.
>>> > >> >
>>> > >> > I'd be inclined to keep calling it the visibility map (vm) even
if
>>> > >> > it
>>> > >> > also contains freeze information.
>>> > >> >
>>> >
>>> > What is your main worry about changing the name of this map, is it
>>> > about more code churn or is it about that we might introduce new
issues
>>> > or is it about that people are already accustomed to call this map as
>>> > visibility map?
>>>
>>> Several:
>>> * Visibility map is rather descriptive, none of the replacement terms
>>>   imo come close. Few people will know what a 'freeze' map is.
>>> * It increases the size of the patch considerably
>>> * It forces tooling that knows about the layout of the database
>>>   directory to change their tools
>>>
>>
>> All these points are legitimate.
>>
>>> On the benfit side the only argument I've heard so far is that it allows
>>> to disambiguate the format. But, uh, a look at the major version does
>>> that just as well, for far less trouble.
>>>
>>> > It seems to me quite logical for understanding purpose as well.  Any
new
>>> > person who wants to work in this area or is looking into it will
always
>>> > wonder why this map is named as visibility map even though it contains
>>> > information about visibility of page as well as frozen state of page.
>>>
>>> Being frozen is about visibility as well.
>>>
>>
>> OTOH being visible doesn't mean page is frozen.  I understand that
frozen is
>> related to visibility, but still it is a separate state of page and used
for
>> different
>> purpose.  I think this is a subjective point and we could go either way,
it
>> is
>> just a matter in which way more people are comfortable.
>
> I'm stickin' with what I said before, and what I think Andres is
> saying too: renaming the map is a horrible idea.  It produces lots of
> code churn for no real benefit.  We usually avoid such changes, and I
> think we should do so here, too.

I understood.
I'm going to turn the patch back to visibility map, and just add the logic
of enhancement of VACUUM FREEZE.
If we want to add the other status not related to visibility into
visibility map in the future, it would be worth to consider.

Regards,

--
Masahiko Sawada


-- 
Regards,

--
Masahiko Sawada


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Masahiko Sawada
On Tue, Nov 17, 2015 at 7:29 PM, Masahiko Sawada  wrote:
>
>
> On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas  wrote:
>> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila 
>> wrote:
>>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund 
>>> wrote:
 On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
 > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs 
 > wrote:
 > >
 > > On 1 October 2015 at 23:30, Josh Berkus  wrote:
 > >>
 > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
 > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
 > >> > 
 > wrote:
 > >> >> I wonder how much it's worth renaming only the file extension
 > >> >> while
 > >> >> there are many places where "visibility map" and "vm" are used,
 > >> >> for example, log messages, function names, variables, etc.
 > >> >
 > >> > I'd be inclined to keep calling it the visibility map (vm) even
 > >> > if
 > >> > it
 > >> > also contains freeze information.
 > >> >
 >
 > What is your main worry about changing the name of this map, is it
 > about more code churn or is it about that we might introduce new
 > issues
 > or is it about that people are already accustomed to call this map as
 > visibility map?

 Several:
 * Visibility map is rather descriptive, none of the replacement terms
   imo come close. Few people will know what a 'freeze' map is.
 * It increases the size of the patch considerably
 * It forces tooling that knows about the layout of the database
   directory to change their tools

>>>
>>> All these points are legitimate.
>>>
 On the benfit side the only argument I've heard so far is that it allows
 to disambiguate the format. But, uh, a look at the major version does
 that just as well, for far less trouble.

 > It seems to me quite logical for understanding purpose as well.  Any
 > new
 > person who wants to work in this area or is looking into it will
 > always
 > wonder why this map is named as visibility map even though it contains
 > information about visibility of page as well as frozen state of page.

 Being frozen is about visibility as well.

>>>
>>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>>> is
>>> related to visibility, but still it is a separate state of page and used
>>> for
>>> different
>>> purpose.  I think this is a subjective point and we could go either way,
>>> it
>>> is
>>> just a matter in which way more people are comfortable.
>>
>> I'm stickin' with what I said before, and what I think Andres is
>> saying too: renaming the map is a horrible idea.  It produces lots of
>> code churn for no real benefit.  We usually avoid such changes, and I
>> think we should do so here, too.
>
> I understood.
> I'm going to turn the patch back to visibility map, and just add the logic
> of enhancement of VACUUM FREEZE.

Attached latest v24 patch.
I've changed patch so that just adding frozen bit into visibility map.
So the size of patch is almost half of previous one.

Please review it.

Regards,

--
Masahiko Sawada
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
 		 * If the page has only visible tuples, then we can find out the free
 		 * space from the FSM and move on.
 		 */
-		if (visibilitymap_test(rel, blkno, ))
+		if (VM_ALL_VISIBLE(rel, blkno, ))
 		{
 			freespace = GetRecordedFreeSpace(rel, blkno);
 			stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e14851..c75a166 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5905,7 +5905,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an eager freezing if the table's
 pg_class.relfrozenxid field has reached
 the age specified by this setting.  The default is 150 million
 transactions.  Although users can set this value anywhere from zero to
@@ -5949,7 +5949,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
   

-VACUUM performs a whole-table scan if the table's
+VACUUM performs an eager freezing if the table's
 pg_class.relminmxid field has reached
 the age specified by this setting.  The default is 150 million multixacts.
 Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- 

Re: [HACKERS] Freeze avoidance of very large table.

2015-11-16 Thread Robert Haas
On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila  wrote:
> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund  wrote:
>> On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
>> > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs 
>> > wrote:
>> > >
>> > > On 1 October 2015 at 23:30, Josh Berkus  wrote:
>> > >>
>> > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
>> > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao 
>> > wrote:
>> > >> >> I wonder how much it's worth renaming only the file extension
>> > >> >> while
>> > >> >> there are many places where "visibility map" and "vm" are used,
>> > >> >> for example, log messages, function names, variables, etc.
>> > >> >
>> > >> > I'd be inclined to keep calling it the visibility map (vm) even if
>> > >> > it
>> > >> > also contains freeze information.
>> > >> >
>> >
>> > What is your main worry about changing the name of this map, is it
>> > about more code churn or is it about that we might introduce new issues
>> > or is it about that people are already accustomed to call this map as
>> > visibility map?
>>
>> Several:
>> * Visibility map is rather descriptive, none of the replacement terms
>>   imo come close. Few people will know what a 'freeze' map is.
>> * It increases the size of the patch considerably
>> * It forces tooling that knows about the layout of the database
>>   directory to change their tools
>>
>
> All these points are legitimate.
>
>> On the benfit side the only argument I've heard so far is that it allows
>> to disambiguate the format. But, uh, a look at the major version does
>> that just as well, for far less trouble.
>>
>> > It seems to me quite logical for understanding purpose as well.  Any new
>> > person who wants to work in this area or is looking into it will always
>> > wonder why this map is named as visibility map even though it contains
>> > information about visibility of page as well as frozen state of page.
>>
>> Being frozen is about visibility as well.
>>
>
> OTOH being visible doesn't mean page is frozen.  I understand that frozen is
> related to visibility, but still it is a separate state of page and used for
> different
> purpose.  I think this is a subjective point and we could go either way, it
> is
> just a matter in which way more people are comfortable.

I'm stickin' with what I said before, and what I think Andres is
saying too: renaming the map is a horrible idea.  It produces lots of
code churn for no real benefit.  We usually avoid such changes, and I
think we should do so here, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   3   >