Re: [HACKERS] Btrfs clone WIP patch
Greg Smith wrote:
> I think I can see how to construct such an example for the btrfs version,
> but having you show that explicitly (preferably with a whole sample
> session executing it) will also help reviewers. Remember: if you want to
> get your submission off to a good start, the reviewer should be able to
> run your sample test, see the code work, and do something fun within a
> few seconds of compiling it. Make that easy for them, and your reviewer
> will start with a good impression of you and a positive outlook for the
> change.

Yes, an example is very simple with Btrfs, since it only requires one GUC
variable and that the cluster be created on a Btrfs file system.
Constructing a ZFS example is decidedly non-trivial and I'm starting to
question if it's worth it.

> Now onto the code nitpicking!
>
> = Extension vs. GUC =
>
> In addition to not polluting the postgresql.conf.sample, there's another
> reason this might make for better extension material eventually. Adding
> new execv calls that are building strings like this is just generally a
> risky thing. It would be nice from a security perspective if that entire
> mechanism wasn't even introduced into the server at all unless someone
> loaded the extension.
>
> An extension implementation will end up being more code, both to add a
> useful hook for replacing these calls and for the extension packaging
> itself. Having a bit more code in contrib/ but less in src
> postgresql.conf is probably a net positive trade though.

I will look into how to write an extension.

> = Diff reduction / code refactoring =
>
> Looks like you added a "File Operation Options" entry into guc.c but then
> didn't use it here? I would just keep this in an existing category for
> now, try to reduce the diff length of the proof of concept version as
> much as possible in the beginning.

Yes, that is a good idea.

> On the topic of smaller diffs, the similar cut and paste sections of the
> two entries that both do fork/exec/waitpid should really be refactored
> into one function. The archiver does something similar for running
> archive_command, there may be some reuse or refactoring of its code to
> present this interface.

I'll look at archive_command to see what might be in common.

> Again, this sort of refactoring is not necessary as a POC patch. But it
> will probably come up if this moves toward commit candidate.
>
> = Recursion and directory navigation =
>
>> In either case, the directories are copied recursively while the
>> Postgres internal copydir function does not recurse. I don't think that
>> should be a problem since there shouldn't be nested directories in the
>> first place.
>
> copydir takes an option for whether it should recurse or not.
>
> The rm side of this makes me twitch for a number of reasons. First off,
> there's just the general scariness of the concept of shelling out to run
> rm recursively with some text string you build. The worst time I saw a
> bug in that sort of code destroyed a terabyte, and the last time I saw
> such a bug was only a week ago. Validation you're doing the right thing
> is always job #1 in removing files.

I needed to call an external command to remove a directory only when
experimenting on ZFS since the regular implementation works fine on Btrfs.
Unlike Btrfs, ZFS does not have any capability to clone individual files.
Therefore, a directory under base/ has to start out as a ZFS snapshot,
which can be cloned with the zfs clone command to become a snapshot
directory with exactly the same contents. To remove the clone, the zfs
destroy command has to be called on the clone. AFAICT, clone and remove
operations on ZFS always operate on a whole directory at a time.

> The same sort of issue is in your external_copydir. Iterating into
> subdirectories when it doesn't happen now just isn't safe, even though
> the one expected case you're hooking won't be any different. You really
> can't just do that.

Would this work instead, and is there any concern about files that start
with a "."?

    cp * --reflink=auto

> Regardless, you need to keep most of the structure to copydir anyway.
> Error handling, handling cancellation, and fsync calls are all vital
> things. You probably have to make the forked command copy a single file
> at a time to get the same interrupt handling behavior.

In an earlier implementation, I did call cp --reflink=auto once per regular
file, preserving the behavior of copydir. On Btrfs, this works well, though
slightly slower due to extra processes. AFAIK, there's no way to do
something equivalent on ZFS without coming up with a much more complicated
scheme involving both links and clones.

I don't think it will be possible to implement a scheme that works on ZFS
and addresses your concerns about file and directory handling that is not
many times more complex than what I have so far. OTOH, I think the approach
I have already implemented, which calls an external command for each
regular file to copy, might be acceptable. Since I don't personally have much
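The per-file approach discussed above (one copy per regular file, no descent into subdirectories, dotfiles included) can be sketched in plain POSIX C. This is only an illustration under stated assumptions, not code from any version of the patch: `for_each_regular_file`, `file_cb` and `print_path` are hypothetical names, and a real backend implementation would use PostgreSQL's own directory and error-reporting routines instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical callback type: invoked once per regular file found. */
typedef int (*file_cb) (const char *path, void *arg);

/* Example callback: just print the path; returns 0 to continue. */
int
print_path(const char *path, void *arg)
{
    (void) arg;
    printf("%s\n", path);
    return 0;
}

/*
 * Visit every regular file directly inside dir, dotfiles included,
 * without descending into subdirectories. Returns the number of files
 * visited, or -1 on error. A readdir() failure is treated like the end
 * of the directory in this sketch.
 */
int
for_each_regular_file(const char *dir, file_cb cb, void *arg)
{
    DIR        *d;
    struct dirent *de;
    int         count = 0;

    assert(dir != NULL && cb != NULL);

    d = opendir(dir);
    if (d == NULL)
        return -1;

    while ((de = readdir(d)) != NULL)
    {
        char        path[4096];
        struct stat st;
        int         len;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        len = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (len < 0 || len >= (int) sizeof(path))
        {
            closedir(d);
            return -1;          /* path too long */
        }

        /* lstat() does not follow symlinks, so only real files qualify */
        if (lstat(path, &st) != 0)
        {
            closedir(d);
            return -1;
        }
        if (!S_ISREG(st.st_mode))
            continue;           /* skip subdirectories, links, etc. */

        if (cb(path, arg) != 0)
        {
            closedir(d);
            return -1;
        }
        count++;
    }

    closedir(d);
    return count;
}
```

Unlike a shell glob such as `*`, the loop sees names beginning with a dot, and because it never recurses it keeps the non-recursive behavior of `copydir(..., false)`. The callback is where a per-file clone or copy, plus an interrupt check, would go.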
Re: [HACKERS] Btrfs clone WIP patch
Phil Sorber wrote:
> On Wed, Feb 13, 2013 at 5:48 PM, Josh Berkus j...@agliodbs.com wrote:
>> On 02/13/2013 02:13 PM, Tom Lane wrote:
>>> The big-picture question of course is whether we want to carry and
>>> maintain a filesystem-specific hack. I don't have a sense that btrfs
>>> is so widely used as to justify this.
>>
>> If this is a valuable hack, it seems like it could work on ZFS as well.
>> If we could make it for any snapshot-capable filesystem, and not just
>> BTRFS, then it would make more sense.
>
> I was thinking that too, but I think this is a file level clone, not a
> whole filesystem. As far as I can tell, you can't clone individual files
> in ZFS.

I've been thinking about both of these issues and decided to try a
different approach. This patch adds GUC options for two external commands:
one to copy a directory and one to delete a directory. This allows
filesystem-specific tools to be used to accomplish the efficient cloning
without Postgres having to know any details.

This works particularly well for Btrfs. On a GNU/Linux system, one can
simply configure the external copy command as "/bin/cp -r --reflink=auto"
and efficient cloning will be done on file systems that support it and
ordinary copying will be done otherwise. The directory deletion command
isn't needed and no special Postgres setup is required other than putting
the data directory on a Btrfs file system.

I have just been experimenting with ZFS and it does not seem to have any
capability or interface for cloning ordinary files or directories so the
configuration is not as straightforward. However, I was able to set up a
Postgres cluster as a hierarchy of ZFS file systems in the same pool with
each directory under base being a separate file system and configure
Postgres to call shell scripts which call zfs snapshot and clone commands
to do the cloning and deleting.

In either case, the directories are copied recursively while the Postgres
internal copydir function does not recurse. I don't think that should be a
problem since there shouldn't be nested directories in the first place.

--
Jonathan Ross Rogers

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 37bd0a4..5b87eb1 100644
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
***************
*** 23,28 ****
--- 23,29 ----
  #include <locale.h>
  #include <unistd.h>
  #include <sys/stat.h>
+ #include <sys/wait.h>
  
  #include "access/genam.h"
  #include "access/heapam.h"
***************
*** 44,49 ****
--- 45,51 ----
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
+ #include "postmaster/fork_process.h"
  #include "storage/bufmgr.h"
  #include "storage/copydir.h"
  #include "storage/fd.h"
***************
*** 608,614 ****
               *
               * We don't need to copy subdirectories
               */
!             copydir(srcpath, dstpath, false);
  
              /* Record the filesystem change in XLOG */
              {
--- 610,619 ----
               *
               * We don't need to copy subdirectories
               */
!             if (external_copy_command)
!                 external_copydir(srcpath, dstpath);
!             else
!                 copydir(srcpath, dstpath, false);
  
              /* Record the filesystem change in XLOG */
              {
***************
*** 1702,1707 ****
--- 1707,1767 ----
      return result;
  }
  
+ 
+ #define MAX_RM_TABLESPACE_WORDS 10
+ char *rm_tablespace_dir_command = NULL;
+ 
+ 
+ /* Remove a single tablespace directory by calling an external command */
+ void
+ external_rm_tablespace_dir(char *dir)
+ {
+     char *cmd_path;
+     /* leave space for directory names and terminator */
+     char *argv[MAX_RM_TABLESPACE_WORDS + 3];
+     int argc = 0;
+     pid_t pid;
+     int status;
+ 
+     char *sc = strdup(rm_tablespace_dir_command);
+     char *token;
+ 
+     Assert(sc);
+     token = strtok(sc, " ");
+     while (token)
+     {
+         ereport(DEBUG1, (errmsg("Appending \"%s\"", token)));
+         argv[argc++] = token;
+         Assert(argc <= MAX_RM_TABLESPACE_WORDS);
+         token = strtok(NULL, " ");
+     }
+     Assert(argc > 0);
+     argv[argc++] = dir;
+     cmd_path = argv[0];
+     ereport(DEBUG1, (errmsg("Calling %s to delete \"%s\"", cmd_path, dir)));
+     argv[argc] = NULL;
+     Assert(argv[argc] == NULL);
+ 
+     /* Fire off execv in child */
+     if ((pid = fork_process()) == 0)
+     {
+         if (execv(cmd_path, argv) < 0)
+         {
+             ereport(LOG,
+                     (errmsg("could not execute \"%s\"", cmd_path)));
+             /* We're already in the child process here, can't return */
+             exit(1);
+         }
+     }
+     else {
+         waitpid(pid, &status, 0);
+         if (status) {
+             ereport(ERROR, (errmsg("could not delete dir \"%s\"", dir)));
+         }
+     }
+     free(sc);
+ }
+ 
  /*
   * Remove tablespace directories
   *
***************
*** 1747,1753 ****
              continue;
          }
  
!         if (!rmtree(dstpath, true))
              ereport(WARNING,
                      (errmsg("some useless files may be left behind in old database directory \"%s\"",
                              dstpath)));
--- 1807,1815 ----
              continue;
          }
  
!         if (rm_tablespace_dir_command)
!             external_rm_tablespace_dir(dstpath);
!         else if (!rmtree(dstpath, true))
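The two steps the patch performs for an external command (split the configured string into an argument vector, then fork, exec and wait) can be sketched outside the backend as follows. This is a minimal sketch under stated assumptions, not the patch's code: `build_argv`, `run_argv` and `MAX_CMD_WORDS` are hypothetical names, plain `fork()` stands in for PostgreSQL's `fork_process()`, failures are reported through return values rather than `ereport()`, and quoting inside the command string is not supported.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CMD_WORDS 10        /* hypothetical limit on words in cmd */

/*
 * Split cmd on spaces into argv, then append extra_arg (if not NULL) and
 * a NULL terminator. argv must have room for MAX_CMD_WORDS + 2 entries.
 * *buf receives the writable copy the tokens point into; the caller
 * frees it. Returns the argument count, or -1 if cmd is empty, has too
 * many words, or memory runs out.
 */
int
build_argv(const char *cmd, const char *extra_arg, char **argv, char **buf)
{
    int         argc = 0;
    char       *token;

    assert(cmd != NULL && argv != NULL && buf != NULL);

    *buf = strdup(cmd);
    if (*buf == NULL)
        return -1;

    for (token = strtok(*buf, " "); token != NULL; token = strtok(NULL, " "))
    {
        /* check the bound before storing, so argv can never overflow */
        if (argc >= MAX_CMD_WORDS)
        {
            free(*buf);
            *buf = NULL;
            return -1;
        }
        argv[argc++] = token;
    }

    if (argc == 0)
    {
        free(*buf);
        *buf = NULL;
        return -1;
    }

    if (extra_arg != NULL)
        argv[argc++] = (char *) extra_arg;
    argv[argc] = NULL;
    return argc;
}

/*
 * Run argv[0] with the given arguments and wait for it. Returns the
 * child's exit code (127 if execv failed), or -1 if fork or waitpid
 * failed or the child was killed by a signal.
 */
int
run_argv(char *const argv[])
{
    pid_t       pid;
    int         status;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        execv(argv[0], argv);
        _exit(127);             /* only reached if execv failed */
    }

    while (waitpid(pid, &status, 0) < 0)
    {
        if (errno != EINTR)
            return -1;
    }

    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;
}
```

With a helper of this shape, the copy path and the delete path could share one fork/exec/wait routine. Decoding the wait status with `WIFEXITED` and `WEXITSTATUS` also separates a non-zero exit code from death by signal, which a bare `if (status)` test does not.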
Re: [RFC] ideas for a new Python DBAPI driver (was Re: [HACKERS] libpq test suite)
A number of the described features sound quite useful. Is it not practical
to extend an existing library such as psycopg2? What method will you use to
call libpq functions? As you are no doubt aware, psycopg2 uses the
traditional CPython API, but there is a fork which uses ctypes instead,
enabling it to work with both CPython and PyPy.

Manlio Perillo wrote:
> For my Python DBAPI2 PostgreSQL driver I plan the following
> optimizations:
>
> 2) when the driver detects a Python string is being sent to the
>    database, use binary format.

What exactly do you mean by "Python string"?

--
Jonathan Ross Rogers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Btrfs clone WIP patch
This patch against PostgreSQL 9.1.8 takes advantage of efficient file
cloning on Linux Btrfs file systems to make CREATE DATABASE operations
extremely fast regardless of the size of the database used as a template.
On my system, I can create a database from a multi-gibibyte template in a
second or less. This is very useful for automated testing as well as in a
development environment where reverting to a baseline database is
frequently required. As an added bonus, newly created databases require
very little additional disk storage until they diverge from the template.

The efficient cloning is accomplished by a Btrfs-specific ioctl() call. On
non-Linux systems or if the ioctl() call fails, file contents are copied in
the conventional way so no configuration is needed. This has been tested on
a Linux system on both Btrfs and XFS file systems as well as an OSX system.

The clone_file() function was originally copied from GNU coreutils which is
under GPL v3. The function is currently only about ten lines long and
contains little essential information beyond the magic values needed for
the ioctl() call so I'm not sure if license is a problem.

--
Jonathan Ross Rogers

diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 6cfb816..719a5c1 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,6 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#ifdef HAVE_SYS_IOCTL_H
+#include <sys/ioctl.h>
+#endif
+
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
@@ -139,6 +143,24 @@ copydir(char *fromdir, char *todir, bool recurse)
 }
 
 /*
+ * Perform the O(1) btrfs clone operation, if possible.
+ * Upon success, return 0.  Otherwise, return -1.
+ */
+static inline int
+clone_file (int dest_fd, int src_fd)
+{
+#ifdef __linux__
+# define BTRFS_IOCTL_MAGIC 0x94
+# define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int)
+  return ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);
+#else
+  (void) dest_fd;
+  (void) src_fd;
+  return -1;
+#endif
+}
+
+/*
  * copy one file
  */
 void
@@ -150,11 +172,6 @@ copy_file(char *fromfile, char *tofile)
     int         nbytes;
     off_t       offset;
 
-    /* Use palloc to ensure we get a maxaligned buffer */
-#define COPY_BUF_SIZE (8 * BLCKSZ)
-
-    buffer = palloc(COPY_BUF_SIZE);
-
     /*
      * Open the files
      */
@@ -171,38 +188,54 @@ copy_file(char *fromfile, char *tofile)
                 (errcode_for_file_access(),
                  errmsg("could not create file \"%s\": %m", tofile)));
 
-    /*
-     * Do the data copying.
-     */
-    for (offset = 0;; offset += nbytes)
+    if (clone_file (dstfd, srcfd) == 0)
+        ereport(DEBUG1, (errmsg("Cloned \"%s\" to \"%s\".", fromfile, tofile)));
+
+    else
     {
-        /* If we got a cancel signal during the copy of the file, quit */
-        CHECK_FOR_INTERRUPTS();
+        /*
+         * Do the data copying.
+         */
 
-        nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
-        if (nbytes < 0)
-            ereport(ERROR,
-                    (errcode_for_file_access(),
-                     errmsg("could not read file \"%s\": %m", fromfile)));
-        if (nbytes == 0)
-            break;
-        errno = 0;
-        if ((int) write(dstfd, buffer, nbytes) != nbytes)
+        /* Use palloc to ensure we get a maxaligned buffer */
+#define COPY_BUF_SIZE (8 * BLCKSZ)
+
+        buffer = palloc(COPY_BUF_SIZE);
+
+        ereport(DEBUG1, (errmsg("Copying \"%s\" to \"%s\" in userspace.",
+                                fromfile, tofile)));
+        for (offset = 0;; offset += nbytes)
         {
-            /* if write didn't set errno, assume problem is no disk space */
-            if (errno == 0)
-                errno = ENOSPC;
-            ereport(ERROR,
-                    (errcode_for_file_access(),
-                     errmsg("could not write to file \"%s\": %m", tofile)));
+            /* If we got a cancel signal during the copy of the file, quit */
+            CHECK_FOR_INTERRUPTS();
+
+            nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
+            if (nbytes < 0)
+                ereport(ERROR,
+                        (errcode_for_file_access(),
+                         errmsg("could not read file \"%s\": %m", fromfile)));
+            if (nbytes == 0)
+                break;
+            errno = 0;
+            if ((int) write(dstfd, buffer, nbytes) != nbytes)
+            {
+                /* if write didn't set errno, assume problem is no disk space */
+                if (errno == 0)
+                    errno = ENOSPC;
+                ereport(ERROR,
+                        (errcode_for_file_access(),
+                         errmsg("could not write to file \"%s\": %m", tofile)));
+            }
+
+            /*
+             * We fsync the files later but first flush them to avoid spamming the
+             * cache and hopefully get the kernel to start writing them out before
+             * the fsync comes.
+             */
+            pg_flush_data(dstfd, offset, nbytes);
         }
 
-        /*
-         * We fsync the files later but first flush them to avoid spamming the
-         * cache and hopefully get the kernel to start writing them out before
-         * the fsync comes.
-         */
-        pg_flush_data(dstfd, offset, nbytes);
+        pfree(buffer);
     }
 
     if (close(dstfd))
@@ -212,7 +245,6 @@ copy_file(char *fromfile, char *tofile)
 
     close(srcfd);
 
-    pfree(buffer);
 }
Re: [HACKERS] Btrfs clone WIP patch
Tom Lane wrote:
> Jonathan Rogers jrog...@socialserve.com writes:
>> This patch against PostgreSQL 9.1.8 takes advantage of efficient file
>> cloning on Linux Btrfs file systems to make CREATE DATABASE operations
>> extremely fast regardless of the size of the database used as a
>> template.
>
> It would be easier to review this patch if the bulk of it weren't simple
> reindentation of existing code. (Or at least it ought to be that --- I
> object to your having moved the buffer palloc inside the loop. A patch
> that is trying to optimize a minority case can expect to be rejected if
> it makes things worse for everyone else.)

The buffer allocation is actually not inside the loop, but inside the if
branch for ordinary copying behavior since the buffer is unnecessary in the
case of a successful clone.

> Consider whether you can't phrase the patch to avoid that, perhaps by use
> of "continue" instead of an else-block. Alternatively, enclose the
> existing code in braces but don't reindent it, ie,
>
> +	if (whatever)
> +		... new code ...
> +	else
> +	{
> 		... existing code ...
> +	}

Indeed, I was bothered by the need to reindent so much as well. I'll see if
I can do better.

> The next pgindent run will fix the funny indentation, or the committer
> can do it if he wishes after reviewing.
>
>> The efficient cloning is accomplished by a Btrfs-specific ioctl() call.
>
> The big-picture question of course is whether we want to carry and
> maintain a filesystem-specific hack. I don't have a sense that btrfs is
> so widely used as to justify this.

Yes, this is a problem I considered. I think the basic problem is the lack
of any kind of generic interface to copy or clone a file. A system call for
Linux to copy or clone has been proposed more than once but so far, nothing
has been accepted. I believe there are a few file systems that support some
kind of efficient cloning, but I haven't investigated it deeply.

>> +#ifdef __linux__
>> +# define BTRFS_IOCTL_MAGIC 0x94
>> +# define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int)
>> +  return ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);
>> +#else
>
> This seems to me to be unacceptable on its face. If we can't get these
> constants out of a system header file, it's unlikely that the feature is
> stable enough to depend on, if indeed it's meant for general-purpose use
> at all. We could easily end up invoking unexpected behaviors.

Of course you're right that defining values right there is no good. It
looks like the values are in the Linux headers since 2.6.32 when Btrfs was
merged into mainline. I guess I'll need to brush up on CPP to figure out
how to use the Linux header values if they exist. Would it be better to
move clone_file() into its own module where implementations for other file
system types might eventually be added?

My first implementation called cp with the --reflink=auto option since that
seems to be the closest thing to a file system agnostic interface. The
above snippet comes directly from the GNU cp source and I'm not sure why
that code defines the values instead of taking them from Linux headers.

--
Jonathan Ross Rogers
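One way to take the ioctl request from a system header instead of defining it locally is to test for the macro and fall back to an ordinary copy when it is absent. The sketch below is an illustration under stated assumptions, not the patch: it assumes kernel headers that define `FICLONE` in `<linux/fs.h>`, a generic name for the same clone request that newer kernels provide and that was not available when this thread was written. `clone_or_copy` and `copy_file_path` are hypothetical names, and the backend's interrupt checks, fsync handling and error reporting are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

#ifdef __linux__
#include <linux/fs.h>           /* defines FICLONE where the kernel has it */
#endif

/* Try to clone src_fd into dest_fd. Returns 0 on success, -1 otherwise. */
static int
clone_file(int dest_fd, int src_fd)
{
#ifdef FICLONE
    return ioctl(dest_fd, FICLONE, src_fd);
#else
    (void) dest_fd;
    (void) src_fd;
    errno = ENOSYS;             /* no clone request known at build time */
    return -1;
#endif
}

/*
 * Copy src_fd to dest_fd, cloning when the file system allows it.
 * Returns 1 if cloned, 0 if copied through a buffer, -1 on error.
 */
int
clone_or_copy(int dest_fd, int src_fd)
{
    char        buf[8192];

    assert(dest_fd >= 0 && src_fd >= 0);

    if (clone_file(dest_fd, src_fd) == 0)
        return 1;

    /* Clone not possible here: fall back to read/write. */
    for (;;)
    {
        ssize_t     nread = read(src_fd, buf, sizeof(buf));
        ssize_t     off = 0;

        if (nread < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (nread == 0)
            break;              /* end of file */

        while (off < nread)
        {
            ssize_t     nw = write(dest_fd, buf + off, (size_t) (nread - off));

            if (nw < 0)
            {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nw;
        }
    }
    return 0;
}

/* Open both paths and copy; returns what clone_or_copy returns, or -1. */
int
copy_file_path(const char *from, const char *to)
{
    int         src = open(from, O_RDONLY);
    int         dst;
    int         rc;

    if (src < 0)
    {
        perror(from);
        return -1;
    }
    dst = open(to, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (dst < 0)
    {
        perror(to);
        close(src);
        return -1;
    }
    rc = clone_or_copy(dst, src);
    if (close(dst) != 0)
        rc = -1;
    close(src);
    return rc;
}
```

Because the request constant comes from the header, a build against headers that lack it simply compiles the fallback path, and a run on a file system without clone support takes the same path when the ioctl fails.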
Re: [HACKERS] Btrfs clone WIP patch
Josh Berkus wrote:
> On 02/13/2013 02:13 PM, Tom Lane wrote:
>> The big-picture question of course is whether we want to carry and
>> maintain a filesystem-specific hack. I don't have a sense that btrfs is
>> so widely used as to justify this.
>
> If this is a valuable hack, it seems like it could work on ZFS as well.
> If we could make it for any snapshot-capable filesystem, and not just
> BTRFS, then it would make more sense.

Yes, that's exactly what I hope will be possible. I need to investigate ZFS
to see if it's feasible.

--
Jonathan Ross Rogers