Re: cmake
Hi, responding to all these at once. - Original Message - > From: "Yehuda Sadeh-Weinraub" <yeh...@redhat.com> > To: "Sage Weil" <sw...@redhat.com> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org> > Sent: Wednesday, December 16, 2015 1:45:54 PM > Subject: Re: cmake > > On Wed, Dec 16, 2015 at 9:33 AM, Sage Weil <sw...@redhat.com> wrote: > > The work to transition to cmake has stalled somewhat. I've tried to use > > it a few times but keep running into issues that make it unusable for me. > > Not having make check is a big one, but I think the hackery required to > > get that going points to the underlying problem(s). I'm going to push for the cmake work already in progress to be moved to the next milestone ASAP. With respect to the "make check" blockers, which include the issue of where cmake puts built objects: Ali, Casey, and I discussed this today at some length. We think the current "hackery" to make cmake's make check work "the same way" auto* did is undesirable long-term, because it mutates files in the src dir. I have not assumed that it would be an improvement to put all objects built in a tree of submakes into a single dir, as automake does. I do think it is essential that, at least eventually, it be simple to operate on any object that is built, and simple to extend processes like make check. Ali and Casey agree, but contend that the current make check work is "almost finished"--specifically, that it could be finished and a PR sent -this week-. Rewriting it would take additional time. They propose starting with finishing and documenting the current setup, then doing a larger cleanup. What do others think? Matt > > > > It seems like the main problem is that automake puts all build targets in > > src/ and cmake spreads them all over build/*. This means that you can't > > just add ./ to anything that would normally be in your path (or, > > PATH=.:$PATH, and then run, say, ../qa/workunits/cephtool/test.sh). 
> > There's a bunch of kludges in vstart.sh to make it work that I think > > mostly point to this issue (and the .libs things). Is there simply an > > option we can give cmake to make it put built binaries directly in build/? > > > > Stepping back a bit, it seems like the goals should be > > > > 1. Be able to completely replace autotools. I don't fancy maintaining > > both in parallel. > > > > Is cmake a viable option in all environments we expect ceph (or any > part of it) to be compiled on? (e.g. aix, solaris, freebsd, different > linux arm distros, etc.) One cannot expect cmake to be pre-installed on those platforms, but it will work on every one you mentioned, and some others, not to mention Windows. > > > 2. Be able to run vstart etc from the build dir. > > There's an awful hack currently in vstart.sh and stop.sh that checks > for CMakeCache.txt in the current work directory to verify whether we > built using cmake or autotools. Can we make this go away? > We can do something like having the build system create a > 'ceph-setenv.sh' script that would set the env (or open a shell) with > the appropriate paths. > > > > > 3. Be able to run ./ceph[-anything] from the build dir, or put the build > > dir in the path. (I suppose we could rely on a make install step, but > > that seems like more hassle... hopefully it's not necessary?) > > > > 4. make check has to work > > > > 5. Use make-dist.sh to generate a release tarball (not make dist) > > > > 6. gitbuilders use make-dist.sh and cmake to build packages > > > > 7. release process uses make-dist.sh and cmake to build a release > > > > I'm probably missing something? > > > > Should we set a target of doing the 10.0.2 or .3 with cmake? 
> > > > sage > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-707-0660 fax. 734-769-8938 cel. 734-216-5309 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
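[Editor's note on Sage's question of an "option we can give cmake to make it put built binaries directly in build/": cmake does have standard output-location variables for this; whether they drop into Ceph's tree without further untangling is an assumption on my part. A sketch for the top of the top-level CMakeLists.txt:

```cmake
# Hedged sketch, not tested against the ceph tree: collect all build
# products under predictable directories in the build tree, instead of
# letting each subdirectory's targets land next to its own CMakeFiles.
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
```

With that in place, PATH=$PWD/bin:$PATH from the build dir would approximate the old add-./-to-PATH workflow; whether it also removes the .libs kludges in vstart.sh would need checking.]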
Re: Improving Data-At-Rest encryption in Ceph
Hi, Thanks for this detailed response. - Original Message - > From: "Lars Marowsky-Bree" <l...@suse.com> > To: "Ceph Development" <ceph-devel@vger.kernel.org> > Sent: Tuesday, December 15, 2015 9:23:04 AM > Subject: Re: Improving Data-At-Rest encryption in Ceph > > It's not yet perfect, but I think the approach is superior to being > implemented in Ceph natively. If there's any encryption that should be > implemented in Ceph, I believe it'd be the on-the-wire encryption to > protect against eavesdroppers. ++ > > Other scenarios would require client-side encryption. ++ > > > Cryptographic keys are stored on the filesystem of the storage node that hosts > > OSDs. Changing them requires redeploying the OSDs. > > This is solvable by storing the key on an external key server. ++ > > Changing the key is only necessary if the key has been exposed. And with > dm-crypt, that's still possible - it's not the actual encryption key > that's stored, but the secret that is needed to unlock it, and that can > be re-encrypted quite fast. (In theory; it's not implemented yet for > the Ceph OSDs.) > > > > Data incoming from Ceph clients would be encrypted by the primary OSD. It > > would replicate ciphertext to non-primary members of an acting set. > > This still exposes data in coredumps or on swap on the primary OSD, and > metadata on the secondaries. > > > Regards, > Lars > > -- > Architect Storage/HA > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB > 21284 (AG Nürnberg) > "Experience is the name everyone gives to their mistakes." -- Oscar Wilde > --
Re: queue_transaction interface + unique_ptr + performance
++ #1 - Original Message - > From: "Sage Weil" <s...@newdream.net> > To: "Somnath Roy" <somnath@sandisk.com> > Cc: "Samuel Just (sam.j...@inktank.com)" <sam.j...@inktank.com>, > ceph-devel@vger.kernel.org > Sent: Thursday, December 3, 2015 6:50:26 AM > Subject: RE: queue_transaction interface + unique_ptr + performance > > 1- I agree we should avoid shared_ptr whenever possible. > > > sage
Re: cmake
I always run cmake from a build directory which is not the root, usually "build" in the root, so my minimal invocation would be "mkdir build; cd build; cmake ../src"--I'd at least try that, though I wouldn't have thought build location could affect something this basic (and it would be a bug). Matt - Original Message - > From: "Pete Zaitcev" <zait...@redhat.com> > To: ceph-devel@vger.kernel.org > Sent: Thursday, December 3, 2015 5:24:36 PM > Subject: cmake > > Dear All: > > I'm trying to run cmake, in order to make sure my patches do not break it > (in particular WIP 5073 added source files). Result looks like this: > > [zaitcev@lembas ceph-tip]$ cmake src > -- The C compiler identification is GNU 5.1.1 > -- The CXX compiler identification is GNU 5.1.1 > -- Check for working C compiler: /usr/bin/cc > -- Check for working C compiler: /usr/bin/cc -- works > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Detecting C compile features > -- Detecting C compile features - done > -- Check for working CXX compiler: /usr/bin/c++ > -- Check for working CXX compiler: /usr/bin/c++ -- works > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Detecting CXX compile features > -- Detecting CXX compile features - done > CMake Error at CMakeLists.txt:1 (include): > include could not find load file: > > GetGitRevisionDescription > > > -- The ASM compiler identification is GNU > -- Found assembler: /usr/bin/cc > CMake Warning (dev) at CMakeLists.txt:11 (add_definitions): > Policy CMP0005 is not set: Preprocessor definition values are now escaped > automatically. Run "cmake --help-policy CMP0005" for policy details. Use > the cmake_policy command to set the policy and suppress this warning. > This warning is for project developers. Use -Wno-dev to suppress it. > > CMake Warning (dev) at CMakeLists.txt:12 (add_definitions): > Policy CMP0005 is not set: Preprocessor definition values are now escaped > automatically. 
Run "cmake --help-policy CMP0005" for policy details. Use > the cmake_policy command to set the policy and suppress this warning. > This warning is for project developers. Use -Wno-dev to suppress it. > > -- we do not have a modern/working yasm > -- Performing Test COMPILER_SUPPORTS_CXX11 > -- Performing Test COMPILER_SUPPORTS_CXX11 - Success > CMake Error at CMakeLists.txt:95 (get_git_head_revision): > Unknown CMake command "get_git_head_revision". > > > CMake Warning (dev) in CMakeLists.txt: > No cmake_minimum_required command is present. A line of code such as > > cmake_minimum_required(VERSION 3.3) > > should be added at the top of the file. The version specified may be lower > if you wish to support older CMake versions for this project. For more > information run "cmake --help-policy CMP". > This warning is for project developers. Use -Wno-dev to suppress it. > > -- Configuring incomplete, errors occurred! > See also "/q/zaitcev/ceph/ceph-tip/CMakeFiles/CMakeOutput.log". > [zaitcev@lembas ceph-tip]$ rpm -qa | grep -i cmake > extra-cmake-modules-5.16.0-1.fc23.noarch > cmake-3.3.2-1.fc23.x86_64 > [zaitcev@lembas ceph-tip]$ > > Is this expected? Is my cmake incantation wrong? > > Thanks, > -- Pete > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: cmake
sorry, "cmake .." for Ceph's setup. Matt - Original Message - > From: "Matt Benjamin" <mbenja...@redhat.com> > To: "Pete Zaitcev" <zait...@redhat.com> > Cc: ceph-devel@vger.kernel.org > Sent: Thursday, December 3, 2015 5:30:28 PM > Subject: Re: cmake > > I always run cmake from a build directory which is not the root, usually > "build" in the root, so my minimal invocation would be "mkdir build; cd > build; cmake ../src"--I'd at least try that, though I wouldn't have thought > build location could affect something this basic (and it would be a bug). > > Matt > > - Original Message - > > From: "Pete Zaitcev" <zait...@redhat.com> > > To: ceph-devel@vger.kernel.org > > Sent: Thursday, December 3, 2015 5:24:36 PM > > Subject: cmake > > > > Dear All: > > > > I'm trying to run cmake, in order to make sure my patches do not break it > > (in particular WIP 5073 added source files). Result looks like this: > > > > [zaitcev@lembas ceph-tip]$ cmake src > > -- The C compiler identification is GNU 5.1.1 > > -- The CXX compiler identification is GNU 5.1.1 > > -- Check for working C compiler: /usr/bin/cc > > -- Check for working C compiler: /usr/bin/cc -- works > > -- Detecting C compiler ABI info > > -- Detecting C compiler ABI info - done > > -- Detecting C compile features > > -- Detecting C compile features - done > > -- Check for working CXX compiler: /usr/bin/c++ > > -- Check for working CXX compiler: /usr/bin/c++ -- works > > -- Detecting CXX compiler ABI info > > -- Detecting CXX compiler ABI info - done > > -- Detecting CXX compile features > > -- Detecting CXX compile features - done > > CMake Error at CMakeLists.txt:1 (include): > > include could not find load file: > > > > GetGitRevisionDescription > > > > > > -- The ASM compiler identification is GNU > > -- Found assembler: /usr/bin/cc > > CMake Warning (dev) at CMakeLists.txt:11 (add_definitions): > > Policy CMP0005 is not set: Preprocessor definition values are now escaped > > automatically. 
Run "cmake --help-policy CMP0005" for policy details. > > Use > > the cmake_policy command to set the policy and suppress this warning. > > This warning is for project developers. Use -Wno-dev to suppress it. > > > > CMake Warning (dev) at CMakeLists.txt:12 (add_definitions): > > Policy CMP0005 is not set: Preprocessor definition values are now escaped > > automatically. Run "cmake --help-policy CMP0005" for policy details. > > Use > > the cmake_policy command to set the policy and suppress this warning. > > This warning is for project developers. Use -Wno-dev to suppress it. > > > > -- we do not have a modern/working yasm > > -- Performing Test COMPILER_SUPPORTS_CXX11 > > -- Performing Test COMPILER_SUPPORTS_CXX11 - Success > > CMake Error at CMakeLists.txt:95 (get_git_head_revision): > > Unknown CMake command "get_git_head_revision". > > > > > > CMake Warning (dev) in CMakeLists.txt: > > No cmake_minimum_required command is present. A line of code such as > > > > cmake_minimum_required(VERSION 3.3) > > > > should be added at the top of the file. The version specified may be > > lower > > if you wish to support older CMake versions for this project. For more > > information run "cmake --help-policy CMP". > > This warning is for project developers. Use -Wno-dev to suppress it. > > > > -- Configuring incomplete, errors occurred! > > See also "/q/zaitcev/ceph/ceph-tip/CMakeFiles/CMakeOutput.log". > > [zaitcev@lembas ceph-tip]$ rpm -qa | grep -i cmake > > extra-cmake-modules-5.16.0-1.fc23.noarch > > cmake-3.3.2-1.fc23.x86_64 > > [zaitcev@lembas ceph-tip]$ > > > > Is this expected? Is my cmake incantation wrong? > > > > Thanks, > > -- Pete > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > -- > Matt Benjamin > Red Hat, Inc. 
> 315 West Huron Street, Suite 140A > Ann Arbor, Michigan 48103 > > http://www.redhat.com/en/technologies/storage > > tel. 734-707-0660 > fax. 734-769-8938 > cel. 734-216-5309 >
Re: cmake
Pete, Could you share the branch you are trying to build? (ceph/wip-5073 would not appear to be it.) Matt - Original Message - > From: "Pete Zaitcev" <zait...@redhat.com> > To: "Adam C. Emerson" <aemer...@redhat.com> > Cc: ceph-devel@vger.kernel.org > Sent: Thursday, December 3, 2015 7:03:47 PM > Subject: Re: cmake > > On Thu, 3 Dec 2015 17:30:21 -0500 > "Adam C. Emerson" <aemer...@redhat.com> wrote: > > > On 03/12/2015, Pete Zaitcev wrote: > > > > I'm trying to run cmake, in order to make sure my patches do not break it > > > (in particular WIP 5073 added source files). Result looks like this: > > > > > > [zaitcev@lembas ceph-tip]$ cmake src > > > > I believe the problem is 'cmake src' > > Thanks for the tip about the separate build directory and the top-level > CMakeLists.txt. However, it still fails like this: > > [zaitcev@lembas build]$ cmake .. > CMake Error at CMakeLists.txt:1 (include): > include could not find load file: > > GetGitRevisionDescription > ... > > Do you know by any chance where it gets that include? Also, what's > your cmake --version? > > Greetings, > -- Pete > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
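[Editor's note, for anyone else hitting the GetGitRevisionDescription error above: that file is not part of cmake itself; it is a project-local module the top-level CMakeLists.txt has to be able to find, so the failure usually means the checkout is missing the module files (wrong branch) rather than a broken cmake install. A sketch of what the including side needs; the module directory path here is a guess, check the branch in question:

```cmake
# GetGitRevisionDescription.cmake ships with the project, not with cmake.
# The top-level CMakeLists.txt must put its directory on the module path
# before the include; the exact path below is an assumption.
cmake_minimum_required(VERSION 2.8.11)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake/modules")
include(GetGitRevisionDescription)
# The module defines get_git_head_revision(<refspec-var> <sha1-var>).
get_git_head_revision(GIT_REFSPEC GIT_SHA1)
```

The cmake_minimum_required line would also quiet the "No cmake_minimum_required command is present" warning in the log above.]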
Re: ack vs commit
the same > > portion of the file and sees the file content from before client A's > > change. The MDS is extremely careful about this on the metadata side: no > > side-effects of one client are visible to any other client until they are > > durable, so that a combination MDS and client failure will never make > > things appear to go back in time. > > > > Any opinions here? My inclination is to remove the functionality (less > > code, less complexity, more sane semantics), but we'd be closing the door > > on what might have been a half-decent idea (separating serialization from > > durability when multiple clients have the same file open for > > read/write)... > > I've considered this briefly in the past, but I'd really rather we keep it: > > 1) While we don't make much use of it right now, I think it's a useful > feature for raw RADOS users > > 2) It's an incredibly useful protocol semantic for future performance > work of the sort that makes Sam start to cry, but which I find very > interesting. Consider a future when RBD treats the OSDs more like a > disk with a cache, and is able to send out operations to get them out > of local memory, without forcing them instantly to permanent storage. > Similarly, I think as soon as we have a backend that lets us use a bit > of 3D Crosspoint in a system, we'll wish we had this functionality > again. > (Likewise with CephFS, where many many users will have different > consistency expectations thanks to NFS and other parallel FSes which > really aren't consistent.) > > Maybe we just think it's not worth it and we'd rather throw this out. > But the only complexity this really lets us drop is the OSD replay > stuff (which realistically I can't assess) — the dual acks is not hard > to work with, and doesn't get changed much anyway. 
If we drop this now > and decide to bring back any similar semantic in the future I think > that'll be a lot harder than simply carrying it, both in terms of > banging it back into the code, and especially in terms of getting it > deployed to all the clients in the world again. > -Greg > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
For hacking around, put "Graceless = true;" in the NFSV4 block. Matt - Original Message - > From: "Daniel Gryniewicz" <d...@redhat.com> > To: "John Spray" <jsp...@redhat.com> > Cc: "Ceph Development" <ceph-devel@vger.kernel.org>, "Stefan Hajnoczi" > <shajn...@redhat.com> > Sent: Friday, October 23, 2015 12:34:42 PM > Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha) > > On Fri, Oct 23, 2015 at 9:27 AM, John Spray <jsp...@redhat.com> wrote: > > * NFS writes from the guest are lagging for like a minute before > > completing, my hunch is that this is something in the NFS client > > recovery stuff (in ganesha) that's not coping with vsock, the > > operations seem to complete at the point where the server declares > > itself "NOT IN GRACE". > > > Ganesha always starts in Grace, and will not process new clients until > it exits Grace. Existing clients should re-connect fine, and new > clients work fine after Grace is exited. > > Dan > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
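[Editor's note: spelled out, the ganesha.conf block being discussed looks roughly like this. Graceless skips the NFSv4 grace period entirely, so the server takes new clients immediately but gives up reclaim for existing ones; as the message says, this is for hacking/test rigs only:

```
NFSV4
{
    # Skip the NFSv4 grace period on startup: new clients are serviced
    # immediately, at the cost of client state recovery. Test use only.
    Graceless = true;
}
```
]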
Re: newstore direction
We mostly assumed that sort-of-transactional file systems, perhaps hosted in user space, were the most tractable trajectory. I have seen newstore and keyvalue store as essentially congruent approaches using database primitives (and I am interested in what you make of Russell Sears). I'm skeptical of any hope of keeping things "simple." Like Martin downthread, most systems I have seen (filers, ZFS) make use of a fast, durable commit log and then flex out...something else. - Original Message - > From: "Sage Weil" <sw...@redhat.com> > To: "John Spray" <jsp...@redhat.com> > Cc: "Ceph Development" <ceph-devel@vger.kernel.org> > Sent: Tuesday, October 20, 2015 4:00:23 PM > Subject: Re: newstore direction > > On Tue, 20 Oct 2015, John Spray wrote: > > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sw...@redhat.com> wrote: > > > - We have to size the kv backend storage (probably still an XFS > > > partition) vs the block storage. Maybe we do this anyway (put metadata > > > on > > > SSD!) so it won't matter. But what happens when we are storing gobs of > > > rgw index data or cephfs metadata? Suddenly we are pulling storage out > > > of > > > a different pool and those aren't currently fungible. > > > > This is the concerning bit for me -- the other parts one "just" has to > > get the code right, but this problem could linger and be something we > > have to keep explaining to users indefinitely. It reminds me of cases > > in other systems where users had to make an educated guess about inode > > size up front, depending on whether you're expecting to efficiently > > store a lot of xattrs. > > > > In practice it's rare for users to make these kinds of decisions well > > up-front: it really needs to be adjustable later, ideally > > automatically. 
That could be pretty straightforward if the KV part > > was stored directly on block storage, instead of having XFS in the > > mix. I'm not quite up with the state of the art in this area: are > > there any reasonable alternatives for the KV part that would consume > > some defined range of a block device from userspace, instead of > > sitting on top of a filesystem? > > I agree: this is my primary concern with the raw block approach. > > There are some KV alternatives that could consume block, but the problem > would be similar: we need to dynamically size up or down the kv portion of > the device. > > I see two basic options: > > 1) Wire into the Env abstraction in rocksdb to provide something just > smart enough to let rocksdb work. It isn't much: named files (not that > many--we could easily keep the file table in ram), always written > sequentially, to be read later with random access. All of the code is > written around abstractions of SequentialFileWriter so that everything > posix is neatly hidden in env_posix (and there are various other env > implementations for in-memory mock tests etc.). > > 2) Use something like dm-thin to sit between the raw block device and XFS > (for rocksdb) and the block device consumed by newstore. As long as XFS > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb > files in their entirety) we can fstrim and size down the fs portion. If > we similarly make newstores allocator stick to large blocks only we would > be able to size down the block portion as well. Typical dm-thin block > sizes seem to range from 64KB to 512KB, which seems reasonable enough to > me. In fact, we could likely just size the fs volume at something > conservatively large (like 90%) and rely on -o discard or periodic fstrim > to keep its actual utilization in check. 
> > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
Hi Bruce, - Original Message - > From: "J. Bruce Fields" <bfie...@redhat.com> > To: "Matt Benjamin" <mbenja...@redhat.com> > Cc: "Ceph Development" <ceph-devel@vger.kernel.org>, "Stefan Hajnoczi" > <stefa...@redhat.com>, "Sage Weil" > <sw...@redhat.com> > Sent: Monday, October 19, 2015 11:58:45 AM > Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha) > > On Mon, Oct 19, 2015 at 11:49:15AM -0400, Matt Benjamin wrote: > > - Original Message - > > > From: "J. Bruce Fields" <bfie...@redhat.com> > ... > > > > > > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote: > > > > Hi devs (CC Bruce--here is a use case for vmci sockets transport) > > > > > > > > One of Sage's possible plans for Manila integration would use nfs over > > > > the > > > > new Linux vmci sockets transport integration in qemu (below) to access > > > > Cephfs via an nfs-ganesha server running in the host vm. > > > > > > What does "the host vm" mean, and why is this a particularly useful > > > configuration? > > > > Sorry, I should say, "the vm host." > > Got it, thanks! > > > I think the claimed utility here is (at least) three-fold: > > > > 1. simplified configuration on host and guests > > 2. some claim to improved security through isolation > > So why is it especially interesting to put Ceph inside the VM and > Ganesha outside? Oh, sorry. Here Ceph (or Gluster, or whatever underlying FS provider) is conceptually outside the vm complex altogether, Ganesha is re-exporting on the vm host, and guests access the namespace using NFS(v41). Regards, Matt > > > 3. some expectation of improved latency/performance wrt TCP > > > > Stefan sent a link to a set of slides with his original patches. Did you > > get a chance to read through those? 
> > > > [1] > > http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf > > Yep, thanks.--b. > > > > > Regards, > > > > Matt > > > > > > > > --b. > > > > > > > > > > > This now experimentally works. > > > > > > > > some notes on running nfs-ganesha over AF_VSOCK: > > > > > > > > 1. need stefan hajnoczi's patches for > > > > * linux kernel (and build w/vhost-vsock support > > > > * qemu (and build w/vhost-vsock support) > > > > * nfs-utils (in vm guest) > > > > > > > > all linked from https://github.com/stefanha?tab=repositories > > > > > > > > 2. host and vm guest kernels must include vhost-vsock > > > > * host kernel should load vhost-vsock.ko > > > > > > > > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci > > > > device, e.g > > > > > > > > /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1 > > > > --enable-kvm -drive > > > > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive > > > > file=/opt/isos/f22.iso,media=cdrom -net > > > > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 > > > > -parallel none -serial mon:stdio -device > > > > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4 -boot c > > > > > > > > 4. nfs-gansha (in host) > > > > * need nfs-ganesha and its ntirpc rpc provider with vsock support > > > > https://github.com/linuxbox2/nfs-ganesha (vsock branch) > > > > https://github.com/linuxbox2/ntirpc (vsock branch) > > > > > > > > * configure ganesha w/vsock support > > > > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON > > > > -DUSE_VSOCK > > > > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src > > > > > > > > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block > > > > > > > > 5. mount in guest w/nfs41: > > > > (e.g., in fstab) > > > > 2:// /vsock41 nfs > > > > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 > > > > 0 0 > > > > > > > > If you try this, send feedback. > > > > > > > > Thanks! 
> > > > > > > > Matt > > > > > > > > -- > > > > Matt Benjamin > > > > Red Hat, Inc. > > > > 315 West Huron Street, Suite 140A > > > > Ann Arbor, Michigan 48103 > > > > > > > > http://www.redhat.com/en/technologies/storage > > > > > > > > tel. 734-707-0660 > > > > fax. 734-769-8938 > > > > cel. 734-216-5309 > > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > > the body of a message to majord...@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
Hi Bruce, - Original Message - > From: "J. Bruce Fields" <bfie...@redhat.com> > To: "Matt Benjamin" <mbenja...@redhat.com> > Cc: "Ceph Development" <ceph-devel@vger.kernel.org>, "Stefan Hajnoczi" > <stefa...@redhat.com>, "Sage Weil" > <sw...@redhat.com> > Sent: Monday, October 19, 2015 11:13:52 AM > Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha) > > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote: > > Hi devs (CC Bruce--here is a use case for vmci sockets transport) > > > > One of Sage's possible plans for Manila integration would use nfs over the > > new Linux vmci sockets transport integration in qemu (below) to access > > Cephfs via an nfs-ganesha server running in the host vm. > > What does "the host vm" mean, and why is this a particularly useful > configuration? Sorry, I should say, "the vm host." I think the claimed utility here is (at least) three-fold: 1. simplified configuration on host and guests 2. some claim to improved security through isolation 3. some expectation of improved latency/performance wrt TCP Stefan sent a link to a set of slides with his original patches. Did you get a chance to read through those? [1] http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf Regards, Matt > > --b. > > > > > This now experimentally works. > > > > some notes on running nfs-ganesha over AF_VSOCK: > > > > 1. need stefan hajnoczi's patches for > > * linux kernel (and build w/vhost-vsock support) > > * qemu (and build w/vhost-vsock support) > > * nfs-utils (in vm guest) > > > > all linked from https://github.com/stefanha?tab=repositories > > > > 2. host and vm guest kernels must include vhost-vsock > > * host kernel should load vhost-vsock.ko > > > > 3. 
start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci > > device, e.g. > > > > /opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1 > > --enable-kvm -drive > > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive > > file=/opt/isos/f22.iso,media=cdrom -net > > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 > > -parallel none -serial mon:stdio -device > > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4 -boot c > > > > 4. nfs-ganesha (in host) > > * need nfs-ganesha and its ntirpc rpc provider with vsock support > > https://github.com/linuxbox2/nfs-ganesha (vsock branch) > > https://github.com/linuxbox2/ntirpc (vsock branch) > > > > * configure ganesha w/vsock support > > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK > > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src > > > > in ganesha.conf, add "nfsvsock" to the Protocols list in the EXPORT block > > > > 5. mount in guest w/nfs41: > > (e.g., in fstab) > > 2:// /vsock41 nfs > > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 > > 0 0 > > > > If you try this, send feedback. > > > > Thanks! > > > > Matt > > > > -- > > Matt Benjamin > > Red Hat, Inc. > > 315 West Huron Street, Suite 140A > > Ann Arbor, Michigan 48103 > > > > http://www.redhat.com/en/technologies/storage > > > > tel. 734-707-0660 > > fax. 734-769-8938 > > cel. 734-216-5309
nfsv41 over AF_VSOCK (nfs-ganesha)
Hi devs (CC Bruce--here is a use case for vmci sockets transport) One of Sage's possible plans for Manila integration would use nfs over the new Linux vmci sockets transport integration in qemu (below) to access Cephfs via an nfs-ganesha server running in the host vm. This now experimentally works. Some notes on running nfs-ganesha over AF_VSOCK: 1. need Stefan Hajnoczi's patches for * linux kernel (and build w/vhost-vsock support) * qemu (and build w/vhost-vsock support) * nfs-utils (in vm guest) all linked from https://github.com/stefanha?tab=repositories 2. host and vm guest kernels must include vhost-vsock * host kernel should load vhost-vsock.ko 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci device, e.g. /opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1 --enable-kvm -drive file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive file=/opt/isos/f22.iso,media=cdrom -net nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 -parallel none -serial mon:stdio -device vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4 -boot c 4. nfs-ganesha (in host) * need nfs-ganesha and its ntirpc rpc provider with vsock support https://github.com/linuxbox2/nfs-ganesha (vsock branch) https://github.com/linuxbox2/ntirpc (vsock branch) * configure ganesha w/vsock support cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src in ganesha.conf, add "nfsvsock" to the Protocols list in the EXPORT block 5. mount in guest w/nfs41: (e.g., in fstab) 2:// /vsock41 nfs noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 0 0 If you try this, send feedback. Thanks! Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-707-0660 fax. 734-769-8938 cel.
734-216-5309
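For reference, step 4's ganesha.conf change might look like the block below. Only the Protocols entry comes from the message above; the other fields are the stock EXPORT-block options with illustrative placeholder values (export id, paths, FSAL choice), not taken from the actual test setup.

```
EXPORT
{
    Export_Id = 1;              # illustrative
    Path = "/";                 # illustrative
    Pseudo = "/cephfs";         # illustrative
    Access_Type = RW;
    FSAL { Name = CEPH; }       # or VFS, per -DUSE_FSAL_VFS above
    # per step 4: add "nfsvsock" alongside the usual protocol versions
    Protocols = 4, nfsvsock;
}
```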
Re: libcephfs invalidate upcalls
Hi, -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 - Original Message - > From: "John Spray" <jsp...@redhat.com> > To: "Matt Benjamin" <mbenja...@redhat.com> > Cc: "Ceph Development" <ceph-devel@vger.kernel.org> > Sent: Monday, September 28, 2015 9:01:28 AM > Subject: Re: libcephfs invalidate upcalls > > On Sat, Sep 26, 2015 at 8:03 PM, Matt Benjamin <mbenja...@redhat.com> wrote: > > Hi John, > > > > I prototyped an invalidate upcall for libcephfs and the Ganesha Ceph fsal, > > building on the Client invalidation callback registrations. > > > > As you suggested, NFS (or AFS, or DCE) minimally expect a more generic > > "cached vnode may have changed" trigger than the current inode and dentry > > invalidates, so I extended the model slightly to hook cap revocation, > > feedback appreciated. > > In cap_release, we probably need to be a bit more discriminating about > when to drop, e.g. if we've only lost our exclusive write caps, the > rest of our metadata might all still be fine to cache. Is ganesha in > general doing any data caching? I think I had implicitly assumed that > we were only worrying about metadata here but now I realise I never > checked that. Ganesha isn't currently, though it did once, and is likely to again, at some point. The exclusive write cap is in fact something with a direct mapping to NFSv4 delegations, so we do want to be able to trigger a recall, in this case. > > The awkward part is Client::trim_caps. In the Client::trim_caps case, > the lru_is_expirable part won't be true until something has already > been invalidated, so there needs to be an explicit hook there -- > rather than invalidating in response to cap release, we need to > invalidate in order to get ganesha to drop its handle, which will > render something expirable, and finally when we expire it, the cap > gets released. Ok, sure.
> > In that case maybe we need a hook in ganesha to say "invalidate > everything you can" so that we don't have to make a very large number > of function calls to invalidate things. In the fuse/kernel case we > can only sometimes invalidate a piece of metadata (e.g. we can't if > it's flocked or whatever), so we ask it to invalidate everything. But > perhaps in the NFS case we can always expect our invalidate calls to > be respected, so we could just invalidate a smaller number of things > (the difference between actual cache size and desired)? As you noted above, what we're invalidating is a cache entry. With Dan's mdcache work, we might no longer be caching at the Ganesha level, but I didn't assume that here. Matt > > John > > > > > g...@github.com:linuxbox2/ceph.git , branch invalidate > > g...@github.com:linuxbox2/nfs-ganesha.git , branch ceph-invalidates > > > > thanks, > > > > Matt > > > > -- > > Matt Benjamin > > Red Hat, Inc. > > 315 West Huron Street, Suite 140A > > Ann Arbor, Michigan 48103 > > > > http://www.redhat.com/en/technologies/storage > > > > tel. 734-761-4689 > > fax. 734-769-8938 > > cel. 734-216-5309
libcephfs invalidate upcalls
Hi John, I prototyped an invalidate upcall for libcephfs and the Ganesha Ceph fsal, building on the Client invalidation callback registrations. As you suggested, NFS (or AFS, or DCE) minimally expect a more generic "cached vnode may have changed" trigger than the current inode and dentry invalidates, so I extended the model slightly to hook cap revocation, feedback appreciated. g...@github.com:linuxbox2/ceph.git , branch invalidate g...@github.com:linuxbox2/nfs-ganesha.git , branch ceph-invalidates thanks, Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309
Re: About Fio backend with ObjectStore API
It would be worth exploring async, sure. Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 - Original Message - > From: "James (Fei) Liu-SSI" <james@ssi.samsung.com> > To: "Casey Bodley" <cbod...@redhat.com> > Cc: "Haomai Wang" <haomaiw...@gmail.com>, ceph-devel@vger.kernel.org > Sent: Friday, September 11, 2015 1:18:31 PM > Subject: RE: About Fio backend with ObjectStore API > > Hi Casey, > You are right. I think the bottleneck is on the fio side rather than the > filestore side in this case. The fio did not issue the io commands fast > enough to saturate the filestore. > Here is one possible solution for it: create an async engine, which is > normally much faster than a sync engine in fio. > > Here is a possible framework. This new Objectstore-AIO engine in FIO in > theory will be much faster than the sync engine. Once we have a FIO which can > saturate newstore, memstore and filestore, we can investigate in > detail where the bottlenecks in their designs are. > > . > struct objectstore_aio_data { > struct aio_ctx *q_aio_ctx; > struct aio_completion_data *a_data; > aio_ses_ctx_t *p_ses_ctx; > unsigned int entries; > }; > ... > /* > * Note that the structure is exported, so that fio can get it via > * dlsym(..., "ioengine"); > */ > struct ioengine_ops us_aio_ioengine = { > .name = "objectstore-aio", > .version = FIO_IOOPS_VERSION, > .init = fio_objectstore_aio_init, > .prep = fio_objectstore_aio_prep, > .queue = fio_objectstore_aio_queue, > .cancel = fio_objectstore_aio_cancel, > .getevents = fio_objectstore_aio_getevents, > .event = fio_objectstore_aio_event, > .cleanup = fio_objectstore_aio_cleanup, > .open_file = fio_objectstore_aio_open, > .close_file = fio_objectstore_aio_close, > }; > > > Let me know what you think.
> > Regards, > James > > -Original Message- > From: Casey Bodley [mailto:cbod...@redhat.com] > Sent: Friday, September 11, 2015 7:28 AM > To: James (Fei) Liu-SSI > Cc: Haomai Wang; ceph-devel@vger.kernel.org > Subject: Re: About Fio backend with ObjectStore API > > Hi James, > > That's great that you were able to get fio-objectstore running! Thanks to you > and Haomai for all the help with testing. > > In terms of performance, it's possible that we're not handling the > completions optimally. When profiling with MemStore I remember seeing a > significant amount of cpu time spent in polling with > fio_ceph_os_getevents(). > > The issue with reads is more of a design issue than a bug. Because the test > starts with a mkfs(), there are no objects to read from initially. You would > just have to add a write job to run before the read job, to make sure that > the objects are initialized. Or perhaps the mkfs() step could be an optional > part of the configuration. > > Casey > > - Original Message - > From: "James (Fei) Liu-SSI" <james@ssi.samsung.com> > To: "Haomai Wang" <haomaiw...@gmail.com>, "Casey Bodley" <cbod...@redhat.com> > Cc: ceph-devel@vger.kernel.org > Sent: Thursday, September 10, 2015 8:08:04 PM > Subject: RE: About Fio backend with ObjectStore API > > Hi Casey and Haomai, > > We finally made fio-objectstore work on our end. Here is fio data > against filestore with a Samsung 850 Pro. It is sequential write, and the > performance is very poor, which is expected though. > > Run status group 0 (all jobs): > WRITE: io=524288KB, aggrb=9467KB/s, minb=9467KB/s, maxb=9467KB/s, > mint=55378msec, maxt=55378msec > > But anyway, it works, even though there are still some bugs to fix, like the read and > filesystem issues. Thanks a lot for your great work.
> > Regards, > James > > jamesliu@jamesliu-OptiPlex-7010:~/WorkSpace/ceph_casey/src$ sudo ./fio/fio > ./test/objectstore.fio > filestore: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, > ioengine=cephobjectstore, iodepth=1 fio-2.2.9-56-g736a Starting 1 process > test1 > filestore: Laying out IO file(s) (1 file(s) / 512MB) > 2015-09-10 16:55:40.614494 7f19d34d1840 1 filestore(/home/jamesliu/fio_ceph) > mkfs in /home/jamesliu/fio_ceph > 2015-09-10 16:55:40.614924 7f19d34d1840 1 filestore(/home/jamesliu/fio_ceph)
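The job file driven by the run above (test/objectstore.fio) is not shown in the thread. A plausible reconstruction is sketched below: only ioengine=cephobjectstore, rw=write, bs=128k, size 512MB, iodepth=1, and the job name "filestore" are implied by the fio output; any engine-specific options (such as where the engine finds its ceph configuration) are guesses and would need checking against the actual engine code.

```
# test/objectstore.fio -- illustrative reconstruction, not the original file
[global]
ioengine=cephobjectstore
rw=write
bs=128k
size=512m
iodepth=1

[filestore]
# engine-specific options (e.g. objectstore type, data/journal paths)
# would go here; names are not shown in the thread
```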
Re: Ceph Hackathon: More Memory Allocator Testing
We've frequently run fio + libosd (cohort ceph-osd linked as a library) with jemalloc preloaded, without problems. Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 - Original Message - > From: "Daniel Gryniewicz" <d...@redhat.com> > To: "Ceph Development" <ceph-devel@vger.kernel.org> > Sent: Thursday, September 3, 2015 9:06:47 AM > Subject: Re: Ceph Hackathon: More Memory Allocator Testing > > I believe preloading should work fine. It has been a common way to > debug buffer overruns using electric fence and similar tools for > years, and I have used it in large applications of similar size to > Ceph. > > Daniel > > On Thu, Sep 3, 2015 at 5:13 AM, Shinobu Kinjo <ski...@redhat.com> wrote: > > > > Preloading jemalloc after compiling with malloc > > > > $ cat hoge.c > > #include <stdlib.h> > > > > int main() > > { > > int *ptr = malloc(sizeof(int) * 10); > > > > if (ptr == NULL) > > exit(EXIT_FAILURE); > > free(ptr); > > } > > > > > > $ gcc ./hoge.c > > > > > > $ ldd ./a.out > > linux-vdso.so.1 (0x7fffe17e5000) > > libc.so.6 => /lib64/libc.so.6 (0x7fc989c5f000) > > /lib64/ld-linux-x86-64.so.2 (0x55a718762000) > > > > > > $ nm ./a.out | grep malloc > > U malloc@@GLIBC_2.2.5 // malloc > > loaded > > > > > > $ LD_PRELOAD=/usr/lib64/libjemalloc.so.1 \ > > > ldd a.out > > linux-vdso.so.1 (0x7fff7fd36000) > > /usr/lib64/libjemalloc.so.1 (0x7fe6ffe39000) // jemalloc > > loaded > > libc.so.6 => /lib64/libc.so.6 (0x7fe6ffa61000) > > libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe6ff844000) > > /lib64/ld-linux-x86-64.so.2 (0x560342ddf000) > > > > > > Logically it could work, but in the real world I'm not 100% sure if it works > > for a large-scale application.
> > > > Shinobu > > > > - Original Message - > > From: "Somnath Roy" <somnath@sandisk.com> > > To: "Alexandre DERUMIER" <aderum...@odiso.com> > > Cc: "Sage Weil" <s...@newdream.net>, "Milosz Tanski" <mil...@adfin.com>, > > "Shishir Gowda" <shishir.go...@sandisk.com>, "Stefan Priebe" > > <s.pri...@profihost.ag>, "Mark Nelson" <mnel...@redhat.com>, "ceph-devel" > > <ceph-devel@vger.kernel.org> > > Sent: Sunday, August 23, 2015 2:03:41 AM > > Subject: RE: Ceph Hackathon: More Memory Allocator Testing > > > > Need to see if the client is overriding the libraries built with different > > malloc libraries, I guess.. > > I am not sure whether in your case the benefit you are seeing is because qemu is > > more efficient with tcmalloc/jemalloc, or the entire client stack? > > > > -Original Message- > > From: Alexandre DERUMIER [mailto:aderum...@odiso.com] > > Sent: Saturday, August 22, 2015 9:57 AM > > To: Somnath Roy > > Cc: Sage Weil; Milosz Tanski; Shishir Gowda; Stefan Priebe; Mark Nelson; > > ceph-devel > > Subject: Re: Ceph Hackathon: More Memory Allocator Testing > > > > >>Wanted to know is there any reason we didn't link client libraries with > > >>tcmalloc at the first place (but did link only OSDs/mon/RGW) ? > > > > Do we need to link client libraries? > > > > I'm building qemu with jemalloc, and it seems to be enough. > > > > > > > > - Original Message - > > From: "Somnath Roy" <somnath@sandisk.com> > > To: "Sage Weil" <s...@newdream.net>, "Milosz Tanski" <mil...@adfin.com> > > Cc: "Shishir Gowda" <shishir.go...@sandisk.com>, "Stefan Priebe" > > <s.pri...@profihost.ag>, "aderumier" <aderum...@odiso.com>, "Mark Nelson" > > <mnel...@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org> > > Sent: Saturday, August 22, 2015 18:15:36 > > Subject: RE: Ceph Hackathon: More Memory Allocator Testing > > > > Yes, even today rocksdb is also linked with tcmalloc. It doesn't mean every > > application using rocksdb needs to be built with tcmalloc.
> > Sage, > > Wanted to know is there any reason we didn't link client libraries with > > tcmalloc
handle-based object store
(11:37:44 AM) mattbenjamin: sjusthm, cbodley: Casey and I think it might be useful to have a short video call on the meet points between object and collection handle as we did it, and the other objectstore changes; I don't know which aspects really should port over to master, but I think it would be useful to do a walk-through and discussion of what parts we could retarget, and anything that we could sequence cleanly later. (11:38:06 AM) mattbenjamin: sjusthm, cbodley: do you have a bit of time available? Some of the pieces we had: 1. the handle interface change itself 2. indexed slots for collection and object handles or ids (unions, iirc) in Transaction, and efficient operations to fill slots 3. probably more flexibility than needed in that every OS could completely redefine Collection and Object 4. lifecycle and refcounting which worked correctly 5. an Object hierarchy we actually used in our version of filestore, w/concurrent LRU system 6. a set of changes by Casey replacing the FDRef system w/management of objects--some of this could be useful, I don't know how it maps onto newstore at all 7. a unification of ObjectContext and opaque Object which we were debating in Oregon 8. thread-local caches of collections and objects above the OS interface that appeared to be a big help in IOPs work ok, apparently we're on for 11:00 am pst--I'll send an invite Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309
Re: Ceph Hackathon: More Memory Allocator Testing
Jemalloc 4.0 seems to have some shiny new capabilities, at least. Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 - Original Message - From: Shinobu Kinjo ski...@redhat.com To: Alexandre DERUMIER aderum...@odiso.com Cc: Stephen L Blinick stephen.l.blin...@intel.com, Somnath Roy somnath@sandisk.com, Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Sent: Thursday, August 20, 2015 8:54:59 AM Subject: Re: Ceph Hackathon: More Memory Allocator Testing Thank you for that result. So it might make sense to know the difference between jemalloc and jemalloc 4.0. Shinobu - Original Message - From: Alexandre DERUMIER aderum...@odiso.com To: Shinobu Kinjo ski...@redhat.com Cc: Stephen L Blinick stephen.l.blin...@intel.com, Somnath Roy somnath@sandisk.com, Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Sent: Thursday, August 20, 2015 5:17:46 PM Subject: Re: Ceph Hackathon: More Memory Allocator Testing Memory results of the osd daemon under load: jemalloc always uses more memory than tcmalloc; jemalloc 4.0 seems to reduce memory usage, but is still a little bit above tcmalloc. osd_op_threads=2 : tcmalloc 2.1 -- root 38066 2.3 0.7 1223088 505144 ? Ssl 08:35 1:32 /usr/bin/ceph-osd --cluster=ceph -i 4 -f root 38165 2.4 0.7 1247828 525356 ? Ssl 08:35 1:34 /usr/bin/ceph-osd --cluster=ceph -i 5 -f osd_op_threads=32: tcmalloc 2.1 -- root 39002 102 0.7 1455928 488584 ? Ssl 09:41 0:30 /usr/bin/ceph-osd --cluster=ceph -i 4 -f root 39168 114 0.7 1483752 518368 ? Ssl 09:41 0:30 /usr/bin/ceph-osd --cluster=ceph -i 5 -f osd_op_threads=2 jemalloc 3.5 - root 18402 72.0 1.1 1642000 769000 ? Ssl 09:43 0:17 /usr/bin/ceph-osd --cluster=ceph -i 0 -f root 18434 89.1 1.2 1677444 797508 ? Ssl 09:43 0:21 /usr/bin/ceph-osd --cluster=ceph -i 1 -f osd_op_threads=32 jemalloc 3.5 - root 17204 3.7 1.2 2030616 816520 ?
Ssl 08:35 2:31 /usr/bin/ceph-osd --cluster=ceph -i 0 -f root 17228 4.6 1.2 2064928 830060 ? Ssl 08:35 3:05 /usr/bin/ceph-osd --cluster=ceph -i 1 -f osd_op_threads=2 jemalloc 4.0 - root 19967 113 1.1 1432520 737988 ? Ssl 10:04 0:31 /usr/bin/ceph-osd --cluster=ceph -i 1 -f root 19976 93.6 1.0 1409376 711192 ? Ssl 10:04 0:26 /usr/bin/ceph-osd --cluster=ceph -i 0 -f osd_op_threads=32 jemalloc 4.0 - root 20484 128 1.1 1689176 778508 ? Ssl 10:06 0:26 /usr/bin/ceph-osd --cluster=ceph -i 0 -f root 20502 170 1.2 1720524 810668 ? Ssl 10:06 0:35 /usr/bin/ceph-osd --cluster=ceph -i 1 -f - Original Message - From: aderumier aderum...@odiso.com To: Shinobu Kinjo ski...@redhat.com Cc: Stephen L Blinick stephen.l.blin...@intel.com, Somnath Roy somnath@sandisk.com, Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Sent: Thursday, August 20, 2015 07:29:22 Subject: Re: Ceph Hackathon: More Memory Allocator Testing Hi, jemalloc 4.0 was released 2 days ago https://github.com/jemalloc/jemalloc/releases I'm curious to see the performance/memory usage improvement :) - Original Message - From: Shinobu Kinjo ski...@redhat.com To: Stephen L Blinick stephen.l.blin...@intel.com Cc: aderumier aderum...@odiso.com, Somnath Roy somnath@sandisk.com, Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Sent: Thursday, August 20, 2015 04:00:15 Subject: Re: Ceph Hackathon: More Memory Allocator Testing How about making a sheet of test patterns? Shinobu - Original Message - From: Stephen L Blinick stephen.l.blin...@intel.com To: Alexandre DERUMIER aderum...@odiso.com, Somnath Roy somnath@sandisk.com Cc: Mark Nelson mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org Sent: Thursday, August 20, 2015 10:09:36 AM Subject: RE: Ceph Hackathon: More Memory Allocator Testing Would it make more sense to try this comparison while changing the size of the worker thread pool? i.e.
changing osd_op_num_threads_per_shard and osd_op_num_shards (default is currently 2 and 5 respectively, for a total of 10 worker threads). Thanks, Stephen -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER Sent: Wednesday, August 19, 2015 11:47 AM To: Somnath Roy Cc: Mark Nelson; ceph-devel Subject: Re: Ceph Hackathon: More Memory Allocator Testing I have just done a small test with jemalloc, changing osd_op_threads
Re: Async reads, sync writes, op thread model discussion
Hi, I tend to agree with your comments regarding swapcontext/fibers. I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either. I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read and write transactions. I also would like to see how Yehuda's system works before arguing generalities. My intuition is, since the goal is more deterministic performance in a short horizon, you a. need to prioritize transparency over novel abstractions b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline My .05. Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 - Original Message - From: Milosz Tanski mil...@adfin.com To: Haomai Wang haomaiw...@gmail.com Cc: Yehuda Sadeh-Weinraub ysade...@redhat.com, Samuel Just sj...@redhat.com, Sage Weil s...@newdream.net, ceph-devel@vger.kernel.org Sent: Friday, August 14, 2015 4:56:26 PM Subject: Re: Async reads, sync writes, op thread model discussion On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang haomaiw...@gmail.com wrote: On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub ysade...@redhat.com wrote: Already mentioned it on irc, adding to ceph-devel for the sake of completeness. I did some infrastructure work for rgw and it seems (at least to me) that it could at least be partially useful here. Basically it's an async execution framework that utilizes coroutines. It's comprised of an aio notification manager that can also be tied into coroutine execution. The coroutines themselves are stackless; they are implemented as state machines, but using some boost trickery to hide the details so they can be written very similarly to blocking methods. Coroutines can also execute other coroutines and can be stacked, or can generate concurrent execution.
It's still somewhat in flux, but I think it's mostly done and already useful at this point, so if there's anything you could use it might be a good idea to avoid effort duplication. Coroutines like qemu's are cool. The only thing I'm afraid of is the complexity of debugging, and it's really a big task :-( I agree with sage that this design is really a new implementation for objectstore, so it's harmful to the existing objectstore impl. I also suffer the pain of sync xattr reads; we may add an async read interface to solve this? For the context switch thing, we now have at least 3 cs for one op on the osd side: messenger - op queue - objectstore queue. I guess op queue - objectstore is easier to kick off, just as sam said. We can make the journal write inline with queue_transaction, so the caller could directly handle the transaction right now. I would caution against coroutines (fibers), esp. in a multi-threaded environment. POSIX officially obsoleted the swapcontext family of functions in 1003.1-2004 and removed it in 1003.1-2008. That's because they were notoriously non-portable and buggy. And yes, you can use something like boost::context / boost::coroutine instead, but they also have platform limitations. These implementations tend to abuse / turn off various platform scrutiny features (like the one for setjmp/longjmp). And on top of that, many platforms don't consider alternative contexts, so you end up with obscure bugs. I've debugged my fair share of bugs in Mordor coroutines with C++ exceptions, and errno variables (since errno is really a function on linux, and its output, a pointer to the thread's errno, is marked pure) if your coroutine migrates threads. And you need to migrate them because of blocking and uneven processor/thread distribution. None of these are obstacles that can't be solved, but added together they become a pretty long-term liability. So think long and hard about it.
Qemu doesn't have some of those issues because it uses a single thread and a much simpler C ABI that it deals with. An alternative to coroutines that goes a long way towards solving the callback spaghetti problem is futures/promises. I'm not talking of the bare future model that exists in the C++11 standard library, but more along the lines of what exists in other languages (like what's being done in JavaScript today). There's a good implementation of it in Folly (the facebook c++11 library). They have a very nice piece of documentation to help understand how they work and how they differ. That future model is very handy when dealing with the callback control flow problem. You can chain a bunch of processing steps that require some async action, return a future, and continue, so on and so forth. Also, it makes handling complex error cases easy by giving you a way to skip lots of processing steps straight to onError at the end of the chain. Take a look at folly. Take
Re: bufferlist allocation optimization ideas
We explored a number of these ideas. We have a few branches that might be picked over. Having said that, our feeling was that the generality to span shared and non-shared cases transparently has cost in the unmarked case. Other aspects of the buffer indirection are essential (e.g., Accelio-originated buffers, etc). We see a large contribution from ptr::release in perf. One of the main aspirations we had was to identify code paths which would never share buffers, and not pay for sharing in those paths. To the degree that bufferlist is frequently used as a kind of flexible string class, while other code uses it as a smart tailq of iovec or struct uio, there is client code with disjoint assumptions. As mentioned, shared vs. non-shared code paths are similarly disjoint. I'm not certain what the consequence here is. Ceph code gets a lot of simplification from this idiom, but it is not minimalist. We found ways, as Piotr suggested, to avoid allocations of groups of objects related to a message, and this had a lot of impact. We're trying to merge some of that soon. Matt - Original Message - From: Sage Weil sw...@redhat.com To: Piotr Dałek piotr.da...@ts.fujitsu.com Cc: ceph-devel@vger.kernel.org Sent: Monday, August 10, 2015 3:39:56 PM Subject: RE: bufferlist allocation optimization ideas On Mon, 10 Aug 2015, Dałek, Piotr wrote: This is a pretty low-level approach; what I was actually wondering is whether we can reduce the amount of memory (de)allocations at a higher level, like improving the message lifecycle logic (from receiving to performing the actual operation and finishing it), so it wouldn't involve so many allocations and deallocations. Reducing memory allocation at a low level will help, no doubts about this, but we can probably improve at a higher level and not risk breaking more than we need. Yes, definitely! I think we should pursue both...
sage With best regards / Pozdrawiam Piotr Dałek -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, August 10, 2015 9:20 PM To: ceph-devel@vger.kernel.org Subject: bufferlist allocation optimization ideas Currently putting something in a bufferlist involves 3 allocations: 1. raw buffer (posix_memalign, or new char[]) 2. buffer::raw (this holds the refcount; lifecycle matches the raw buffer exactly) 3. bufferlist's STL list node, which embeds buffer::ptr --- combine buffer and buffer::raw --- This should be a pretty simple patch, and turns 2 allocations into one. Most buffers are constructed/allocated via buffer::create_*() methods. Those each look something like buffer::raw* buffer::create(unsigned len) { return new raw_char(len); } where raw_char::raw_char() allocates the actual buffer. Instead, allocate sizeof(raw_char_combined) + len, and use the right magic C++ syntax to call the constructor on that memory. Something like raw_char_combined *foo = new (ptr) raw_char_combined(ptr); where the raw_char_combined constructor is smart enough to figure out that data goes at ptr + sizeof(*this). That takes us from 3 to 2 allocations. An open question is whether this is always a good idea, or whether there are cases where 2 allocations are better, e.g. when len is exactly one page, and we're better off with a mempool allocation for raw and page separately. Or maybe for very large buffers? I'm really not sure what would be better... --- make bufferlist use boost::intrusive::list --- Most buffers exist in only one list, so the indirection through the ptr is mostly wasted. 1. embed a boost::intrusive::list node into buffer::ptr. (Note that doing just this buys us nothing... we are just allocating ptr's and using the intrusive node instead of the list node with an embedded ptr.) 2.
embed a ptr in buffer::raw (or raw_char_combined) When adding a buffer to the bufferlist, we use the raw_char_combined's embedded ptr if it is available. Otherwise, we allocate one as before. This would need some careful adjustment of the common append() paths, since they currently are all ptr-based. One way to make this work well might be to embed N ptr's in raw_char_combined, on the assumption that the refcount for a buffer is never more than 2 or 3. Only in extreme cases will we need to explicitly allocate ptr's. Thoughts? sage
Re: [michigan-eng] cmake and gitbuilder, juntos
(fyi, ceph-devel, this was an irc discussion about enhancing gitbuilder, a temporary blocker for cmake)

(05:36:09 PM) sjusthm: sage mattbenjamin: so that means we should adapt gitbuilder to use cmake, right?
(05:36:17 PM) sjusthm: in the immediate term?
(05:36:23 PM) sjusthm: since we want to switch to cmake anyway
(05:36:26 PM) sjusthm: and we need it for C++11?
(05:39:50 PM) mattbenjamin: sjusthm: that sounds correct
[snippage]
(05:47:44 PM) sjusthm: mattbenjamin: oh, I'd be fine with doing cmake first
(05:47:52 PM) sjusthm: no one actually *likes* messing with automake

- Original Message -
From: Matt Benjamin mbenja...@redhat.com
To: michigan-...@redhat.com
Sent: Monday, August 3, 2015 5:21:38 PM
Subject: [michigan-eng] cmake and gitbuilder, juntos

(04:53:04 PM) mattbenjamin: gitbuilder doesn't understand cmake; I heard someone (sam?) talk about gitbuilder eol--but that's not soon?
(04:54:45 PM) mattbenjamin: this is apropos of: casey's c++11 change includes automake logic to get around that
(04:58:44 PM) sage: mattbenjamin: yeah, we'll need to change all of the build tooling (gitbuilder and ceph-build.git) to use cmake
(04:58:54 PM) mattbenjamin: ok
(04:59:01 PM) sage: it'll be a while before we phase out gitbuilder
(04:59:51 PM) joshd: mattbenjamin: gitbuilder just runs a script you give it - it has no knowledge of build systems. it'll involve replacing parts of scripts like https://github.com/ceph/autobuild-ceph/blob/master/build-ceph.sh
(05:00:48 PM) mattbenjamin: tx
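[Editor's note] joshd's point is that gitbuilder just runs whatever script it is given, so the switch amounts to swapping the autotools steps in that script for cmake ones. A hypothetical fragment, not the actual autobuild-ceph change (directory names and cmake flags are illustrative assumptions, and it presumes the cmake make-check work discussed upthread has landed):

```shell
#!/bin/sh -ex
# Illustrative out-of-source cmake build replacing the autogen/configure
# steps in a build-ceph.sh-style gitbuilder script. Flags are assumptions.
rm -rf build
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make -j"$(nproc)"
make check    # the cmake equivalent of automake's "make check"
```

An out-of-source build like this also keeps built objects out of src/, which is the mutation problem raised earlier in the thread.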
Re: Cluster Network Public Network w.r.t XIO ?
Hi Neo,

On our formerly-internal firefly-based branch, what we did was create additional Messenger instances ad infinitum, which at least let you do this, but it's not what anybody wanted for upstream or long-term. What's upstream now doesn't, IIRC, let you describe that. The rdma_local parameter, like you say, is insufficient (and actually a hack).

What we plan to do (and have in progress) is to extend work Sage started on wip-address, which will enable multi-homing and identify Messenger instances by their transport type(s). We might put more information there to help with future topologies. Improved configuration language to let you describe your desired network setup would be packaged with that.

The plan is that an improved situation might arrive as early as J. If we need an interim method, now would be a good time to start the discussion.

Matt

- Original Message -
From: kernel neophyte neophyte.hacker...@gmail.com
To: v...@mellanox.com, raju kurunkad raju.kurun...@sandisk.com, ceph-devel@vger.kernel.org
Sent: Thursday, July 30, 2015 11:21:06 PM
Subject: Cluster Network Public Network w.r.t XIO ?

Hi Vu, Raju,

I am trying to bring up a ceph cluster on a powerful Dell server with two 40GbE RoCEv2 NICs. I have assigned one as my cluster network (I would prefer all OSD communication happen on that) and the other as my public n/w. This works fine for the simple messenger case (of course, no RDMA). But when I try to bring this up on XIO, it gets very complicated: how do I specify two RDMA_LOCALs, one for the cluster n/w and the other for the public? Can I choose XIO for client-to-OSD communication and simple for the cluster n/w?

Any thoughts?
Thanks,
Neo
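[Editor's note] For reference, the dual-network split Neo describes (which works with the simple messenger) is the standard ceph.conf setup; the subnets here are illustrative:

```ini
[global]
# client-facing traffic
public network = 10.0.0.0/24
# OSD replication and heartbeat traffic
cluster network = 192.168.0.0/24
```

As Matt's reply notes, there is no analogous upstream configuration yet for binding XIO/RDMA endpoints per network; that is what the wip-address multi-homing work is intended to address.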