Re: cmake

2015-12-16 Thread Matt Benjamin
Hi,

responding to all these at once.

- Original Message -
> From: "Yehuda Sadeh-Weinraub" <yeh...@redhat.com>
> To: "Sage Weil" <sw...@redhat.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Wednesday, December 16, 2015 1:45:54 PM
> Subject: Re: cmake
> 
> On Wed, Dec 16, 2015 at 9:33 AM, Sage Weil <sw...@redhat.com> wrote:
> > The work to transition to cmake has stalled somewhat.  I've tried to use
> > it a few times but keep running into issues that make it unusable for me.
> > Not having make check is a big one, but I think the hackery required to
> > get that going points to the underlying problem(s).

I'm going to push for cmake work already in progress to be moved to the next 
milestone ASAP.

With respect to "make check" blockers, which contains the issue of where cmake 
puts built objects.  Ali, Casey, and I discussed this today at some length.  We 
think the current "hackery" to make cmake make check work "the same way" auto* 
did is long-term undesirable due to it mutating files in the src dir.  I have 
not assumed that it would be an improvement to put all objects built in a tree 
of submakes into a single dir, as automake does.  I do think it is essential 
that at least eventually, it makes it simple to operate on any object that is 
built, and simple to extend processes like make check.
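
For what it's worth, cmake does have standard variables for collecting build 
outputs in one place; a rough, untested sketch (nothing Ceph-specific here, 
just stock cmake behavior when no per-target property overrides it):

mkdir build && cd build
cmake \
  -DCMAKE_RUNTIME_OUTPUT_DIRECTORY=$(pwd)/bin \
  -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=$(pwd)/lib \
  ..
make -j$(nproc)
# binaries would then land under build/bin, so e.g.:
PATH=$(pwd)/bin:$PATH ../qa/workunits/cephtool/test.sh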

Ali and Casey agree, but contend that the current make check work is "almost 
finished"--specifically, that it could be finished and a PR sent -this week-.  
Rewriting it will take additional time.  They propose starting with finishing 
and documenting the current setup, then doing a larger cleanup.
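
If the tests end up registered with ctest, I'd expect the eventual analogue of 
the automake "cd src && make check" flow to be something like this (a sketch 
only, not a description of the PR Ali and Casey have in flight):

cd build
ctest -j$(nproc) --output-on-failure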

What do others think?

Matt

> >
> > It seems like the main problem is that automake puts all build targets in
> > src/ and cmake spreads them all over build/*.  This means that you can't
> > just add ./ to anything that would normally be in your path (or,
> > PATH=.:$PATH, and then run, say, ../qa/workunits/cephtool/test.sh).
> > There's a bunch of kludges in vstart.sh to make it work that I think
> > mostly point to this issue (and the .libs things).  Is there simply an
> > option we can give cmake to make it put built binaries directly in build/?
> >
> > Stepping back a bit, it seems like the goals should be
> >
> > 1. Be able to completely replace autotools.  I don't fancy maintaining
> > both in parallel.
> >
> 
> Is cmake a viable option in all environments we expect ceph (or any
> part of it) to be compiled on? (e.g. aix, solaris, freebsd, different
> linux arm distros, etc.)

One cannot expect cmake to be pre-installed on those platforms, but it will 
work on every one you mentioned, and on some others, not to mention Windows.

> 
> > 2. Be able to run vstart etc from the build dir.
> 
> There's an awful hack currently in vstart.sh and stop.sh that checks
> for CMakeCache.txt in the current work directory to verify whether we
> built using cmake or autotools. Can we make this go away?
> We can do something like having the build system create a
> 'ceph-setenv.sh' script that would set the env (or open a shell) with
> the appropriate paths.
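
That seems reasonable to me.  As a rough sketch of what such a generated script 
might look like (name and contents hypothetical; nothing like this exists in 
the tree today):

# ceph-setenv.sh -- source it from the build dir:  . ./ceph-setenv.sh
build_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
export PATH="$build_dir/bin:$PATH"
export LD_LIBRARY_PATH="$build_dir/lib:$LD_LIBRARY_PATH"
export CEPH_CONF="$build_dir/ceph.conf"   # assumes vstart.sh writes its conf here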



> 
> >
> > 3. Be able to run ./ceph[-anything] from the build dir, or put the build
> > dir in the path.  (I suppose we could rely on a make install step, but
> > that seems like more hassle... hopefully it's not necessary?)
> >
> > 4. make check has to work
> >
> > 5. Use make-dist.sh to generate a release tarball (not make dist)
> >
> > 6. gitbuilders use make-dist.sh and cmake to build packages
> >
> > 7. release process uses make-dist.sh and cmake to build a release
> >
> > I'm probably missing something?
> >
> > Should we set a target of doing the 10.0.2 or .3 with cmake?
> >
> > sage
> 

-- 
-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


Re: Improving Data-At-Rest encryption in Ceph

2015-12-15 Thread Matt Benjamin
Hi,

Thanks for this detailed response.

- Original Message -
> From: "Lars Marowsky-Bree" <l...@suse.com>
> To: "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, December 15, 2015 9:23:04 AM
> Subject: Re: Improving Data-At-Rest encryption in Ceph

> 
> It's not yet perfect, but I think the approach is superior to being
> implemented in Ceph natively. If there's any encryption that should be
> implemented in Ceph, I believe it'd be the on-the-wire encryption to
> protect against eavesdroppers.

++

> 
> Other scenarios would require client-side encryption.

++

> 
> > Cryptographic keys are stored on the filesystem of the storage node that
> > hosts the OSDs. Changing them requires redeploying the OSDs.
> 
> This is solvable by storing the key on an external key server.

++

> 
> Changing the key is only necessary if the key has been exposed. And with
> dm-crypt, that's still possible - it's not the actual encryption key
> that's stored, but the secret that is needed to unlock it, and that can
> be re-encrypted quite fast. (In theory; it's not implemented yet for
> the Ceph OSDs.)
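
Right--and with LUKS the data key itself never changes; what rotates is the 
passphrase/secret wrapping it, which is cheap.  Purely illustrative (none of 
this is wired into ceph-disk today):

cryptsetup luksChangeKey /dev/sdX1       # swap the wrapping passphrase in place
cryptsetup luksKillSlot  /dev/sdX1 0     # or retire a compromised key slot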
> 
> 
> > Data incoming from Ceph clients would be encrypted by primary OSD. It
> > would replicate ciphertext to non-primary members of an acting set.
> 
> This still exposes data in coredumps or on swap on the primary OSD, and
> metadata on the secondaries.
> 
> 
> Regards,
> Lars
> 
> --
> Architect Storage/HA
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB
> 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> 


-- 
-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


Re: queue_transaction interface + unique_ptr + performance

2015-12-03 Thread Matt Benjamin
++ #1

- Original Message -
> From: "Sage Weil" <s...@newdream.net>
> To: "Somnath Roy" <somnath@sandisk.com>
> Cc: "Samuel Just (sam.j...@inktank.com)" <sam.j...@inktank.com>, 
> ceph-devel@vger.kernel.org
> Sent: Thursday, December 3, 2015 6:50:26 AM
> Subject: RE: queue_transaction interface + unique_ptr + performance
> 
> 1- I agree we should avoid shared_ptr whenever possible.
> 
> 
> sage

-- 
-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


Re: cmake

2015-12-03 Thread Matt Benjamin
I always run cmake from a build directory which is not the source root--usually 
a "build" dir in the root--so my minimal invocation would be "mkdir build; cd 
build; cmake ../src".  I'd at least try that, though I wouldn't have thought 
build location could affect something this basic (and if it does, that would be 
a bug).

Matt

- Original Message -
> From: "Pete Zaitcev" <zait...@redhat.com>
> To: ceph-devel@vger.kernel.org
> Sent: Thursday, December 3, 2015 5:24:36 PM
> Subject: cmake
> 
> Dear All:
> 
> I'm trying to run cmake, in order to make sure my patches do not break it
> (in particular WIP 5073 added source files). Result looks like this:
> 
> [zaitcev@lembas ceph-tip]$ cmake src
> -- The C compiler identification is GNU 5.1.1
> -- The CXX compiler identification is GNU 5.1.1
> -- Check for working C compiler: /usr/bin/cc
> -- Check for working C compiler: /usr/bin/cc -- works
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Check for working CXX compiler: /usr/bin/c++
> -- Check for working CXX compiler: /usr/bin/c++ -- works
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> CMake Error at CMakeLists.txt:1 (include):
>   include could not find load file:
> 
> GetGitRevisionDescription
> 
> 
> -- The ASM compiler identification is GNU
> -- Found assembler: /usr/bin/cc
> CMake Warning (dev) at CMakeLists.txt:11 (add_definitions):
>   Policy CMP0005 is not set: Preprocessor definition values are now escaped
>   automatically.  Run "cmake --help-policy CMP0005" for policy details.  Use
>   the cmake_policy command to set the policy and suppress this warning.
> This warning is for project developers.  Use -Wno-dev to suppress it.
> 
> CMake Warning (dev) at CMakeLists.txt:12 (add_definitions):
>   Policy CMP0005 is not set: Preprocessor definition values are now escaped
>   automatically.  Run "cmake --help-policy CMP0005" for policy details.  Use
>   the cmake_policy command to set the policy and suppress this warning.
> This warning is for project developers.  Use -Wno-dev to suppress it.
> 
> --  we do not have a modern/working yasm
> -- Performing Test COMPILER_SUPPORTS_CXX11
> -- Performing Test COMPILER_SUPPORTS_CXX11 - Success
> CMake Error at CMakeLists.txt:95 (get_git_head_revision):
>   Unknown CMake command "get_git_head_revision".
> 
> 
> CMake Warning (dev) in CMakeLists.txt:
>   No cmake_minimum_required command is present.  A line of code such as
> 
> cmake_minimum_required(VERSION 3.3)
> 
>   should be added at the top of the file.  The version specified may be lower
>   if you wish to support older CMake versions for this project.  For more
>   information run "cmake --help-policy CMP".
> This warning is for project developers.  Use -Wno-dev to suppress it.
> 
> -- Configuring incomplete, errors occurred!
> See also "/q/zaitcev/ceph/ceph-tip/CMakeFiles/CMakeOutput.log".
> [zaitcev@lembas ceph-tip]$ rpm -qa | grep -i cmake
> extra-cmake-modules-5.16.0-1.fc23.noarch
> cmake-3.3.2-1.fc23.x86_64
> [zaitcev@lembas ceph-tip]$
> 
> Is this expected? Is my cmake incantation wrong?
> 
> Thanks,
> -- Pete
> 

-- 
-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


Re: cmake

2015-12-03 Thread Matt Benjamin
sorry, "cmake .." for Ceph's setup.

Matt

- Original Message -
> From: "Matt Benjamin" <mbenja...@redhat.com>
> To: "Pete Zaitcev" <zait...@redhat.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, December 3, 2015 5:30:28 PM
> Subject: Re: cmake
> 
> I always run cmake from a build directory which is not the root, usually
> "build" in the root, so my minimal invocation would be "mkdir build; cd
> build; cmake ../src"--I'd at least try that, though I wouldn't have thought
> build location could affect something this basic (and it would be a bug).
> 
> Matt
> 
> - Original Message -
> > From: "Pete Zaitcev" <zait...@redhat.com>
> > To: ceph-devel@vger.kernel.org
> > Sent: Thursday, December 3, 2015 5:24:36 PM
> > Subject: cmake
> > 
> > Dear All:
> > 
> > I'm trying to run cmake, in order to make sure my patches do not break it
> > (in particular WIP 5073 added source files). Result looks like this:
> > 
> > [zaitcev@lembas ceph-tip]$ cmake src
> > -- The C compiler identification is GNU 5.1.1
> > -- The CXX compiler identification is GNU 5.1.1
> > -- Check for working C compiler: /usr/bin/cc
> > -- Check for working C compiler: /usr/bin/cc -- works
> > -- Detecting C compiler ABI info
> > -- Detecting C compiler ABI info - done
> > -- Detecting C compile features
> > -- Detecting C compile features - done
> > -- Check for working CXX compiler: /usr/bin/c++
> > -- Check for working CXX compiler: /usr/bin/c++ -- works
> > -- Detecting CXX compiler ABI info
> > -- Detecting CXX compiler ABI info - done
> > -- Detecting CXX compile features
> > -- Detecting CXX compile features - done
> > CMake Error at CMakeLists.txt:1 (include):
> >   include could not find load file:
> > 
> > GetGitRevisionDescription
> > 
> > 
> > -- The ASM compiler identification is GNU
> > -- Found assembler: /usr/bin/cc
> > CMake Warning (dev) at CMakeLists.txt:11 (add_definitions):
> >   Policy CMP0005 is not set: Preprocessor definition values are now escaped
> >   automatically.  Run "cmake --help-policy CMP0005" for policy details.
> >   Use
> >   the cmake_policy command to set the policy and suppress this warning.
> > This warning is for project developers.  Use -Wno-dev to suppress it.
> > 
> > CMake Warning (dev) at CMakeLists.txt:12 (add_definitions):
> >   Policy CMP0005 is not set: Preprocessor definition values are now escaped
> >   automatically.  Run "cmake --help-policy CMP0005" for policy details.
> >   Use
> >   the cmake_policy command to set the policy and suppress this warning.
> > This warning is for project developers.  Use -Wno-dev to suppress it.
> > 
> > --  we do not have a modern/working yasm
> > -- Performing Test COMPILER_SUPPORTS_CXX11
> > -- Performing Test COMPILER_SUPPORTS_CXX11 - Success
> > CMake Error at CMakeLists.txt:95 (get_git_head_revision):
> >   Unknown CMake command "get_git_head_revision".
> > 
> > 
> > CMake Warning (dev) in CMakeLists.txt:
> >   No cmake_minimum_required command is present.  A line of code such as
> > 
> > cmake_minimum_required(VERSION 3.3)
> > 
> >   should be added at the top of the file.  The version specified may be
> >   lower
> >   if you wish to support older CMake versions for this project.  For more
> >   information run "cmake --help-policy CMP".
> > This warning is for project developers.  Use -Wno-dev to suppress it.
> > 
> > -- Configuring incomplete, errors occurred!
> > See also "/q/zaitcev/ceph/ceph-tip/CMakeFiles/CMakeOutput.log".
> > [zaitcev@lembas ceph-tip]$ rpm -qa | grep -i cmake
> > extra-cmake-modules-5.16.0-1.fc23.noarch
> > cmake-3.3.2-1.fc23.x86_64
> > [zaitcev@lembas ceph-tip]$
> > 
> > Is this expected? Is my cmake incantation wrong?
> > 
> > Thanks,
> > -- Pete
> > 
> 
> --
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
> 
> http://www.redhat.com/en/technologies/storage
> 
> tel.  734-707-0660
> fax.  734-769-8938
> cel.  734-216-5309
> 

-- 
-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


Re: cmake

2015-12-03 Thread Matt Benjamin
Pete,

Could you share the branch you are trying to build?  (ceph/wip-5073 would not 
appear to be it.)
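
In the meantime, a quick way to check whether the module is present on whatever 
branch you have checked out (the in-tree path varies, so just grep for it):

git ls-files | grep -i getgitrevisiondescription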

Matt

- Original Message -
> From: "Pete Zaitcev" <zait...@redhat.com>
> To: "Adam C. Emerson" <aemer...@redhat.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, December 3, 2015 7:03:47 PM
> Subject: Re: cmake
> 
> On Thu, 3 Dec 2015 17:30:21 -0500
> "Adam C. Emerson" <aemer...@redhat.com> wrote:
> 
> > On 03/12/2015, Pete Zaitcev wrote:
> 
> > > I'm trying to run cmake, in order to make sure my patches do not break it
> > > (in particular WIP 5073 added source files). Result looks like this:
> > > 
> > > [zaitcev@lembas ceph-tip]$ cmake src
> > 
> > I believe the problem is 'cmake src'
> 
> Thanks for the tip about the separate build directory and the top-level
> CMakeLists.txt. However, it still fails like this:
> 
> [zaitcev@lembas build]$ cmake ..
> CMake Error at CMakeLists.txt:1 (include):
>   include could not find load file:
> 
> GetGitRevisionDescription
> ...
> 
> Do you know by any chance where it gets that include? Also, what's
> your cmake --version?
> 
> Greetings,
> -- Pete
> 

-- 
-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


Re: ack vs commit

2015-12-03 Thread Matt Benjamin
 the same
> > portion of the file and sees the file content from before client A's
> > change.  The MDS is extremely careful about this on the metadata side: no
> > side-effects of one client are visible to any other client until they are
> > durable, so that a combination MDS and client failure will never make
> > things appear to go back in time.
> >
> > Any opinions here?  My inclination is to remove the functionality (less
> > code, less complexity, more sane semantics), but we'd be closing the door
> > on what might have been a half-decent idea (separating serialization from
> > durability when multiple clients have the same file open for
> > read/write)...
> 
> I've considered this briefly in the past, but I'd really rather we keep it:
> 
> 1) While we don't make much use of it right now, I think it's a useful
> feature for raw RADOS users
> 
> 2) It's an incredibly useful protocol semantic for future performance
> work of the sort that makes Sam start to cry, but which I find very
> interesting. Consider a future when RBD treats the OSDs more like a
> disk with a cache, and is able to send out operations to get them out
> of local memory, without forcing them instantly to permanent storage.
> Similarly, I think as soon as we have a backend that lets us use a bit
> of 3D Crosspoint in a system, we'll wish we had this functionality
> again.
> (Likewise with CephFS, where many many users will have different
> consistency expectations thanks to NFS and other parallel FSes which
> really aren't consistent.)
> 
> Maybe we just think it's not worth it and we'd rather throw this out.
> But the only complexity this really lets us drop is the OSD replay
> stuff (which realistically I can't assess) — the dual acks is not hard
> to work with, and doesn't get changed much anyway. If we drop this now
> and decide to bring back any similar semantic in the future I think
> that'll be a lot harder than simply carrying it, both in terms of
> banging it back into the code, and especially in terms of getting it
> deployed to all the clients in the world again.
> -Greg
> 

-- 
-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-23 Thread Matt Benjamin
For hacking around, put  "Graceless = true;" in the NFSV4 block.

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
> From: "Daniel Gryniewicz" <d...@redhat.com>
> To: "John Spray" <jsp...@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>, "Stefan Hajnoczi" 
> <shajn...@redhat.com>
> Sent: Friday, October 23, 2015 12:34:42 PM
> Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha)
> 
> On Fri, Oct 23, 2015 at 9:27 AM, John Spray <jsp...@redhat.com> wrote:
> >  * NFS writes from the guest are lagging for like a minute before
> > completing, my hunch is that this is something in the NFS client
> > recovery stuff (in ganesha) that's not coping with vsock, the
> > operations seem to complete at the point where the server declares
> > itself "NOT IN GRACE".
> 
> 
> Ganesha always starts in Grace, and will not process new clients until
> it exits Grace.  Existing clients should re-connect fine, and new
> clients work fine after Grace is exited.
> 
> Dan


Re: newstore direction

2015-10-20 Thread Matt Benjamin
We mostly assumed that sort-of transactional file systems, perhaps hosted in 
user space, were the most tractable trajectory.  I have seen newstore and 
keyvalue store as essentially congruent approaches using database primitives 
(and I am interested in what you make of Russell Sears).  I'm skeptical of any 
hope of keeping things "simple."  Like Martin downthread, most systems I have 
seen (filers, ZFS) make use of a fast, durable commit log and then flex 
out...something else.

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


- Original Message -
> From: "Sage Weil" <sw...@redhat.com>
> To: "John Spray" <jsp...@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, October 20, 2015 4:00:23 PM
> Subject: Re: newstore direction
> 
> On Tue, 20 Oct 2015, John Spray wrote:
> > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sw...@redhat.com> wrote:
> > >  - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put metadata
> > > on
> > > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > > rgw index data or cephfs metadata?  Suddenly we are pulling storage out
> > > of
> > > a different pool and those aren't currently fungible.
> > 
> > This is the concerning bit for me -- the other parts one "just" has to
> > get the code right, but this problem could linger and be something we
> > have to keep explaining to users indefinitely.  It reminds me of cases
> > in other systems where users had to make an educated guess about inode
> > size up front, depending on whether you're expecting to efficiently
> > store a lot of xattrs.
> > 
> > In practice it's rare for users to make these kinds of decisions well
> > up-front: it really needs to be adjustable later, ideally
> > automatically.  That could be pretty straightforward if the KV part
> > was stored directly on block storage, instead of having XFS in the
> > mix.  I'm not quite up with the state of the art in this area: are
> > there any reasonable alternatives for the KV part that would consume
> > some defined range of a block device from userspace, instead of
> > sitting on top of a filesystem?
> 
> I agree: this is my primary concern with the raw block approach.
> 
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
> 
> I see two basic options:
> 
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
> 
> 2) Use something like dm-thin to sit between the raw block device and XFS
> (for rocksdb) and the block device consumed by newstore.  As long as XFS
> doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> files in their entirety) we can fstrim and size down the fs portion.  If
> we similarly make newstores allocator stick to large blocks only we would
> be able to size down the block portion as well.  Typical dm-thin block
> sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> me.  In fact, we could likely just size the fs volume at something
> conservatively large (like 90%) and rely on -o discard or periodic fstrim
> to keep its actual utilization in check.
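
(For concreteness, the dm-thin arrangement in (2) is basically just LVM thin 
provisioning underneath; illustrative commands only, nothing Ceph-specific, and 
all names made up:)

lvcreate -L 100G --thinpool tp0 vg0        # thin pool on the raw device's VG
lvcreate -V 90G --thin -n osd-fs vg0/tp0   # oversubscribed volume for XFS/rocksdb
mkfs.xfs /dev/vg0/osd-fs
mount -o discard /dev/vg0/osd-fs /var/lib/ceph/osd/ceph-0
fstrim /var/lib/ceph/osd/ceph-0            # periodic trim returns space to the pool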
> 
> sage


Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread Matt Benjamin
Hi Bruce,

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
> From: "J. Bruce Fields" <bfie...@redhat.com>
> To: "Matt Benjamin" <mbenja...@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>, "Stefan Hajnoczi" 
> <stefa...@redhat.com>, "Sage Weil"
> <sw...@redhat.com>
> Sent: Monday, October 19, 2015 11:58:45 AM
> Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha)
> 
> On Mon, Oct 19, 2015 at 11:49:15AM -0400, Matt Benjamin wrote:
> > - Original Message -
> > > From: "J. Bruce Fields" <bfie...@redhat.com>
> ...
> > > 
> > > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> > > > Hi devs (CC Bruce--here is a use case for vmci sockets transport)
> > > > 
> > > > One of Sage's possible plans for Manilla integration would use nfs over
> > > > the
> > > > new Linux  vmci sockets transport integration in qemu (below) to access
> > > > Cephfs via an nfs-ganesha server running in the host vm.
> > > 
> > > What does "the host vm" mean, and why is this a particularly useful
> > > configuration?
> > 
> > Sorry, I should say, "the vm host."
> 
> Got it, thanks!
> 
> > I think the claimed utility here is (at least) three-fold:
> > 
> > 1. simplified configuration on host and guests
> > 2. some claim to improved security through isolation
> 
> So why is it especially interesting to put Ceph inside the VM and
> Ganesha outside?

Oh, sorry.  Here Ceph (or Gluster, or whatever underlying FS provider) is 
conceptually outside the vm complex altogether, Ganesha is re-exporting on the 
vm host, and guests access the namespace using NFSv4.1.

Regards,

Matt

> 
> > 3. some expectation of improved latency/performance wrt TCP
> > 
> > Stefan sent a link to a set of slides with his original patches.  Did you
> > get a chance to read through those?
> > 
> > [1]
> > http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf
> 
> Yep, thanks.--b.
> 
> > 
> > Regards,
> > 
> > Matt
> > 
> > > 
> > > --b.
> > > 
> > > > 
> > > > This now experimentally works.
> > > > 
> > > > some notes on running nfs-ganesha over AF_VSOCK:
> > > > 
> > > > 1. need stefan hajnoczi's patches for
> > > > * linux kernel (and build w/vhost-vsock support
> > > > * qemu (and build w/vhost-vsock support)
> > > > * nfs-utils (in vm guest)
> > > > 
> > > > all linked from https://github.com/stefanha?tab=repositories
> > > > 
> > > > 2. host and vm guest kernels must include vhost-vsock
> > > > * host kernel should load vhost-vsock.ko
> > > > 
> > > > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci
> > > > device, e.g
> > > > 
> > > > /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1
> > > > --enable-kvm -drive
> > > > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive
> > > > file=/opt/isos/f22.iso,media=cdrom -net
> > > > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0
> > > > -parallel none -serial mon:stdio -device
> > > > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c
> > > > 
> > > > 4. nfs-gansha (in host)
> > > > * need nfs-ganesha and its ntirpc rpc provider with vsock support
> > > > https://github.com/linuxbox2/nfs-ganesha (vsock branch)
> > > > https://github.com/linuxbox2/ntirpc (vsock branch)
> > > > 
> > > > * configure ganesha w/vsock support
> > > > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON
> > > > -DUSE_VSOCK
> > > > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
> > > > 
> > > > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block
> > > > 
> > > > 5. mount in guest w/nfs41:
> > > > (e.g., in fstab)
> > > > 2:// /vsock41 nfs
> > > > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
> > > > 0 0
> > > > 
> > > > If you try this, send feedback.
> > > > 
> > > > Thanks!
> > > > 
> > > > Matt
> > > > 
> > > > --
> > > > Matt Benjamin
> > > > Red Hat, Inc.
> > > > 315 West Huron Street, Suite 140A
> > > > Ann Arbor, Michigan 48103
> > > > 
> > > > http://www.redhat.com/en/technologies/storage
> > > > 
> > > > tel.  734-707-0660
> > > > fax.  734-769-8938
> > > > cel.  734-216-5309
> > > > 


Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread Matt Benjamin
Hi Bruce,

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
> From: "J. Bruce Fields" <bfie...@redhat.com>
> To: "Matt Benjamin" <mbenja...@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>, "Stefan Hajnoczi" 
> <stefa...@redhat.com>, "Sage Weil"
> <sw...@redhat.com>
> Sent: Monday, October 19, 2015 11:13:52 AM
> Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha)
> 
> On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> > Hi devs (CC Bruce--here is a use case for vmci sockets transport)
> > 
> > One of Sage's possible plans for Manilla integration would use nfs over the
> > new Linux  vmci sockets transport integration in qemu (below) to access
> > Cephfs via an nfs-ganesha server running in the host vm.
> 
> What does "the host vm" mean, and why is this a particularly useful
> configuration?

Sorry, I should say, "the vm host."

I think the claimed utility here is (at least) three-fold:

1. simplified configuration on host and guests
2. some claim to improved security through isolation
3. some expectation of improved latency/performance wrt TCP

Stefan sent a link to a set of slides with his original patches.  Did you get a 
chance to read through those?

[1] 
http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf

Regards,

Matt

> 
> --b.
> 
> > 
> > This now experimentally works.
> > 
> > some notes on running nfs-ganesha over AF_VSOCK:
> > 
> > 1. need stefan hajnoczi's patches for
> > * linux kernel (and build w/vhost-vsock support
> > * qemu (and build w/vhost-vsock support)
> > * nfs-utils (in vm guest)
> > 
> > all linked from https://github.com/stefanha?tab=repositories
> > 
> > 2. host and vm guest kernels must include vhost-vsock
> > * host kernel should load vhost-vsock.ko
> > 
> > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci
> > device, e.g
> > 
> > /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1
> > --enable-kvm -drive
> > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive
> > file=/opt/isos/f22.iso,media=cdrom -net
> > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0
> > -parallel none -serial mon:stdio -device
> > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c
> > 
> > 4. nfs-gansha (in host)
> > * need nfs-ganesha and its ntirpc rpc provider with vsock support
> > https://github.com/linuxbox2/nfs-ganesha (vsock branch)
> > https://github.com/linuxbox2/ntirpc (vsock branch)
> > 
> > * configure ganesha w/vsock support
> > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK
> > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
> > 
> > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block
> > 
> > 5. mount in guest w/nfs41:
> > (e.g., in fstab)
> > 2:// /vsock41 nfs
> > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
> > 0 0
> > 
> > If you try this, send feedback.
> > 
> > Thanks!
> > 
> > Matt
> > 
> > --
> > Matt Benjamin
> > Red Hat, Inc.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> > 
> > http://www.redhat.com/en/technologies/storage
> > 
> > tel.  734-707-0660
> > fax.  734-769-8938
> > cel.  734-216-5309
> > 


nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-16 Thread Matt Benjamin
Hi devs (CC Bruce--here is a use case for vmci sockets transport)

One of Sage's possible plans for Manila integration would use nfs over the new 
Linux vmci sockets transport integration in qemu (below) to access Cephfs via 
an nfs-ganesha server running in the host vm.

This now experimentally works.

some notes on running nfs-ganesha over AF_VSOCK:

1. need stefan hajnoczi's patches for
* linux kernel (and build w/vhost-vsock support)
* qemu (and build w/vhost-vsock support)
* nfs-utils (in vm guest)

all linked from https://github.com/stefanha?tab=repositories

2. host and vm guest kernels must include vhost-vsock
* host kernel should load vhost-vsock.ko

3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci device, 
e.g

/opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1 
--enable-kvm -drive 
file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive 
file=/opt/isos/f22.iso,media=cdrom -net 
nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 -parallel 
none -serial mon:stdio -device 
vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c

4. nfs-ganesha (in host)
* need nfs-ganesha and its ntirpc rpc provider with vsock support
https://github.com/linuxbox2/nfs-ganesha (vsock branch)
https://github.com/linuxbox2/ntirpc (vsock branch)

* configure ganesha w/vsock support
cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK 
-DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src

in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block

5. mount in guest w/nfs41:
(e.g., in fstab)
2:// /vsock41 nfs 
noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
 0 0

If you try this, send feedback.

Thanks!

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309



Re: libcephfs invalidate upcalls

2015-09-28 Thread Matt Benjamin
Hi,

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
> From: "John Spray" <jsp...@redhat.com>
> To: "Matt Benjamin" <mbenja...@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Monday, September 28, 2015 9:01:28 AM
> Subject: Re: libcephfs invalidate upcalls
> 
> On Sat, Sep 26, 2015 at 8:03 PM, Matt Benjamin <mbenja...@redhat.com> wrote:
> > Hi John,
> >
> > I prototyped an invalidate upcall for libcephfs and the Gasesha Ceph fsal,
> > building on the Client invalidation callback registrations.
> >
> > As you suggested, NFS (or AFS, or DCE) minimally expect a more generic
> > "cached vnode may have changed" trigger than the current inode and dentry
> > invalidates, so I extended the model slightly to hook cap revocation,
> > feedback appreciated.
> 
> In cap_release, we probably need to be a bit more discriminating about
> when to drop, e.g. if we've only lost our exclusive write caps, the
> rest of our metadata might all still be fine to cache.  Is ganesha in
> general doing any data caching?  I think I had implicitly assumed that
> we were only worrying about metadata here but now I realise I never
> checked that.

Ganesha isn't caching data currently, though it did once and is likely to again 
at some point.

The exclusive write cap does in fact map directly to NFSv4 delegations, so we 
do want to be able to trigger a recall in this case.

> 
> The awkward part is Client::trim_caps.  In the Client::trim_caps case,
> the lru_is_expirable part won't be true until something has already
> been invalidated, so there needs to be an explicit hook there --
> rather than invalidating in response to cap release, we need to
> invalidate in order to get ganesha to drop its handle, which will
> render something expirable, and finally when we expire it, the cap
> gets released.

Ok, sure.

> 
> In that case maybe we need a hook in ganesha to say "invalidate
> everything you can" so that we don't have to make a very large number
> of function calls to invalidate things.  In the fuse/kernel case we
> can only sometimes invalidate a piece of metadata (e.g. we can't if
> its flocked or whatever), so we ask it to invalidate everything.  But
> perhaps in the NFS case we can always expect our invalidate calls to
> be respected, so we could just invalidate a smaller number of things
> (the difference between actual cache size and desired)?

As you noted above, what we're invalidating is a cache entry.  With Dan's
mdcache work, we might no longer be caching at the Ganesha level, but
I didn't assume that here.

Matt

> 
> John
> 
> >
> > g...@github.com:linuxbox2/ceph.git , branch invalidate
> > g...@github.com:linuxbox2/nfs-ganesha.git , branch ceph-invalidates
> >
> > thanks,
> >
> > Matt
> >
> > --
> > Matt Benjamin
> > Red Hat, Inc.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> >
> > http://www.redhat.com/en/technologies/storage
> >
> > tel.  734-761-4689
> > fax.  734-769-8938
> > cel.  734-216-5309
> >


libcephfs invalidate upcalls

2015-09-26 Thread Matt Benjamin
Hi John,

I prototyped an invalidate upcall for libcephfs and the Ganesha Ceph FSAL, 
building on the Client invalidation callback registrations.

As you suggested, NFS (or AFS, or DCE) minimally expects a more generic "cached 
vnode may have changed" trigger than the current inode and dentry invalidates, 
so I extended the model slightly to hook cap revocation; feedback appreciated.

g...@github.com:linuxbox2/ceph.git , branch invalidate
g...@github.com:linuxbox2/nfs-ganesha.git , branch ceph-invalidates

thanks,

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309



Re: About Fio backend with ObjectStore API

2015-09-12 Thread Matt Benjamin
It would be worth exploring async, sure.

matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309


- Original Message -
> From: "James (Fei) Liu-SSI" <james@ssi.samsung.com>
> To: "Casey Bodley" <cbod...@redhat.com>
> Cc: "Haomai Wang" <haomaiw...@gmail.com>, ceph-devel@vger.kernel.org
> Sent: Friday, September 11, 2015 1:18:31 PM
> Subject: RE: About Fio backend with ObjectStore API
> 
> Hi Casey,
>   You are right. I think the bottleneck is on the fio side rather than the
>   filestore side in this case. fio did not issue the io commands fast
>   enough to saturate the filestore.
>   Here is one possible solution for it: create an async engine, which is
>   normally much faster than a sync engine in fio.
>
>Here is a possible framework. This new Objectstore-AIO engine for fio
>should in theory be much faster than the sync engine. Once we have a fio
>engine that can saturate newstore, memstore and filestore, we can
>investigate in detail where the bottlenecks in their designs are.
> 
> .
> struct objectstore_aio_data {
>   struct aio_ctx *q_aio_ctx;
>   struct aio_completion_data *a_data;
>   aio_ses_ctx_t *p_ses_ctx;
>   unsigned int entries;
> };
> ...
> /*
>  * Note that the structure is exported, so that fio can get it via
>  * dlsym(..., "ioengine");
>  */
> struct ioengine_ops us_aio_ioengine = {
>   .name   = "objectstore-aio",
>   .version= FIO_IOOPS_VERSION,
>   .init   = fio_objectstore_aio_init,
>   .prep   = fio_objectstore_aio_prep,
>   .queue  = fio_objectstore_aio_queue,
>   .cancel = fio_objectstore_aio_cancel,
>   .getevents  = fio_objectstore_aio_getevents,
>   .event  = fio_objectstore_aio_event,
>   .cleanup= fio_objectstore_aio_cleanup,
>   .open_file  = fio_objectstore_aio_open,
>   .close_file = fio_objectstore_aio_close,
> };
> 
> 
> Let me know what you think.
> 
> Regards,
> James
> 
> -Original Message-
> From: Casey Bodley [mailto:cbod...@redhat.com]
> Sent: Friday, September 11, 2015 7:28 AM
> To: James (Fei) Liu-SSI
> Cc: Haomai Wang; ceph-devel@vger.kernel.org
> Subject: Re: About Fio backend with ObjectStore API
> 
> Hi James,
> 
> That's great that you were able to get fio-objectstore running! Thanks to you
> and Haomai for all the help with testing.
> 
> In terms of performance, it's possible that we're not handling the
> completions optimally. When profiling with MemStore I remember seeing a
> significant amount of cpu time spent in polling with
> fio_ceph_os_getevents().
> 
> The issue with reads is more of a design issue than a bug. Because the test
> starts with a mkfs(), there are no objects to read from initially. You would
> just have to add a write job to run before the read job, to make sure that
> the objects are initialized. Or perhaps the mkfs() step could be an optional
> part of the configuration.
> 
> Casey
> 
> - Original Message -
> From: "James (Fei) Liu-SSI" <james@ssi.samsung.com>
> To: "Haomai Wang" <haomaiw...@gmail.com>, "Casey Bodley" <cbod...@redhat.com>
> Cc: ceph-devel@vger.kernel.org
> Sent: Thursday, September 10, 2015 8:08:04 PM
> Subject: RE: About Fio backend with ObjectStore API
> 
> Hi Casey and Haomai,
> 
>   We finally made the fio-objectstore work on our end. Here is fio data
>   against filestore with a Samsung 850 Pro. It is a sequential write and the
>   performance is very poor, which is expected though.
> 
> Run status group 0 (all jobs):
>   WRITE: io=524288KB, aggrb=9467KB/s, minb=9467KB/s, maxb=9467KB/s,
>   mint=55378msec, maxt=55378msec
> 
>   But anyway, it works, even though there are still some bugs to fix, like
>   the read and filesystem issues. Thanks a lot for your great work.
> 
>   Regards,
>   James
> 
>   jamesliu@jamesliu-OptiPlex-7010:~/WorkSpace/ceph_casey/src$ sudo ./fio/fio
>   ./test/objectstore.fio
> filestore: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K,
> ioengine=cephobjectstore, iodepth=1 fio-2.2.9-56-g736a Starting 1 process
> test1
> filestore: Laying out IO file(s) (1 file(s) / 512MB)
> 2015-09-10 16:55:40.614494 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph)
> mkfs in /home/jamesliu/fio_ceph
> 2015-09-10 16:55:40.614924 7f19d34d1840  1 filestore(/home/jamesliu/fio_ceph)

Re: Ceph Hackathon: More Memory Allocator Testing

2015-09-03 Thread Matt Benjamin
We've frequently run fio + libosd (cohort ceph-osd linked as a library) with 
jemalloc preloaded, without problems.
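
i.e., nothing fancier than (library path and job file illustrative):

LD_PRELOAD=/usr/lib64/libjemalloc.so.1 fio ./job.fio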

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
> From: "Daniel Gryniewicz" <d...@redhat.com>
> To: "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Thursday, September 3, 2015 9:06:47 AM
> Subject: Re: Ceph Hackathon: More Memory Allocator Testing
> 
> I believe preloading should work fine.  It has been a common way to
> debug buffer overruns using electric fence and similar tools for
> years, and I have used it in large applications of similar size to
> Ceph.
> 
> Daniel
> 
> On Thu, Sep 3, 2015 at 5:13 AM, Shinobu Kinjo <ski...@redhat.com> wrote:
> >
> > Pre loading jemalloc after compiling with malloc
> >
> > $ cat hoge.c
> > #include <stdlib.h>
> >
> > int main()
> > {
> > int *ptr = malloc(sizeof(int) * 10);
> >
> > if (ptr == NULL)
> > exit(EXIT_FAILURE);
> > free(ptr);
> > }
> >
> >
> > $ gcc ./hoge.c
> >
> >
> > $ ldd ./a.out
> > linux-vdso.so.1 (0x7fffe17e5000)
> > libc.so.6 => /lib64/libc.so.6 (0x7fc989c5f000)
> > /lib64/ld-linux-x86-64.so.2 (0x55a718762000)
> >
> >
> > $ nm ./a.out | grep malloc
> >  U malloc@@GLIBC_2.2.5   // malloc
> >  loaded
> >
> >
> > $ LD_PRELOAD=/usr/lib64/libjemalloc.so.1 \
> > > ldd a.out
> > linux-vdso.so.1 (0x7fff7fd36000)
> > /usr/lib64/libjemalloc.so.1 (0x7fe6ffe39000)// jemallo
> > loaded
> > libc.so.6 => /lib64/libc.so.6 (0x7fe6ffa61000)
> > libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe6ff844000)
> > /lib64/ld-linux-x86-64.so.2 (0x560342ddf000)
> >
> >
> > Logically it could work, but in real world I'm not 100% sure if it works
> > for large scale application.
> >
> > Shinobu
> >
> > - Original Message -
> > From: "Somnath Roy" <somnath@sandisk.com>
> > To: "Alexandre DERUMIER" <aderum...@odiso.com>
> > Cc: "Sage Weil" <s...@newdream.net>, "Milosz Tanski" <mil...@adfin.com>,
> > "Shishir Gowda" <shishir.go...@sandisk.com>, "Stefan Priebe"
> > <s.pri...@profihost.ag>, "Mark Nelson" <mnel...@redhat.com>, "ceph-devel"
> > <ceph-devel@vger.kernel.org>
> > Sent: Sunday, August 23, 2015 2:03:41 AM
> > Subject: RE: Ceph Hackathon: More Memory Allocator Testing
> >
> > Need to see if client is overriding the libraries built with different
> > malloc libraries I guess..
> > I am not sure in your case the benefit you are seeing is because of qemu is
> > more efficient with tcmalloc/jemalloc or the entire client stack ?
> >
> > -Original Message-
> > From: Alexandre DERUMIER [mailto:aderum...@odiso.com]
> > Sent: Saturday, August 22, 2015 9:57 AM
> > To: Somnath Roy
> > Cc: Sage Weil; Milosz Tanski; Shishir Gowda; Stefan Priebe; Mark Nelson;
> > ceph-devel
> > Subject: Re: Ceph Hackathon: More Memory Allocator Testing
> >
> > >>Wanted to know is there any reason we didn't link client libraries with
> > >>tcmalloc at the first place (but did link only OSDs/mon/RGW) ?
> >
> > Do we need to link client librairies ?
> >
> > I'm building qemu with jemalloc , and it's seem to be enough.
> >
> >
> >
> > - Mail original -
> > De: "Somnath Roy" <somnath@sandisk.com>
> > À: "Sage Weil" <s...@newdream.net>, "Milosz Tanski" <mil...@adfin.com>
> > Cc: "Shishir Gowda" <shishir.go...@sandisk.com>, "Stefan Priebe"
> > <s.pri...@profihost.ag>, "aderumier" <aderum...@odiso.com>, "Mark Nelson"
> > <mnel...@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> > Envoyé: Samedi 22 Août 2015 18:15:36
> > Objet: RE: Ceph Hackathon: More Memory Allocator Testing
> >
> > Yes, even today rocksdb also linked with tcmalloc. It doesn't mean all the
> > application using rocksdb needs to be built with tcmalloc.
> > Sage,
> > Wanted to know is there any reason we didn't link client libraries with
> > tcmalloc

handle-based object store

2015-08-24 Thread Matt Benjamin
(11:37:44 AM) mattbenjamin: sjusthm, cbodley:  Casey and I think it might be 
useful to have a short video call on the meet points between object and 
collection handle as we did it, and the other objectstore changes; I don't know 
which aspects really should port over to master, but I think it would be useful 
to do a walk-through and discussion of what parts we could retarget, and 
anything that we could sequence cleanly later.
(11:38:06 AM) mattbenjamin: sjusthm, cbodley: do you have a bit of time 
available?

Some of the pieces we had:

1. the handle interface change itself
2. indexed slots for collection and object handles or ids (unions, iirc) in 
Transaction, and efficient operations to fill slots
3. probably more flexibility than needed in that every OS could completely 
redefine Collection and Object
4. lifecycle and refcounting which worked correctly
5. an Object hierarchy we actually used in our version of filestore, 
w/concurrent LRU system
6. a set of changes by Casey replacing the FDRef system w/management of 
objects--some of this could be useful, I don't know how it maps onto newstore 
at all
7. a unification of ObjectContext and opaque Object which we were debating in 
Oregon
8. thread-local caches of collections and objects above the OS interface that 
appeared to be a big help in IOPs work

ok, apparently we're on for 11:00 am pst--I'll send an invite

Matt 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309



Re: Ceph Hackathon: More Memory Allocator Testing

2015-08-20 Thread Matt Benjamin
Jemalloc 4.0 seems to have some shiny new capabilities, at least.

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
 From: Shinobu Kinjo ski...@redhat.com
 To: Alexandre DERUMIER aderum...@odiso.com
 Cc: Stephen L Blinick stephen.l.blin...@intel.com, Somnath Roy 
 somnath@sandisk.com, Mark Nelson
 mnel...@redhat.com, ceph-devel ceph-devel@vger.kernel.org
 Sent: Thursday, August 20, 2015 8:54:59 AM
 Subject: Re: Ceph Hackathon: More Memory Allocator Testing
 
 Thank you for that result.
 So it might make sense to know difference between jemalloc and jemalloc 4.0.
 
  Shinobu
 
 - Original Message -
 From: Alexandre DERUMIER aderum...@odiso.com
 To: Shinobu Kinjo ski...@redhat.com
 Cc: Stephen L Blinick stephen.l.blin...@intel.com, Somnath Roy
 somnath@sandisk.com, Mark Nelson mnel...@redhat.com, ceph-devel
 ceph-devel@vger.kernel.org
 Sent: Thursday, August 20, 2015 5:17:46 PM
 Subject: Re: Ceph Hackathon: More Memory Allocator Testing
 
 memory results of osd daemon under load,
 
 jemalloc use always more memory than tcmalloc,
 jemalloc 4.0 seem to reduce memory usage but still a little bit more than
 tcmalloc
 
 
 
 osd_op_threads=2 : tcmalloc 2.1
 --
 root  38066  2.3  0.7 1223088 505144 ?  Ssl  08:35   1:32
 /usr/bin/ceph-osd --cluster=ceph -i 4 -f
 root  38165  2.4  0.7 1247828 525356 ?  Ssl  08:35   1:34
 /usr/bin/ceph-osd --cluster=ceph -i 5 -f
 
 
 osd_op_threads=32: tcmalloc 2.1
 --
 
 root  39002  102  0.7 1455928 488584 ?  Ssl  09:41   0:30
 /usr/bin/ceph-osd --cluster=ceph -i 4 -f
 root  39168  114  0.7 1483752 518368 ?  Ssl  09:41   0:30
 /usr/bin/ceph-osd --cluster=ceph -i 5 -f
 
 
 osd_op_threads=2 jemalloc 3.5
 -
 root  18402 72.0  1.1 1642000 769000 ?  Ssl  09:43   0:17
 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
 root  18434 89.1  1.2 1677444 797508 ?  Ssl  09:43   0:21
 /usr/bin/ceph-osd --cluster=ceph -i 1 -f
 
 
 osd_op_threads=32 jemalloc 3.5
 -
 root  17204  3.7  1.2 2030616 816520 ?  Ssl  08:35   2:31
 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
 root  17228  4.6  1.2 2064928 830060 ?  Ssl  08:35   3:05
 /usr/bin/ceph-osd --cluster=ceph -i 1 -f
 
 
 osd_op_threads=2 jemalloc 4.0
 -
 root  19967  113  1.1 1432520 737988 ?  Ssl  10:04   0:31
 /usr/bin/ceph-osd --cluster=ceph -i 1 -f
 root  19976 93.6  1.0 1409376 711192 ?  Ssl  10:04   0:26
 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
 
 
 osd_op_threads=32 jemalloc 4.0
 -
 root  20484  128  1.1 1689176 778508 ?  Ssl  10:06   0:26
 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
 root  20502  170  1.2 1720524 810668 ?  Ssl  10:06   0:35
 /usr/bin/ceph-osd --cluster=ceph -i 1 -f
 
 
 
 - Mail original -
 De: aderumier aderum...@odiso.com
 À: Shinobu Kinjo ski...@redhat.com
 Cc: Stephen L Blinick stephen.l.blin...@intel.com, Somnath Roy
 somnath@sandisk.com, Mark Nelson mnel...@redhat.com, ceph-devel
 ceph-devel@vger.kernel.org
 Envoyé: Jeudi 20 Août 2015 07:29:22
 Objet: Re: Ceph Hackathon: More Memory Allocator Testing
 
 Hi,
 
 jemmaloc 4.0 has been released 2 days agos
 
 https://github.com/jemalloc/jemalloc/releases
 
 I'm curious to see performance/memory usage improvement :)
 
 
 - Mail original -
 De: Shinobu Kinjo ski...@redhat.com
 À: Stephen L Blinick stephen.l.blin...@intel.com
 Cc: aderumier aderum...@odiso.com, Somnath Roy
 somnath@sandisk.com, Mark Nelson mnel...@redhat.com, ceph-devel
 ceph-devel@vger.kernel.org
 Envoyé: Jeudi 20 Août 2015 04:00:15
 Objet: Re: Ceph Hackathon: More Memory Allocator Testing
 
 How about making any sheet for testing patter?
 
 Shinobu
 
 - Original Message -
 From: Stephen L Blinick stephen.l.blin...@intel.com
 To: Alexandre DERUMIER aderum...@odiso.com, Somnath Roy
 somnath@sandisk.com
 Cc: Mark Nelson mnel...@redhat.com, ceph-devel
 ceph-devel@vger.kernel.org
 Sent: Thursday, August 20, 2015 10:09:36 AM
 Subject: RE: Ceph Hackathon: More Memory Allocator Testing
 
 Would it make more sense to try this comparison while changing the size of
 the worker thread pool? i.e. changing osd_op_num_threads_per_shard and
 osd_op_num_shards (default is currently 2 and 5 respectively, for a total
 of 10 worker threads).
 
 Thanks,
 
 Stephen
 
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER
 Sent: Wednesday, August 19, 2015 11:47 AM
 To: Somnath Roy
 Cc: Mark Nelson; ceph-devel
 Subject: Re: Ceph Hackathon: More Memory Allocator Testing
 
 I've just done a small test with jemalloc, changing osd_op_threads

Re: Async reads, sync writes, op thread model discussion

2015-08-14 Thread Matt Benjamin
Hi,

I tend to agree with your comments regarding swapcontext/fibers.  I am not much 
more enamored of jumping to new models (new! frameworks!) as a single jump, 
either.

I like the way I interpreted Sam's design to be going, and in particular, that 
it seems to allow for consistent handling of read, write transactions.  I also 
would like to see how Yehuda's system works before arguing generalities.

My intuition is, since the goal is more deterministic performance over a short 
horizon, you

a. need to prioritize transparency over novel abstractions
b. need to build solid microbenchmarks that encapsulate small, then larger 
pieces of the work pipeline

My .05.

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
 From: Milosz Tanski mil...@adfin.com
 To: Haomai Wang haomaiw...@gmail.com
 Cc: Yehuda Sadeh-Weinraub ysade...@redhat.com, Samuel Just 
 sj...@redhat.com, Sage Weil s...@newdream.net,
 ceph-devel@vger.kernel.org
 Sent: Friday, August 14, 2015 4:56:26 PM
 Subject: Re: Async reads, sync writes, op thread model discussion
 
 On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang haomaiw...@gmail.com wrote:
  On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
  ysade...@redhat.com wrote:
  Already mentioned it on irc, adding to ceph-devel for the sake of
  completeness. I did some infrastructure work for rgw and it seems (at
  least to me) that it could at least be partially useful here.
  Basically it's an async execution framework that utilizes coroutines.
  It's comprised of aio notification manager that can also be tied into
  coroutines execution. The coroutines themselves are stackless, they
  are implemented as state machines, but using some boost trickery to
  hide the details so they can be written very similar to blocking
  methods. Coroutines can also execute other coroutines and can be
  stacked, or can generate concurrent execution. It's still somewhat in
  flux, but I think it's mostly done and already useful at this point,
  so if there's anything you could use it might be a good idea to avoid
  effort duplication.
 
 
  Coroutines like QEMU's are cool. The only thing I'm afraid of is the
  complexity of debugging, and it's really a big task :-(
 
  I agree with sage that this design is really a new implementation of
  objectstore, so it's harmful to the existing objectstore impl. I also
  suffer the pain of sync xattr reads; maybe we could add an async read
  interface to solve this?
 
  As for the context-switch issue, we currently have at least 3 context
  switches per op on the osd side: messenger -> op queue -> objectstore
  queue. I guess op queue -> objectstore is easier to tackle first, just as
  sam said. We could make the journal write inline with queue_transaction,
  so the caller could handle the transaction directly.
 
 I would caution against coroutines (fibers), especially in a multi-threaded
 environment. POSIX officially obsoleted the swapcontext family of functions
 in 1003.1-2004 and removed it in 1003.1-2008, because they were notoriously
 non-portable and buggy. Yes, you can use something like boost::context /
 boost::coroutine instead, but they also have platform limitations. These
 implementations tend to abuse or turn off various platform scrutiny features
 (like the one for setjmp/longjmp), and on top of that many platforms don't
 account for alternative contexts, so you end up with obscure bugs. I've
 debugged my fair share of bugs in Mordor coroutines involving C++ exceptions
 and errno (since errno is really a function on Linux and its output, a
 pointer to the thread's errno, is marked pure) when a coroutine migrates
 threads. And you do need to migrate them, because of blocking and uneven
 processor/thread distribution.
 
 None of these are obstacles that can't be solved, but added together they
 become a pretty long-term liability, so think long and hard about it. Qemu
 doesn't have some of these issues because it uses a single thread and deals
 with a much simpler C ABI.
 
 An alternative to coroutines that goes a long way towards solving the
 callback-spaghetti problem is futures/promises. I'm not talking about the
 bare future model in the C++11 standard library, but something more along
 the lines of what exists in other languages (like what's being done in
 Javascript today). There's a good implementation of it in Folly (the
 Facebook C++11 library), along with a very nice piece of documentation
 explaining how those futures work and how they differ.
 
 That future model is very handy for the callback control-flow problem. You
 can chain a bunch of processing steps that each require some async action,
 return a future, and continue, so on and so forth. It also makes handling
 complex error cases easy by giving you a way to skip lots of processing
 steps and go straight to onError at the end of the chain.
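 
 A minimal sketch of that chaining style (illustrative only: the function
 names here are made up, and this assumes the then()/onError() combinators
 of the Folly futures API of that era):
 
    #include <folly/futures/Future.h>
    #include <string>
 
    // illustrative async steps, not Ceph code: each returns a Future
    // instead of blocking the caller
    folly::Future<std::string> readObject(const std::string& oid) {
      return folly::makeFuture<std::string>("data-for-" + oid);
    }
 
    folly::Future<size_t> writeReplica(const std::string& data) {
      return folly::makeFuture<size_t>(data.size());
    }
 
    // chain the steps; an exception thrown anywhere along the chain skips
    // the remaining then() steps and lands in onError at the end
    folly::Future<size_t> handleOp(const std::string& oid) {
      return readObject(oid)
        .then([](std::string data) { return writeReplica(data); })
        .onError([](const std::exception&) { return size_t(0); });
    }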
 
 Take a look at folly. Take

Re: bufferlist allocation optimization ideas

2015-08-10 Thread Matt Benjamin
We explored a number of these ideas.  We have a few branches that might be 
picked over.

Having said that, our feeling was that the generality to span shared and 
non-shared cases transparently has cost in the unmarked case.  Other aspects of 
the buffer indirection are essential (e.g., Accelio originated buffers, etc).  
We see a large contribution from ptr::release in perf.  One of the main 
aspirations we had was to identify code paths which would never share buffers, 
and not pay for sharing in those paths.

To the degree that bufferlist is frequently used as a kind of flexible string 
class, while other code uses it as a smart tailq of iovec or struct uio, there 
is client code with disjoint assumptions.  As mentioned, shared vs. non-shared 
code paths are similarly disjoint.  I'm not certain what the consequent here 
is.  Ceph code gets a lot of simplification from this idiom, but it is not 
minimalist.

We found ways, as Piotr suggested, to avoid allocations of groups of objects 
related to a message, and this had a lot of impact.  We're trying to merge some 
of that soon.

Matt

- Original Message -
From: Sage Weil sw...@redhat.com
To: Piotr Dałek piotr.da...@ts.fujitsu.com
Cc: ceph-devel@vger.kernel.org
Sent: Monday, August 10, 2015 3:39:56 PM
Subject: RE: bufferlist allocation optimization ideas

On Mon, 10 Aug 2015, Dałek, Piotr wrote:
 This is pretty much a low-level approach; what I was actually wondering is 
 whether we can reduce the amount of memory (de)allocations at a higher level, 
 like improving the message lifecycle logic (from receiving a message to 
 performing the actual operation and finishing it), so it wouldn't involve so 
 many allocations and deallocations. Reducing memory allocation at the low 
 level will help, no doubt about that, but we can probably also improve at a 
 higher level without risking breaking more than we need.

Yes, definitely!  I think we should pursue both...

sage


 
 
 With best regards / Pozdrawiam
 Piotr Dałek
 
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
  ow...@vger.kernel.org] On Behalf Of Sage Weil
  Sent: Monday, August 10, 2015 9:20 PM
  To: ceph-devel@vger.kernel.org
  Subject: bufferlist allocation optimization ideas
  
  Currently putting something in a bufferlist involves 3 allocations:
  
   1. raw buffer (posix_memalign, or new char[])
   2. buffer::raw (this holds the refcount; lifecycle matches the raw buffer
      exactly)
   3. bufferlist's STL list node, which embeds buffer::ptr
  
  --- combine buffer and buffer::raw ---
  
  This should be a pretty simple patch, and turns 2 allocations into one.  
  Most
  buffers are constructed/allocated via buffer::create_*() methods.  Those
  each look something like
  
buffer::raw* buffer::create(unsigned len) {
  return new raw_char(len);
}
  
  where raw_char::raw_char() allocates the actual buffer.  Instead, allocate
  sizeof(raw_char_combined) + len, and use the right magic C++ syntax to call
  the constructor on that memory.  Something like
  
raw_char_combined *foo = new (ptr) raw_char_combined(ptr);
  
  where the raw_char_combined constructor is smart enough to figure out
  that data goes at ptr + sizeof(*this).
  
  That takes us from 3 to 2 allocations.
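  
  To make the combined layout concrete, here is a minimal standalone sketch
  (illustrative only, not the actual Ceph buffer classes; alignment, error
  handling, and the posix_memalign path are ignored):
  
    #include <cstdlib>
    #include <new>
  
    struct raw_char_combined {
      unsigned len;
      int nref;     // refcount lives in the same allocation as the data
      char *data;   // points just past this header
  
      raw_char_combined(unsigned l, char *d) : len(l), nref(0), data(d) {}
  
      // one malloc covers header + payload; placement new constructs the
      // header at the front, and data starts at ptr + sizeof(*this)
      static raw_char_combined *create(unsigned len) {
        void *p = ::malloc(sizeof(raw_char_combined) + len);
        char *payload = static_cast<char*>(p) + sizeof(raw_char_combined);
        return new (p) raw_char_combined(len, payload);
      }
  
      static void destroy(raw_char_combined *r) {
        r->~raw_char_combined();
        ::free(r);
      }
    };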
  
  An open question is whether this is always a good idea, or whether there are
  cases where 2 allocations are better, e.g. when len is exactly one page, and
  we're better off with a mempool allocation for raw and page separately.  Or
  maybe for very large buffers?  I'm really not sure what would be better...
  
  
  --- make bufferlist use boost::intrusive::list ---
  
  Most buffers exist in only one list, so the indirection through the ptr is 
  mostly
  wasted.
  
  1. embed a boost::intrusive::list node into buffer::ptr.  (Note that doing 
  just this buys us nothing... we are just allocating ptr's and using the 
  intrusive node instead of the list node with an embedded ptr.)
  
  2. embed a ptr in buffer::raw (or raw_char_combined)
  
  When adding a buffer to the bufferlist, we use the raw_char_combined's
  embedded ptr if it is available.  Otherwise, we allocate one as before.
  
  This would need some careful adjustment of the common append() paths,
  since they currently are all ptr-based.  One way to make this work well 
  might
  be to embed N ptr's in raw_char_combined, on the assumption that the
  refcount for a buffer is never more than 2 or 3.  Only in extreme cases 
  will we
  need to explicitly allocate ptr's.
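  
  A tiny sketch of that intrusive-list shape (the field set is illustrative,
  not the real buffer::ptr):
  
    #include <boost/intrusive/list.hpp>
  
    namespace bi = boost::intrusive;
  
    struct ptr {
      bi::list_member_hook<> hook;  // list node embedded in the ptr itself
      // ... pointer to raw, offset, length ...
    };
  
    // a bufferlist built on the embedded hooks: push_back() just links the
    // hook, so appending an existing ptr allocates nothing
    using ptr_list = bi::list<ptr,
        bi::member_hook<ptr, bi::list_member_hook<>, &ptr::hook>>;
  
    void append(ptr_list &bl, ptr &p) {
      bl.push_back(p);
    }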
  
  
  Thoughts?
  sage


Re: [michigan-eng] cmake and gitbuilder, juntos

2015-08-03 Thread Matt Benjamin
(fyi, ceph-devel, this was irc discussion about enhancing gitbuilder, a 
temporary blocker for cmake)

(05:36:09 PM) sjusthm: sage mattbenjamin: so that means we should adapt 
gitbuilder to use cmake, right?
(05:36:17 PM) sjusthm: in the immediate term?
(05:36:23 PM) sjusthm: since we want to switch to cmake anyway
(05:36:26 PM) sjusthm: and we need it for C++11?
(05:39:50 PM) mattbenjamin: sjustm:  that sounds correct
snippage
(05:47:44 PM) sjusthm: mattbenjamin: oh, I'd be fine with doing cmake first
(05:47:52 PM) sjusthm: no one actually *likes* messing with automake

- Original Message -
From: Matt Benjamin mbenja...@redhat.com
To: michigan-...@redhat.com
Sent: Monday, August 3, 2015 5:21:38 PM
Subject: [michigan-eng] cmake and gitbuilder, juntos

(04:53:04 PM) mattbenjamin: gitbuilder doesn't understand cmake;  I heard 
someone (sam?) talk about gitbuilder eol--but that's not soon?
(04:54:45 PM) mattbenjamin: this is apropos of:  casey's c++11 change includes 
automake logic to get around that
(04:58:44 PM) sage: mattbenjamin: yeah, we'll need to change all of the build 
tooling (gitbuilder and ceph-build.git) to use cmake
(04:58:54 PM) mattbenjamin: ok
(04:59:01 PM) sage: it'll be a while before we phase out gitbuilder
(04:59:51 PM) joshd: mattbenjamin: gitbuilder just runs a script you give it - 
it has no knowledge of build systems. it'll involve replacing parts of scripts 
like https://github.com/ceph/autobuild-ceph/blob/master/build-ceph.sh
(05:00:48 PM) mattbenjamin: tx
(05:01:01 PM) haomaiwang left the room (quit: Remote host closed the 
connection).



Re: Cluster Network & Public Network w.r.t XIO ?

2015-07-31 Thread Matt Benjamin
Hi Neo,

On our formerly-internal firefly-based branch, what we did was create 
additional Messenger
instances ad infinitum, which at least let you do this, but it's not what 
anybody wanted for
upstream or long-term.  What's upstream now doesn't, IIRC, let you describe 
that.  The rdma_local parameter is, like you say, insufficient (and actually a 
hack).

What we plan to do (and have in progress) is extending work Sage started on 
wip-address, which
will enable multi-homing and identify instances by their transport type(s).  We 
might put more
information there to help with future topologies.  Improved configuration 
language to let you
describe your desired network setup would be packaged with that.

The plan is that an improved situation might arrive as early as J.  If we need 
an interim method,
now would be a good time to start discussion.

Matt

- Original Message -
From: kernel neophyte neophyte.hacker...@gmail.com
To: v...@mellanox.com, raju kurunkad raju.kurun...@sandisk.com, 
ceph-devel@vger.kernel.org
Sent: Thursday, July 30, 2015 11:21:06 PM
Subject: Cluster Network & Public Network w.r.t XIO ?

Hi Vu, Raju,

I am trying to bring up a Ceph cluster on a powerful Dell server with
two 40GbE RoCEv2 NICs.

I have assigned one as my cluster network (I would prefer all OSD
communication to happen on that) and the other as my public network.
This works fine for the simple messenger case (of course, no RDMA).
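
For reference, that split looks roughly like the following in ceph.conf (the
subnets are illustrative placeholders, not taken from this thread):

    [global]
        # client-facing traffic
        public network  = 192.168.1.0/24
        # OSD replication and heartbeat traffic
        cluster network = 192.168.2.0/24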

But when I try to bring this up on XIO, it gets very complicated: how do I
specify two RDMA_LOCAL addresses, one for the cluster network and the other
for the public one? Can I choose XIO for client-to-OSD communication and the
simple messenger for the cluster network?

any thoughts ?

Thanks,
Neo