Re: [ceph-users] Cluster network slower than public network

2017-11-16 Thread Jake Young
On Wed, Nov 15, 2017 at 1:07 PM Ronny Aasen 
wrote:

> On 15.11.2017 13:50, Gandalf Corvotempesta wrote:
>
> As 10gb switches are expansive, what would happen by using a gigabit
> cluster network and a 10gb public network?
>
> Replication and rebalance should be slow, but what about public I/O?
> When a client writes, does the write go over the public network and Ceph
> then replicates it over the cluster network, or is the whole I/O done over
> the public network?
>
>
>
> Public I/O would be slow.
> Each write goes from the client to the primary OSD on the public network,
> is then replicated 2 times to the secondary OSDs over the cluster network,
> and only then is the client informed the block is written.
> The cluster network sees 2x the write traffic of the public network when
> things are OK, and many times the public network's traffic when things are
> recovering or backfilling. I would prioritize the cluster network for the
> highest speed if one could not have 10Gbps on everything.
>


I would seriously consider combining the cluster and public network. It
will simplify your configuration.   It really takes a lot to saturate a 10G
network with Ceph.

If you find that you need to separate your public and cluster networks
later, you can do that in the future.
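
For reference, the split is only a couple of lines in ceph.conf, so moving to
separate networks later is mostly a re-addressing exercise. A minimal sketch
(subnets are placeholders):

[global]
# combined setup: all Ceph traffic uses one network
public network = 10.0.0.0/24

# to split later, add a cluster network and restart the daemons:
# cluster network = 10.0.1.0/24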

Jake

>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-11 Thread Jake Young
On Wed, Oct 11, 2017 at 8:57 AM Jason Dillaman  wrote:

> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López 
> wrote:
>
>> As far as I am able to understand there are 2 ways of setting up iSCSI for
>> ceph:
>>
>> 1- using the kernel (lrbd), only available on SUSE, CentOS, Fedora...
>>
>
> The target_core_rbd approach is only utilized by SUSE (and its derivatives
> like PetaSAN) as far as I know. This was the initial approach for Red
> Hat-derived kernels as well until the upstream kernel maintainers indicated
> that they really do not want a specialized target backend for just krbd.
> The next attempt was to re-use the existing target_core_iblock to interface
> with krbd via the kernel's block layer, but that hit similar upstream walls
> trying to get support for SCSI command passthrough to the block layer.
>
>
>> 2- using userspace (tcmu, ceph-iscsi-config, ceph-iscsi-cli)
>>
>
> The TCMU approach is what upstream and Red Hat-derived kernels will
> support going forward.
>
> The lrbd project was developed by SUSE to assist with configuring a
> cluster of iSCSI gateways via the cli.  The ceph-iscsi-config +
> ceph-iscsi-cli projects are similar in goal but take a slightly different
> approach. ceph-iscsi-config provides a set of common Python libraries that
> can be re-used by ceph-iscsi-cli and ceph-ansible for deploying and
> configuring the gateway. The ceph-iscsi-cli project provides the gwcli tool
> which acts as a cluster-aware replacement for targetcli.
>
>> I don't know which one is better; I am seeing that official support is
>> pointing to tcmu but I haven't done any benchmarking.
>>
>
> We (upstream Ceph) provide documentation for the TCMU approach because
> that is what is available against generic upstream kernels (starting with
> 4.14 when it's out). Since it uses librbd (which still needs to undergo
> some performance improvements) instead of krbd, we know that librbd 4k IO
> performance is slower compared to krbd, but 64k and 128k IO performance is
> comparable. However, I think most iSCSI tuning guides would already tell
> you to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks).
>
>
>> Has anyone tried both? Do they give the same output? Are both able to
>> manage multiple iSCSI targets mapped to a single rbd disk?
>>
>
> Assuming you mean multiple portals mapped to the same RBD disk, the answer
> is yes, both approaches should support ALUA. The ceph-iscsi-config tooling
> will only configure Active/Passive because we believe there are certain
> edge conditions that could result in data corruption if configured for
> Active/Active ALUA.
>
> The TCMU approach also does not currently support SCSI persistent
> reservation groups (needed for Windows clustering) because that support
> isn't available in the upstream kernel. The SUSE kernel has an approach
> that utilizes two round-trips to the OSDs for each IO to simulate PGR
> support. Earlier this summer I believe SUSE started to look into how to get
> generic PGR support merged into the upstream kernel using corosync/dlm to
> synchronize the states between multiple nodes in the target. I am not sure
> of the current state of that work, but it would benefit all LIO targets
> when complete.
>
>
>> I will do my own testing, but if anyone has tried this already it
>> would be really helpful.
>>
>> --
>> *Jorge Pinilla López*
>> jorp...@unizar.es
>> --
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
Thanks Jason!

You should cut and paste that answer into a blog post on ceph.com. It is a
great summary of where things stand.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tunable question

2017-10-03 Thread Jake Young
On Tue, Oct 3, 2017 at 8:38 AM lists  wrote:

> Hi,
>
> What would make the decision easier: if we knew that we could easily
> revert the
>  > "ceph osd crush tunables optimal"
> once it has begun rebalancing data?
>
> Meaning: if we notice that impact is too high, or it will take too long,
> that we could simply again say
>  > "ceph osd crush tunables hammer"
> and the cluster would calm down again?


Yes you can revert the tunables back; but it will then move all the data
back where it was, so be prepared for that.

Verify you have the following values in ceph.conf. Note that these are the
defaults in Jewel, so if they aren’t defined, you’re probably good:
osd_max_backfills=1
osd_recovery_threads=1

You can try to set these at runtime (using ceph tell ... injectargs) if you
notice a large impact to your client performance:
osd_recovery_op_priority=1
osd_recovery_max_active=1
osd_recovery_threads=1
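
A sketch of the injectargs form (Jewel-era syntax; values take effect
immediately but are lost on OSD restart unless they are also in ceph.conf):

ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-max-backfills 1'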

I recall making this tunables change when we went from Hammer to Jewel last
year. It took over 24 hours to rebalance 122TB on our 110-OSD cluster.

Jake


>
> MJ
>
> On 2-10-2017 9:41, Manuel Lausch wrote:
> > Hi,
> >
> > We have similar issues.
> > After upgrading from hammer to jewel the tunable "chooseleaf_stable"
> > was introduced. If we activate it nearly all data will be moved. The
> > cluster has 2400 OSDs on 40 nodes over two datacenters and is filled with
> > 2.5 PB of data.
> >
> > We tried to enable it but the backfill traffic is too high to be
> > handled without impacting other services on the network.
> >
> > Does someone know if it is necessary to enable this tunable? And could
> > it be a problem in the future if we want to upgrade to newer versions
> > without it enabled?
> >
> > Regards,
> > Manuel Lausch
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-20 Thread Jake Young
On Wed, Sep 20, 2017 at 5:31 AM Marc Roos <m.r...@f1-outsourcing.eu> wrote:

>
>
>
> We use these :
> NVDATA Product ID  : SAS9207-8i
> Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308
> PCI-Express Fusion-MPT SAS-2 (rev 05)
>
> Does someone by any chance know how to turn on the drive identification
> lights?
>

storcli64 /c0/e8/s1 start locate

Where c is the controller id, e is the enclosure id, and s is the drive slot.

Look for the PD List section in the output of the command below to see the
enclosure id / slot id list:

 storcli64 /c0 show
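
To turn the LED off again, the counterpart should be (same id scheme; check
the exact verb against your storcli version):

storcli64 /c0/e8/s1 stop locate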


>
>
>
> -Original Message-
> From: Jake Young [mailto:jak3...@gmail.com]
> Sent: dinsdag 19 september 2017 18:00
> To: Kees Meijs; ceph-us...@ceph.com
> Subject: Re: [ceph-users] What HBA to choose? To expand or not to
> expand?
>
>
> On Tue, Sep 19, 2017 at 9:38 AM Kees Meijs <k...@nefos.nl> wrote:
>
>
> Hi Jake,
>
> On 19-09-17 15:14, Jake Young wrote:
> > Ideally you actually want fewer disks per server and more
> servers.
> > This has been covered extensively in this mailing list. Rule of
> thumb
> > is that each server should have 10% or less of the capacity of
> your
> > cluster.
>
> That's very true, but let's focus on the HBA.
>
> > I didn't do extensive research to decide on this HBA, it's simply
> what
> > my server vendor offered. There are probably better, faster,
> cheaper
> > HBAs out there. A lot of people complain about LSI HBAs, but I am
> > comfortable with them.
>
> Given a configuration our vendor offered it's about LSI/Avago
> 9300-8i
> with 8 drives connected individually using SFF8087 on a backplane
> (e.g.
> not an expander). Or, 24 drives using three HBAs (6xSFF8087 in
> total)
> when using a 4HE SuperMicro chassis with 24 drive bays.
>
> But, what are the LSI complaints about? Or, are the complaints
> generic
> to HBAs and/or cryptic CLI tools and not LSI specific?
>
>
> Typically people rant about how much Megaraid/LSI support sucks. I've
> been using LSI or MegaRAID for years and haven't had any big problems.
>
> I had some performance issues with Areca onboard SAS chips (non-Ceph
> setup, 4 disks in a RAID10) and after about 6 months of troubleshooting
> with the server vendor and Areca support they did patch the firmware and
> resolve the issue.
>
>
>
>
> > There is a management tool called storcli that can fully
> configure the
> > HBA in one or two command lines.  There's a command that
> configures
> > all attached disks as individual RAID0 disk groups. That command
> gets
> > run by salt when I provision a new osd server.
>
> The thread I read was about Areca in JBOD but still able to utilise
> the
> cache, if I'm not mistaken. I'm not sure anymore if there was
> something
> mentioned about BBU.
>
>
> JBOD with WB cache would be nice so you can get SMART data directly from
> the disks instead of having to interrogate the HBA for the data.  This
> becomes more important once your cluster is stable and in production.
>
> IMHO if there is unwritten data in a RAM chip, like when you enable WB
> cache, you really, really need a BBU. This is another nice thing about
> using SSD journals instead of HBAs in WB mode, the journaled data is
> safe on the SSD before the write is acknowledged.
>
>
>
>
> >
> > What many other people are doing is using the least expensive
> JBOD HBA
> > or the on board SAS controller in JBOD mode and then using SSD
> > journals. Save the money you would have spent on the fancy HBA
> for
> > fast, high endurance SSDs.
>
> Thanks! And obviously I'm very interested in other comments as
> well.
>
> Regards,
> Kees
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-19 Thread Jake Young
On Tue, Sep 19, 2017 at 9:38 AM Kees Meijs <k...@nefos.nl> wrote:

> Hi Jake,
>
> On 19-09-17 15:14, Jake Young wrote:
> > Ideally you actually want fewer disks per server and more servers.
> > This has been covered extensively in this mailing list. Rule of thumb
> > is that each server should have 10% or less of the capacity of your
> > cluster.
>
> That's very true, but let's focus on the HBA.
>
> > I didn't do extensive research to decide on this HBA, it's simply what
> > my server vendor offered. There are probably better, faster, cheaper
> > HBAs out there. A lot of people complain about LSI HBAs, but I am
> > comfortable with them.
>
> Given a configuration our vendor offered it's about LSI/Avago 9300-8i
> with 8 drives connected individually using SFF8087 on a backplane (e.g.
> not an expander). Or, 24 drives using three HBAs (6xSFF8087 in total)
> when using a 4HE SuperMicro chassis with 24 drive bays.
>
> But, what are the LSI complaints about? Or, are the complaints generic
> to HBAs and/or cryptic CLI tools and not LSI specific?


Typically people rant about how much Megaraid/LSI support sucks. I've been
using LSI or MegaRAID for years and haven't had any big problems.

I had some performance issues with Areca onboard SAS chips (non-Ceph setup,
4 disks in a RAID10) and after about 6 months of troubleshooting with the
server vendor and Areca support they did patch the firmware and resolve the
issue.


>
> > There is a management tool called storcli that can fully configure the
> > HBA in one or two command lines.  There's a command that configures
> > all attached disks as individual RAID0 disk groups. That command gets
> > run by salt when I provision a new osd server.
>
> The thread I read was about Areca in JBOD but still able to utilise the
> cache, if I'm not mistaken. I'm not sure anymore if there was something
> mentioned about BBU.


JBOD with WB cache would be nice so you can get SMART data directly from
the disks instead of having to interrogate the HBA for the data. This becomes
more important once your cluster is stable and in production.

IMHO if there is unwritten data in a RAM chip, like when you enable WB
cache, you really, really need a BBU. This is another nice thing about
using SSD journals instead of HBAs in WB mode, the journaled data is safe
on the SSD before the write is acknowledged.


>
> >
> > What many other people are doing is using the least expensive JBOD HBA
> > or the on board SAS controller in JBOD mode and then using SSD
> > journals. Save the money you would have spent on the fancy HBA for
> > fast, high endurance SSDs.
>
> Thanks! And obviously I'm very interested in other comments as well.
>
> Regards,
> Kees
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-19 Thread Jake Young
On Tue, Sep 19, 2017 at 7:34 AM Kees Meijs  wrote:

> Hi list,
>
> It's probably something to discuss over coffee in Ede tomorrow but I'll
> ask anyway: what HBA is best suitable for Ceph nowadays?
>
> In an earlier thread I read some comments about some "dumb" HBAs running
> in IT mode but still being able to use cache on the HBA. Does that make
> sense? Or is this dangerous, similar to RAID solutions* without a BBU?



Yes, that would be dangerous without a BBU.



>
> (On a side note, we're planning on not using SAS expanders anymore but
> to "wire" each individual disk, e.g. using one SFF8087 per four disks,
> minimising the risk of bus congestion and/or lock-ups.)
>
> Anyway, in short I'm curious about opinions on brand, type and
> configuration of HBA to choose.
>
> Cheers,
> Kees
>
> *: apologies for cursing.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

It depends a lot on how many disks you want per server.

Ideally you actually want fewer disks per server and more servers. This has
been covered extensively in this mailing list. Rule of thumb is that each
server should have 10% or less of the capacity of your cluster.

In my cluster I use the LSI 3108 HBA with 4GB of RAM, BBU and 9 3.5" 2TB
disks in 2U servers. Each disk is configured as a RAID0 disk group so I can
use the write back cache. I chose to use the HBA for write coalescing
rather than using SSD journals. It isn't as fast as SSD journals could be,
but it is cheaper and simpler to install and maintain.

I didn't do extensive research to decide on this HBA, it's simply what my
server vendor offered. There are probably better, faster, cheaper HBAs out
there. A lot of people complain about LSI HBAs, but I am comfortable with
them.

There is a management tool called storcli that can fully configure the HBA
in one or two command lines.  There's a command that configures all
attached disks as individual RAID0 disk groups. That command gets run by
salt when I provision a new osd server.
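
As a rough sketch of what that provisioning step does (the storcli options
here are from memory and may differ between storcli releases, so treat the
exact syntax as an assumption and check it against your version):

# one RAID0 virtual drive per physical disk, write-back cache and read-ahead on
# (controller 0, enclosure 8, slots 0-8 are just example ids)
for slot in $(seq 0 8); do
    storcli64 /c0 add vd type=raid0 drives=8:${slot} wb ra cached
done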

What many other people are doing is using the least expensive JBOD HBA or
the on board SAS controller in JBOD mode and then using SSD journals. Save
the money you would have spent on the fancy HBA for fast, high endurance
SSDs.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph re-ip of OSD node

2017-08-30 Thread Jake Young
Hey Ben,

Take a look at the OSD log of another OSD whose IP you did not change.

What errors does it show related to the re-IP'd OSD?

Is the other OSD trying to communicate with the re-IP'd OSD's old IP
address?
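
A couple of quick things to check (OSD ids and log path are just examples):

# addresses the cluster currently has registered for the moved OSD
ceph osd dump | grep 'osd.14 '

# heartbeat failures logged by a peer OSD that still runs on the old network
grep heartbeat_check /var/log/ceph/ceph-osd.12.log | tail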

Jake


On Wed, Aug 30, 2017 at 3:55 PM Jeremy Hanmer 
wrote:

> This is simply not true. We run quite a few ceph clusters with
> rack-level layer2 domains (thus routing between racks) and everything
> works great.
>
> On Wed, Aug 30, 2017 at 10:52 AM, David Turner 
> wrote:
> > ALL OSDs need to be running the same private network at the same time.
> ALL
> > clients, RGW, OSD, MON, MGR, MDS, etc, etc need to be running on the same
> > public network at the same time.  You cannot do this as a one at a time
> > migration to the new IP space.  Even if all of the servers can still
> > communicate via routing, it just won't work.  Changing the public/private
> > network addresses for a cluster requires full cluster down time.
> >
> > On Wed, Aug 30, 2017 at 11:09 AM Ben Morrice 
> wrote:
> >>
> >> Hello
> >>
> >> We have a small cluster that we need to move to a different network in
> >> the same datacentre.
> >>
> >> My workflow was the following (for a single OSD host), but I failed
> >> (further details below)
> >>
> >> 1) ceph osd set noout
> >> 2) stop ceph-osd processes
> >> 3) change IP, gateway, domain (short hostname is the same), VLAN
> >> 4) change references of OLD IP (cluster and public network) in
> >> /etc/ceph/ceph.conf with NEW IP (see [1])
> >> 5) start a single OSD process
> >>
> >> This seems to work as the NEW IP can communicate with mon hosts and osd
> >> hosts on the OLD network, the OSD is booted and is visible via 'ceph -w'
> >> however after a few seconds the OSD drops with messages such as the
> >> below in it's log file
> >>
> >> heartbeat_check: no reply from 10.1.1.100:6818 osd.14 ever on either
> >> front or back, first ping sent 2017-08-30 16:42:14.692210 (cutoff
> >> 2017-08-30 16:42:24.962245)
> >>
> >> There are logs like the above for every OSD server/process
> >>
> >> and then eventually a
> >>
> >> 2017-08-30 16:42:14.486275 7f6d2c966700  0 log_channel(cluster) log
> >> [WRN] : map e85351 wrongly marked me down
> >>
> >>
> >> Am I missing something obvious to reconfigure the network on a OSD host?
> >>
> >>
> >>
> >> [1]
> >>
> >> OLD
> >> [osd.0]
> >> host = sn01
> >> devs = /dev/sdi
> >> cluster addr = 10.1.1.101
> >> public addr = 10.1.1.101
> >> NEW
> >> [osd.0]
> >> host = sn01
> >> devs = /dev/sdi
> >> cluster addr = 10.1.2.101
> >> public addr = 10.1.2.101
> >>
> >> --
> >> Kind regards,
> >>
> >> Ben Morrice
> >>
> >> __
> >> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> >> EPFL / BBP
> >> Biotech Campus
> >> Chemin des Mines 9
> >> 1202 Geneva
> >> Switzerland
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and IPv4 -> IPv6

2017-06-27 Thread Jake Young
On Tue, Jun 27, 2017 at 2:19 PM Wido den Hollander  wrote:

>
> > Op 27 juni 2017 om 19:00 schreef george.vasilaka...@stfc.ac.uk:
> >
> >
> > Hey Ceph folks,
> >
> > I was wondering what the current status/roadmap/intentions etc. are on
> the possibility of providing a way of transitioning a cluster from IPv4 to
> IPv6 in the future.
> >
> > My current understanding is that this not possible at the moment and
> that one should deploy initially with the version they want long term.
> >
> > However, given the general lack of widespread readiness, I think lots of
> us have deployed with IPv4 and were hoping to go to IPv6 when the rest of
> our environments enabled it.
> >
> > Is adding such a capability to a future version of Ceph being considered?
> >
>
> I think you can, but not without downtime.
>
> The main problem is the monmap which contains IPv4 addresses and you want
> to change that to IPv6.
>
> I haven't tried this, but I think you should be able to:
> - Extract MONMap
> - Update the IPv4 addresses to IPv6 using monmaptool
> - Set noout flag
> - Stop all OSDs
> - Inject new monmap
> - Stop MONs
> - Make sure IPv6 is fixed on MONs
> - Start MONs
> - Start OSDs
>
> Again, this is from the top of my head, haven't tried it, but something
> like that should probably work.
>
> Wido
>
>
> >
> > Best regards,
> >
> > George V.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


I think you could configure all of your mons, osds and clients as
dual-stack (both IPv4 and IPv6) in advance.

Once you have confirmed IPv6 connectivity everywhere, add a new mon using
its IPv6 address.

You would then replace each mon one by one with IPv6 addressed mons.

You can then start to deconfigure the IPv4 interfaces.
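
If you do end up taking the offline route Wido describes, the monmap edit
itself would look roughly like this (mon name "a" and the addresses are made
up, and I haven't tested it):

ceph mon getmap -o /tmp/monmap              # while the cluster is still up
monmaptool --print /tmp/monmap
monmaptool --rm a /tmp/monmap
monmaptool --add a [2001:db8::10]:6789 /tmp/monmap
ceph-mon -i a --inject-monmap /tmp/monmap   # with the mon stopped; repeat per mon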

Just a thought

Jake

> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS7 Mounting Problem

2017-04-10 Thread Jake Young
I've had this issue as well. In my case some or most OSDs on each host do
mount, but a few don't mount or start. (I have 9 OSDs on each host.)

My workaround is to run partprobe on the device that isn't mounted. This
causes the OSD to mount and start automatically. The OSDs then also mount
on subsequent boots.
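
Something like this, where /dev/sdd is one of the OSD disks that didn't come
up (device names are just an example):

partprobe /dev/sdd
# if it still doesn't start, activating the data partition by hand works too
ceph-disk activate /dev/sdd1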

I couldn't find any info in any logs about the osds that don't mount.
There was no difference in the output of the commands Xavier posted between
a working osd and one that didn't mount at boot time.

Jake


On Mon, Apr 10, 2017 at 5:25 PM David Turner  wrote:

> The main issue I see with osds not automatically mounting and starting is
> the partition ID of the OSD and journals are not set to the GUID expected
> by the udev rules for OSDs and journals.  Running ceph-disk activate-all
> might give you more information as to why the OSDs aren't mounting
> properly.  That's the command that is run when your system boots up.  You
> also want to make sure the the right type of file is touched on your osds
> (upstart, systemd, etc) to indicate which service manager should try to
> start the osd.
>
> On Mon, Apr 10, 2017 at 4:43 PM Georgios Dimitrakakis <
> gior...@acmac.uoc.gr> wrote:
>
>  Hi Xavier,
>
>  I still have the entries in my /etc/fstab file and what I did to solve
>  the problem was to enable on all nodes the service
>  "ceph-osd@XXX.service" where "XXX" is the OSD number.
>
>  I don't know the reason why this was initially disabled in my
>  installation...
>
>  As for the "ceph-disk list" command you were referring to it showed
>  correctly the results for my disks e.g.:
>  /dev/sdd :
>   /dev/sdd2 ceph journal, for /dev/sdd1
>   /dev/sdd1 ceph data, active, cluster ceph, osd.1, journal /dev/sdd2
>
>
>  Unfortunately I couldn't run "udevadm" correctly...I must be missing
>  something...
>
>  # udevadm test -h $(udevadm info -q path /dev/sdd)
>  calling: test
>  version 219
>  udevadm test OPTIONS 
>
>  Test an event run.
>-h --helpShow this help
>   --version Show package version
>-a --action=ACTION   Set action string
>-N --resolve-names=early|late|never  When to resolve names
>
>
>
>  Best,
>
>  G.
>
>
>
> > Hi Georgios,
> >
> > I've had a few issues with automatic mounting on CentOS two months ago,
> > and here are a few tips on how we got automatic mounting running with no
> > entries in the fstab. The versions for my test are CentOS 7.1 with
> > Ceph Hammer, kernel 3.10.0-229 and udev/systemd 208.
> >
> > First, I strongly recommend using `ceph-disk list` as a first test. If
> > all goes well the output should look like this:
> >
> > [root@ceph-test ~]# ceph-disk list
> > /dev/sda :
> >  /dev/sda1 other, xfs, mounted on /boot
> >  /dev/sda2 other, LVM2_member
> > /dev/sdb :
> >  /dev/sdb1 ceph journal, for /dev/sdd1
> >  /dev/sdb2 ceph journal, for /dev/sde1
> >  /dev/sdb3 ceph journal, for /dev/sdc1
> > /dev/sdc :
> >  /dev/sdc1 ceph data, active, cluster ceph, osd.2, journal /dev/sdb3
> > /dev/sdd :
> >  /dev/sdd1 ceph data, active, cluster ceph, osd.1, journal /dev/sdb1
> > /dev/sde :
> >  /dev/sde1 ceph data, active, cluster ceph, osd.0, journal /dev/sdb2
> >
> > If the partitions are not detected as ceph data/journal, then your
> > partitions' type UUIDs are not set properly; this is important for the
> > Ceph udev rules to work. And if the data-journal associations are not
> > displayed, you might want to check that the "journal" symlink and
> > "journal_uuid" files in the OSD directory are correct and pointing to
> > the right device. That's if you're using separate partitions as
> > journals, of course.
> >
> > Then `udevadm` can help you see what exactly is going on in the udev
> > rule when it's run. Try:
> > udevadm test -h $(udevadm info -q path /dev/sdc)
> > (or any other device that's used as data for OSDs)
> >
> > This command should show you a full log of the events. In our case,
> > the failure was due to a missing keyring file that made the
> > `ceph-disk-activate` call from 95-ceph-osd.rules fail.
> >
> > Finally, you might also want to try using
> > 60-ceph-partuuid-workaround.rules instead of
> > 60-ceph-by-parttypeuuid.rules if it's the latter that is used in your
> > system. The `udevadm test` log should give good clues as to whether
> > that's the issue or not.
> >
> > Kind Regards,
> > --
> >
> > Xavier Villaneau
> >
> > Software Engineer, Concurrent Computer Corporation
> >
> > On Sat, Apr 1, 2017 at 4:47 AM Georgios Dimitrakakis  wrote:
> >
> >>  Hi,
> >>
> >>  just to provide some more feedback on this one and what I've done to
> >>  solve it, although I'm not sure if this is the most "elegant" solution.
> >>
> >>  I have manually added to /etc/fstab on all systems the respective mount
> >>  points for the Ceph OSDs, e.g. entries like this:
> >>
> >>  UUID=9d2e7674-f143-48a2-bb7a-1c55b99da1f7
> >> /var/lib/ceph/osd/ceph-0 xfs
> >>  defaults

Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-07 Thread Jake Young
I use 2U servers with 9x 3.5" spinning disks in each. This has scaled well
for me, in both performance and  budget.

I may add 3 more spinning disks to each server at a later time if I need to
maximize storage, or I may add 3 SSDs for journals/cache tier if we need
better performance.

Another consideration is failure domain. If you had a server crash, how
much of your cluster will go down?  Some good advice I've read on this
forum is no single OSD server should be more than 10% of the cluster.

I had taken a week off and one of my 12 OSD servers had an OS SD card fail,
which took down the server. No one even noticed it went down. None of the
VM clients had any performance issues and no data was lost (3x
replication). I have the recovery settings turned down as low as possible,
and even so it only took about 6 hours to rebuild.
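
For reference, the usual knobs for turning recovery down look something like
this in ceph.conf (values are illustrative, not a recommendation):

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1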

Speaking of rebuilding, do your performance measurements during a rebuild.
This has been the time when the cluster is the most stressed and when
performance is the most important.

There's a lot to think about. Read through the archives of this mailing
list, there is a lot of useful advice!

Jake


On Sat, Jan 7, 2017 at 1:38 PM Maged Mokhtar  wrote:

>
>
> Adding more nodes is best if you have unlimited budget :)
> You should add more osds per node until you start hitting cpu or network
> bottlenecks. Use a perf tool like atop/sysstat to know when this happens.
>
>
>
>
>  Original message 
> From: kevin parrikar 
> Date: 07/01/2017 19:56 (GMT+02:00)
> To: Lionel Bouton 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Analysing ceph performance with SSD journal,
> 10gbe NIC and 2 replicas -Hammer release
>
> Wow, that's a lot of good information. I wish I had known about all this
> before investing in all these devices. Since I don't have any other option,
> I will get better SSDs and faster HDDs.
> I have one more generic question about Ceph.
> To increase the throughput of a cluster, what is the standard practice: more
> OSDs "per" node, or more OSD "nodes"?
>
> Thanks a lot for all your help. I learned so many new things; thanks again.
>
> Kevin
>
> On Sat, Jan 7, 2017 at 7:33 PM, Lionel Bouton
> <lionel-subscript...@bouton.name> wrote:
>
> Le 07/01/2017 à 14:11, kevin parrikar a écrit :
>
> Thanks for your valuable input.
>
> We were using these SSD in our NAS box (Synology) and it was giving 13k
> iops for our fileserver in RAID1. We had a few spare disks which we added
> to our ceph nodes hoping that it will give good performance, same as that
> of the NAS box. (I am not comparing NAS with ceph, just the reason why we
> decided to use these SSD.)
>
> We don't have S3520 or S3610 at the moment but can order one of these to
> see how it performs in ceph. We have 4x S3500 80GB handy.
>
> If I create a 2 node cluster with 2x S3500 each and with a replica of 2,
> do you think it can deliver 24MB/s of 4k writes?
>
> Probably not. See
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> According to the page above the DC S3500 reaches 39MB/s. Its capacity
> isn't specified, yours are 80GB only which is the lowest capacity I'm
> aware of, and for all DC models I know of the speed goes down with the
> capacity, so you probably will get lower than that.
>
> If you put both data and journal on the same device you cut your
> bandwidth in half: so this would give you an average <20MB/s per OSD
> (with occasional peaks above that if you don't have a sustained 20MB/s).
> With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a
> single stream of data you will only get <20MB/s though (you won't benefit
> from parallel writes to the 4 OSDs and will only write on 2 at a time).
>
> Note that by comparison the 250GB 840 EVO only reaches 1.9MB/s.
>
> But even if you reach the 40MB/s, these models are not designed for heavy
> writes; you will probably kill them long before their warranty is expired
> (IIRC these are rated for ~24GB of writes per day over the warranty
> period). In your configuration you only have to write 24G each day (as
> you have 4 of them, write both to data and journal and size=2) to be in
> this situation (this is an average of only 0.28 MB/s compared to your
> 24 MB/s target).
>
> We bought S3500 because last time when we tried ceph, people were
> suggesting this model :) :)
>
> The 3500 series might be enough with the higher capacities in some rare
> cases but the 80GB model is almost useless.
>
> You have to do the math considering:
>
> - how much you will write to the cluster (guess high if you have to guess),
> - if you will use the SSD for both journals and data (which means writing
>   twice on them),
> - your replication level (which means you will write multiple times the
>   same data),
> - when you expect to replace the hardware,
> - the amount

Re: [ceph-users] tgt+librbd error 4

2016-12-18 Thread Jake Young
It's running as a guest in a Linux hypervisor.

I'm mapping rbd disks attached to a virtual scsi adaptor (so they can be
added and removed).

I've configured FreeNAS to just share each disk as an iSCSI LUN, rather
than configuring a ZFS pool with the disks.

Jake

On Sun, Dec 18, 2016 at 8:37 AM Bruno Silva <bemanuel...@gmail.com> wrote:

> But FreeNAS is based on FreeBSD.
>
>
>
> Em dom, 18 de dez de 2016 00:40, ZHONG <desert...@icloud.com> escreveu:
>
> Thank you for your reply.
>
> On 17 Dec 2016, at 22:21, Jake Young <jak3...@gmail.com> wrote:
>
> FreeNAS running in KVM Linux hypervisor
>
>
> ___
>
>
> ceph-users mailing list
>
>
> ceph-users@lists.ceph.com
>
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt+librbd error 4

2016-12-17 Thread Jake Young
I don't have the specific crash info, but I have seen crashes with tgt when
the ceph cluster was slow to respond to IO.

It was things like this that pushed me to using another iSCSI to Ceph
solution (FreeNAS running in KVM Linux hypervisor).

Jake

On Fri, Dec 16, 2016 at 9:16 PM ZHONG  wrote:

> Hi All,
>
> I'm using tgt (1.0.55) + librbd (Hammer 0.94.5) for an iSCSI service.
> Recently I encountered a problem: tgt crashes even when it is under no
> pressure. The exception information is as follows: "kernel: tgtd[52067]:
> segfault at 0 ip 7f424cb0d76a sp 7f4228fe0b90 error 4 in
> librbd.so.1.0.0[7f424c9b9000+54b000]". Has anyone encountered similar
> problems? Thank you!
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking for a definition for some undocumented variables

2016-12-12 Thread Jake Young
Thanks John,

To partially answer my own question:

OPTION(osd_recovery_sleep, OPT_FLOAT, 0) // seconds to sleep between
recovery ops

OPTION(osd_recovery_max_single_start, OPT_U64, 1)

Funny, in the examples where I've seen osd_recovery_max_single_start it is
being set to 1, which is the default.
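
You can also confirm the live value on a running OSD through the admin socket
(osd.0 is just an example):

ceph daemon osd.0 config get osd_recovery_sleep
ceph daemon osd.0 config show | grep recovery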


On Mon, Dec 12, 2016 at 12:26 PM, John Spray <jsp...@redhat.com> wrote:

> On Mon, Dec 12, 2016 at 5:23 PM, Jake Young <jak3...@gmail.com> wrote:
> > I've seen these referenced a few times in the mailing list, can someone
> > explain what they do exactly?
> >
> > What are the defaults for these values?
> >
> > osd recovery sleep
> >
> > and
> >
> > osd recovery max single start
>
> Aside from the definition, you can always check default values here:
> https://github.com/ceph/ceph/blob/master/src/common/config_opts.h
>
> John
>
> > Thanks!
> >
> > Jake
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Looking for a definition for some undocumented variables

2016-12-12 Thread Jake Young
I've seen these referenced a few times in the mailing list, can someone
explain what they do exactly?

What are the defaults for these values?

osd recovery sleep

and

osd recovery max single start

Thanks!

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] problem after reinstalling system

2016-12-08 Thread Jake Young
Hey Dan,

I had the same issue that Jacek had after changing my OS  and Ceph version
from Ubuntu 14 - Hammer to Centos 7 - Jewel. I was also able to recover
from the failure by renaming the .ldb files to .sst files.

Do you know why this works?

Is it just because leveldb changed the file naming standard and it isn't
backwards compatible with the older version on Centos?


Jake


On Mon, Dec 14, 2015 at 5:09 AM Jacek Jarosiewicz <
jjarosiew...@supermedia.pl> wrote:

> On 12/10/2015 02:56 PM, Jacek Jarosiewicz wrote:
>
> > On 12/10/2015 02:50 PM, Dan van der Ster wrote:
>
> >> On Wed, Dec 9, 2015 at 1:25 PM, Jacek Jarosiewicz
>
> >>  wrote:
>
> >>> 2015-12-09 13:11:51.171377 7fac03c7f880 -1
>
> >>> filestore(/var/lib/ceph/osd/ceph-5) Error initializing leveldb :
>
> >>> Corruption:
>
> >>> 29 missing files; e.g.:
> /var/lib/ceph/osd/ceph-5/current/omap/046388.sst
>
> >>
>
> >> Did you have .ldb files? If so, this should make it work:
>
> >>
>
> >> rename -v .ldb .sst /var/lib/ceph/osd/ceph-5/current/omap/*.ldb
>
> >>
>
> >> Cheers, Dan
>
> >>
>
> >
>
> > I will try that after reinstalling next node, I had to act quickly and
>
> > this one is allready backfilling :)
>
> >
>
> > J
>
> >
>
>
>
> Renaming files did the trick! Thanks!
>
>
>
> J
>
>
>
> --
>
> Jacek Jarosiewicz
>
> Administrator Systemów Informatycznych
>
>
>
>
> 
>
> SUPERMEDIA Sp. z o.o. z siedzibą w Warszawie
>
> ul. Senatorska 13/15, 00-075 Warszawa
>
> Sąd Rejonowy dla m.st.Warszawy, XII Wydział Gospodarczy Krajowego
>
> Rejestru Sądowego,
>
> nr KRS 029537; kapitał zakładowy 42.756.000 zł
>
> NIP: 957-05-49-503
>
> Adres korespondencyjny: ul. Jubilerska 10, 04-190 Warszawa
>
>
>
>
> 
>
> SUPERMEDIA ->   http://www.supermedia.pl
>
> dostep do internetu - hosting - kolokacja - lacza - telefonia
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMWare

2016-10-07 Thread Jake Young
Hey Patrick,

I work for Cisco.

We have a 200TB cluster (108 OSDs on 12 OSD Nodes) and use the cluster for
both OpenStack and VMware deployments.

We are using iSCSI now, but it really would be much better if VMware did
support RBD natively.

We present a 1-2TB Volume that is shared between 4-8 ESXi hosts.

I have been looking for an optimal solution for a few years now, and I have
finally found something that works pretty well:

We are installing FreeNAS on a KVM hypervisor and passing through rbd
volumes as disks on a SCSI bus. We are able to add volumes dynamically (no
need to reboot FreeNAS to recognize new drives).  In FreeNAS, we are
passing the disks through directly as iscsi targets, we are not putting the
disks into a ZFS volume.
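
For anyone curious, the per-volume wiring is just a network disk on a
virtio-scsi controller; a minimal libvirt-style sketch (the pool/image name,
monitor address and secret UUID are placeholders, and your exact setup may
differ):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='rbd/freenas-lun1'>
    <host name='10.0.0.1' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <target dev='sdb' bus='scsi'/>
</disk>

Hot-adding another volume is then just "virsh attach-device <domain> disk.xml
--live" with a new target dev.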

The biggest benefit to this is that VMware really likes the FreeBSD target
and all VAAI stuff works reliably. We also get the benefit of the stability
of rbd in QEMU client.

My next step is to create a redundant KVM host with a redundant FreeNAS VM
and see how iscsi multipath works with the ESXi hosts.

We have tried many different things and have run into all the same issues
as others have posted on this list. The general theme seems to be that most
(all?) Linux iSCSI Target software and Linux NFS solutions are not very
good. The BSD OS's (FreeBSD, Solaris derivatives, etc.) do these things a
lot better, but typically lack Ceph support as well as having poor HW
compatibility (compared to Linux).

Our goal has always been to replace FC SAN with something comparable in
performance, reliability and redundancy.

Again, the best thing in the world would be for ESXi to mount rbd volumes
natively using librbd. I'm not sure if VMware is interested in this though.

Jake


On Wednesday, October 5, 2016, Patrick McGarry  wrote:

> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-26 Thread Jake Young
On Thursday, July 21, 2016, Mike Christie <mchri...@redhat.com> wrote:

> On 07/21/2016 11:41 AM, Mike Christie wrote:
> > On 07/20/2016 02:20 PM, Jake Young wrote:
> >>
> >> For starters, STGT doesn't implement VAAI properly and you will need to
> >> disable VAAI in ESXi.
> >>
> >> LIO does seem to implement VAAI properly, but performance is not nearly
> >> as good as STGT even with VAAI's benefits. The assumption for the cause
> >> is that LIO currently uses kernel rbd mapping and kernel rbd performance
> >> is not as good as librbd.
> >>
> >> I recently did a simple test of creating an 80GB eager zeroed disk with
> >> STGT (VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and
> >> found that STGT was actually slightly faster.
> >>
> >> I think we're all holding our breath waiting for LIO librbd support via
> >> TCMU, which seems to be right around the corner. That solution will
> >
> > Is there a thread for that?


Not a thread, but it has come up a few times...  Maybe I'm getting ahead of
myself. I can't wait for this solution to be available.


> >
> >> combine the performance benefits of librbd with the more feature-full
> >> LIO iSCSI interface. The lrbd configuration tool for LIO from SUSE is
> >> pretty cool and it makes configuring LIO easier than STGT.
> >>
> >
> > I wrote a tcmu rbd driver a while back. It is based on gpl2 code, so
> > Andy could not take it into tcmu. I attached it here if you want to play
> > with it.
> >
>
> Here it is attached in patch form built against the current tcmu code.
>
> I have not tested it since March, so if there have been major changes to
> the tcmu code there might be issues.
>
> You should only use this for testing. I wrote it up in a night. I have
> done very little testing.
>
> It only supports READ, WRITE, DISCARD/UNMAP, TUR, MODE_SENSE/SELECT, and
> SYNC_CACHE.
>

Thanks for this!  I was able to patch and compile without errors.

I'm having trouble using it though. Does it require targetcli-fb?  This
should show up as a "User: rbd" backstore, right?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
I think the answer is that with 1 thread you can only ever write to one
journal at a time. Theoretically, you would need 10 threads to be able to
write to 10 nodes at the same time.
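
You can see this directly with rados bench by changing only the thread count
(pool name assumed to be rbd):

rados bench -p rbd 60 write -b 4M -t 1    # one write in flight: bound by a single journal
rados bench -p rbd 60 write -b 4M -t 10   # ten in flight: spread across OSDs/journals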

Jake

On Thursday, July 21, 2016, w...@globe.de  wrote:

> What I don't really understand is:
>
> Lets say the Intel P3700 works with 200 MByte/s rados bench one thread...
> See Nicks results below...
>
> If we have multiple OSD Nodes. For example 10 Nodes.
>
> Every Node has exactly 1x P3700 NVMe built in.
>
> Why is the single Thread performance exactly at 200 MByte/s on the rbd
> client with 10 OSD Node Cluster???
>
> I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.
>
>
> Everyone look yourself at your cluster.
>
> dstat -D sdb,sdc,sdd,sdX 
>
> You will see that Ceph stripes the data over all OSD's in the cluster if
> you test at the client side with rados bench...
>
> *rados bench -p rbd 60 write -b 4M -t 1*
>
>
>
> On 21.07.16 at 14:38, w...@globe.de wrote:
>
> Is there not a way to enable the Linux page cache? So as to not use D_Sync...
>
> Then performance would improve dramatically.
>
>
> On 21.07.16 at 14:33, Nick Fisk wrote:
>
> -Original Message-
> From: w...@globe.de  [
> mailto:w...@globe.de ]
> Sent: 21 July 2016 13:23
> To: n...@fisk.me.uk ;
> 'Horace Ng' 
> 
> Cc: ceph-users@lists.ceph.com
> 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Okay and what is your plan now to speed up ?
>
> Now I have come up with a lower latency hardware design, there is not much
> further improvement until persistent RBD caching is implemented, as you
> will be moving the SSD/NVME closer to the client. But I'm happy with what I
> can achieve at the moment. You could also experiment with bcache on the
> RBD.
>
> Would it help to put in multiple P3700 per OSD Node to improve performance
> for a single Thread (example Storage VMotion) ?
>
> Most likely not, it's all the other parts of the puzzle which are causing
> the latency. ESXi was designed for storage arrays that service IO's in
> 100us-1ms range, Ceph is probably about 10x slower than this, hence the
> problem. Disable the BBWC on a RAID controller or SAN and you will the same
> behaviour.
>
> Regards
>
>
> On 21.07.16 at 14:17, Nick Fisk wrote:
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] On
> Behalf
> Of w...@globe.de 
> Sent: 21 July 2016 13:04
> To: n...@fisk.me.uk ;
> 'Horace Ng' 
> 
> Cc: ceph-users@lists.ceph.com
> 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi,
>
> hmm i think 200 MByte/s is really bad. Is your Cluster in production right
> now?
>
> It's just been built, not running yet.
>
> So if you start a storage migration you get only 200 MByte/s right?
>
> I wish. My current cluster (not this new one) would storage migrate at
> ~10-15MB/s. Serial latency is the problem, without being able to
> buffer, ESXi waits on an ack for each IO before sending the next. Also it
> submits the migrations in 64kb chunks, unless you get VAAI
>
> working. I think esxi will try and do them in parallel, which will help as
> well.
>
> I think it would be awesome if you get 1000 MByte/s
>
> Where is the Bottleneck?
>
> Latency serialisation, without a buffer, you can't drive the devices
> to 100%. With buffered IO (or high queue depths) I can max out the
> journals.
>
> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the
> P3700.
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
> -ssd-is-suitable-as-a-journal-device/
>
> How could it be that the rbd client performance is 50% slower?
>
> Regards
>
>
> On 21.07.16 at 12:15, Nick Fisk wrote:
>
> I've had a lot of pain with this, smaller block sizes are even worse.
> You want to try and minimize latency at every point as there is no
> buffering happening in the iSCSI stack. This means:-
>
> 1. Fast journals (NVME or NVRAM)
> 2. 10GB or better networking
> 3. Fast CPU's (Ghz)
> 4. Fix CPU c-state's to C1
> 5. Fix CPU's Freq to max
>
> Also I can't be sure, but I think there is a metadata update
> happening with VMFS, particularly if you are using thin VMDK's, this
> can also be a major bottleneck. For my use case, I've switched over to NFS
> as it has given much more performance at scale and
>
> less headache.
>
> For the RADOS Run, here you go 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
My workaround to your single threaded performance issue was to increase the
thread count of the tgtd process (I added --nr_iothreads=128 as an argument
to tgtd).  This does help my workload.
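
The same setting can also live in the tgt config file instead of on the
command line, e.g.:

# /etc/tgt/tgtd.conf
include /etc/tgt/targets.conf
include /etc/tgt/conf.d/*.conf

nr_iothreads=128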

FWIW below are my rados bench numbers from my cluster with 1 thread:

This first one is a "cold" run. This is a test pool, and it's not in use.
This is the first time I've written to it in a week (but I have written to
it before).

Total time run: 60.049311
Total writes made:  1196
Write size: 4194304
Bandwidth (MB/sec): 79.668

Stddev Bandwidth:   80.3998
Max bandwidth (MB/sec): 208
Min bandwidth (MB/sec): 0
Average Latency:0.0502066
Stddev Latency: 0.47209
Max latency:12.9035
Min latency:0.013051

This next one is the 6th run. I honestly don't understand why there is such
a huge performance difference.

Total time run: 60.042933
Total writes made:  2980
Write size: 4194304
Bandwidth (MB/sec): 198.525

Stddev Bandwidth:   32.129
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 0
Average Latency:0.0201471
Stddev Latency: 0.0126896
Max latency:0.265931
Min latency:0.013211


75 OSDs, all 2TB SAS spinners.  There are 9 OSD servers each has a 2GB
BBU RAID cache.

I have tuned my CPU c-state and freq to max, I have 8x 2.5MHz cores, so
just about one core per OSD. I have 40G networking.  I don't use journals,
but I have the RAID cache enabled.


Nick,

What NFS server are you using?

Jake


On Thursday, July 21, 2016, Nick Fisk  wrote:

> I've had a lot of pain with this, smaller block sizes are even worse. You
> want to try and minimize latency at every point as there
> is no buffering happening in the iSCSI stack. This means:-
>
> 1. Fast journals (NVME or NVRAM)
> 2. 10GB or better networking
> 3. Fast CPU's (Ghz)
> 4. Fix CPU c-state's to C1
> 5. Fix CPU's Freq to max
>
> Also I can't be sure, but I think there is a metadata update happening
> with VMFS, particularly if you are using thin VMDK's, this
> can also be a major bottleneck. For my use case, I've switched over to NFS
> as it has given much more performance at scale and less
> headache.
>
> For the RADOS Run, here you go (400GB P3700):
>
> Total time run: 60.026491
> Total writes made:  3104
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 206.842
> Stddev Bandwidth:   8.10412
> Max bandwidth (MB/sec): 224
> Min bandwidth (MB/sec): 180
> Average IOPS:   51
> Stddev IOPS:2
> Max IOPS:   56
> Min IOPS:   45
> Average Latency(s): 0.0193366
> Stddev Latency(s):  0.00148039
> Max latency(s): 0.0377946
> Min latency(s): 0.015909
>
> Nick
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] On Behalf Of Horace
> > Sent: 21 July 2016 10:26
> > To: w...@globe.de 
> > Cc: ceph-users@lists.ceph.com 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi,
> >
> > Same here, I've read some blog saying that vmware will frequently verify
> the locking on VMFS over iSCSI, hence it will have much
> > slower performance than NFS (with different locking mechanism).
> >
> > Regards,
> > Horace Ng
> >
> > - Original Message -
> > From: w...@globe.de 
> > To: ceph-users@lists.ceph.com 
> > Sent: Thursday, July 21, 2016 5:11:21 PM
> > Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi everyone,
> >
> > we see at our cluster relatively slow Single Thread Performance on the
> iscsi Nodes.
> >
> >
> > Our setup:
> >
> > 3 Racks:
> >
> > 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache
> off).
> >
> > 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> > Red 1TB per Data Node as OSD.
> >
> > Replication = 3
> >
> > chooseleaf = 3 type Rack in the crush map
> >
> >
> > We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> >
> > rados bench -p rbd 60 write -b 4M -t 1
> >
> >
> > If we test with:
> >
> > rados bench -p rbd 60 write -b 4M -t 32
> >
> > we get ca. 600 - 700 MByte/s
> >
> >
> > We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> > the Journal to get better Single Thread Performance.
> >
> > Is anyone of you out there who has an Intel P3700 for Journal an can
> > give me back test results with:
> >
> >
> > rados bench -p rbd 60 write -b 4M -t 1
> >
> >
> > Thank you very much !!
> >
> > Kind Regards !!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > 

Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jake Young
On Wednesday, July 20, 2016, Jan Schermer  wrote:

>
> > On 20 Jul 2016, at 18:38, Mike Christie  > wrote:
> >
> > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
> >>
> >> Hi Mike,
> >>
> >> Thanks for the update on the RHCS iSCSI target.
> >>
> >> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
> >> it too early to say / announce).
> >
> > No HA support for sure. We are looking into non HA support though.
> >
> >>
> >> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
> >> so we'll just have to remap RBDs to RHCS targets when it's available.
> >>
> >> So we're currently running :
> >>
> >> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
> >> has all VAAI primitives enabled and run the same configuration.
> >> - RBD images are mapped on each target using the kernel client (so no
> >> RBD cache).
> >> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
> >> but in a failover manner so that each ESXi always access the same LUN
> >> through one target at a time.
> >> - LUNs are VMFS datastores and VAAI primitives are enabled client side
> >> (except UNMAP as per default).
> >>
> >> Do you see anthing risky regarding this configuration ?
> >
> > If you use a application that uses scsi persistent reservations then you
> > could run into troubles, because some apps expect the reservation info
> > to be on the failover nodes as well as the active ones.
> >
> > Depending on the how you do failover and the issue that caused the
> > failover, IO could be stuck on the old active node and cause data
> > corruption. If the initial active node looses its network connectivity
> > and you failover, you have to make sure that the initial active node is
> > fenced off and IO stuck on that node will never be executed. So do
> > something like add it to the ceph monitor blacklist and make sure IO on
> > that node is flushed and failed before unblacklisting it.
> >
>
> With iSCSI you can't really do hot failover unless you only use
> synchronous IO.


VMware does only use synchronous IO. Since the hypervisor can't tell what
type of data the VMs are writing, all IO is treated as needing to be
synchronous.

> (with any of the open-source target software available).
> Flushing the buffers doesn't really help because you don't know which
> in-flight IO happened before the outage
> and which didn't. You could end up with only part of the "transaction"
> written to persistent storage.
>
> If you only use synchronous IO all the way from client to the persistent
> storage shared between
> iSCSI target then all should be fine, otherwise YMMV - some people run it
> like that without realizing
> the dangers and have never had a problem, so it may be strictly
> theoretical, and it all depends on how often you need to do the
> failover and what data you are storing - corrupting a few images on a
> gallery site could be fine but corrupting
> a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck for
VMFS...


>
> Some (non open-source) solutions exist; Solaris supposedly does this in
> some(?) way, maybe some iSCSI guru
> can chime in and tell us what magic they do, but I don't think it's possible
> without client support
> (you essentialy have to do something like transactions and replay the last
> transaction on failover). Maybe
> something can be enabled in protocol to do the iSCSI IO synchronous or
> make it at least wait for some sort of ACK from the
> server (which would require some sort of cache mirroring between the
> targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is not
possible to dynamically make all OS's do what your iSCSI target expects.

Something like VMware does the right thing pretty much all the time (there
are some iSCSI initiator bugs in earlier ESXi 5.x).  If you have control of
your ESXi hosts then attempting to set up HA iSCSI targets is possible.

If you have a mixed client environment with various versions of Windows
connecting to the target, you may be better off buying some SAN appliances.


> The one time I had to use it I resorted to simply mirroring in via mdraid
> on the client side over two targets sharing the same
> DAS, and this worked fine during testing but never went to production in
> the end.
>
> Jan
>
> >
> >>
> >> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
> >> clients ?
> >
> > I can't say, because I have not used stgt with rbd bs-type support
> enough.


For starters, STGT doesn't implement VAAI properly and you will need to
disable VAAI in ESXi.
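
If it helps, this is roughly how VAAI is usually switched off on an ESXi host
(these are standard ESXi advanced settings, but double-check the option names
against your ESXi version before relying on them):

esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0

Setting the values back to 1 re-enables the primitives.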

LIO does seem to implement VAAI properly, but performance is not nearly as
good as STGT, even with VAAI's benefits. The assumed cause is that LIO
currently uses kernel rbd mapping, and kernel rbd performance is not as good
as librbd.

I recently did a simple test of 

Re: [ceph-users] ceph + vmware

2016-07-16 Thread Jake Young
On Saturday, July 16, 2016, Oliver Dzombic <i...@ip-interactive.de> wrote:

> Hi Jake,
>
> thank you very much, both were needed: MTU and VAAI deactivated ( i hope
> that won't interfere with vmotion or other features ).
>
> I changed now the MTU of vmkernel and vswitch. That solved this problem.


Try turning VAAI back on at some point.


>
> So i could make an ext4 filesystem and mount it.
>
> Running
>
> dd if=/dev/zero of=/mnt/8G_test bs=4k count=2M conv=fdatasync
>
> Something is strange to me:
>
> The network gets a straight 1 Gbit ( maximum connection ) of iscsi bandwidth.
>
> But inside the vm i can only see 40-50MB/s.
>
> I mean replicationsize is 2. So it would be easy to say 1/2 of 1 Gbit =
> 500 Mbit = 40-50MB/s.
>
> But should this reduction not be inside of the ceph cluster ? Which is
> going with 10G network ?
>
> I mean the data hits the ceph iscsi server at 1 Gbit. So now
> this is transported to RBD internally by tgt.
> And there it is multiplied by 2 ( over the cluster network which is 10G )
> before the ACK is sent back to iscsi. So the cluster will internally
> duplicate it via 10G. So my expected bandwidth inside the vm should be
> higher than half of the maximum speed.
>
> Is this a wrong understanding of the mechanism ?


The delay is most likely just having to wait for 2 disks to actually do the
write.


>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de <javascript:;>
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 16.07.2016 um 02:18 schrieb Jake Young:
> > I had some odd issues like that due to MTU mismatch.
> >
> > Keep in mind that the vSwitch and vmkernel port have independent MTU
> > settings.  Verify you can ping with large size packets without
> > fragmentation between your host and iscsi target.
> >
> > If that's not it, you can try to disable VAAI options to see if one of
> > them is causing issues. I haven't used ESXi 6.0 yet.
> >
> > Jake
> >
> >
> > On Friday, July 15, 2016, Oliver Dzombic <i...@ip-interactive.de
> <javascript:;>
> > <mailto:i...@ip-interactive.de <javascript:;>>> wrote:
> >
> > Hi,
> >
> > i am currently trying out the stuff.
> >
> > My tgt config:
> >
> > # cat tgtd.conf
> > # The default config file
> > include /etc/tgt/targets.conf
> >
> > # Config files from other packages etc.
> > include /etc/tgt/conf.d/*.conf
> >
> > nr_iothreads=128
> >
> >
> > -
> >
> > # cat iqn.2016-07.tgt.esxi-test.conf
> > 
> >   initiator-address ALL
> >   scsi_sn esxi-test
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 1
> >   scsi_id cf1c4a71e700506357
> >   
> >   
> >
> >
> > --
> >
> >
> > If i create a vm inside esxi 6 and try to format the virtual hdd, i
> see
> > in logs:
> >
> > sd:2:0:0:0: [sda] CDB:
> > Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> > mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> > mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
> >
> > With the LSI HDD emulation. With the vmware paravirtualization
> > everything just freeze.
> >
> > Any idea with that issue ?
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:i...@ip-interactive.de <javascript:;>
> >
> > Anschrift:
> >
> > IP Interactive UG ( haftungsbeschraenkt )
> > Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402 beim Amtsgericht Hanau
> > Geschäftsführung: Oliver Dzombic
> >
> > Steuer Nr.: 35 236 3622 1
> > UST ID: DE274086107
> >
> >
> > Am 11.07.2016 um 22:24 schrieb Jake Young:
> > > I'm using this setup with ESXi 5.1 and I get very good
> performance.  I
> > > suspect you have other issues.  Reliability is another story (see
> > Nick's
> > 

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Jake Young
I had some odd issues like that due to MTU mismatch.

Keep in mind that the vSwitch and vmkernel port have independent MTU
settings.  Verify you can ping with large size packets without
fragmentation between your host and iscsi target.
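
For example, assuming a 9000-byte MTU end to end (8972 = 9000 minus the IP and
ICMP headers; the addresses are placeholders):

ping -M do -s 8972 <iscsi_target_ip>     # from a Linux host, -M do sets DF
vmkping -d -s 8972 <iscsi_target_ip>     # from the ESXi shell, -d sets DF

If either fails while a normal ping works, something in the path is not
passing jumbo frames.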

If that's not it, you can try to disable VAAI options to see if one of them
is causing issues. I haven't used ESXi 6.0 yet.

Jake


On Friday, July 15, 2016, Oliver Dzombic <i...@ip-interactive.de> wrote:

> Hi,
>
> i am currently trying out the stuff.
>
> My tgt config:
>
> # cat tgtd.conf
> # The default config file
> include /etc/tgt/targets.conf
>
> # Config files from other packages etc.
> include /etc/tgt/conf.d/*.conf
>
> nr_iothreads=128
>
>
> -
>
> # cat iqn.2016-07.tgt.esxi-test.conf
> 
>   initiator-address ALL
>   scsi_sn esxi-test
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 1
>   scsi_id cf1c4a71e700506357
>   
>   
>
>
> --
>
>
> If i create a vm inside esxi 6 and try to format the virtual hdd, i see
> in logs:
>
> sd:2:0:0:0: [sda] CDB:
> Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
>
> With the LSI HDD emulation. With the vmware paravirtualization
> everything just freeze.
>
> Any idea with that issue ?
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de <javascript:;>
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 11.07.2016 um 22:24 schrieb Jake Young:
> > I'm using this setup with ESXi 5.1 and I get very good performance.  I
> > suspect you have other issues.  Reliability is another story (see Nick's
> > posts on tgt and HA to get an idea of the awful problems you can have),
> > but for my test labs the risk is acceptable.
> >
> >
> > One change I found helpful is to run tgtd with 128 threads.  I'm running
> > Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> > line that read:
> >
> > exec tgtd
> >
> > to
> >
> > exec tgtd --nr_iothreads=128
> >
> >
> > If you're not concerned with reliability, you can enhance throughput
> > even more by enabling rbd client write-back cache in your tgt VM's
> > ceph.conf file (you'll need to restart tgtd for this to take effect):
> >
> > [client]
> > rbd_cache = true
> > rbd_cache_size = 67108864 # (64MB)
> > rbd_cache_max_dirty = 50331648 # (48MB)
> > rbd_cache_target_dirty = 33554432 # (32MB)
> > rbd_cache_max_dirty_age = 2
> > rbd_cache_writethrough_until_flush = false
> >
> >
> >
> >
> > Here's a sample targets.conf:
> >
> >   
> >   initiator-address ALL
> >   scsi_sn Charter
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 5
> >   scsi_id cfe1000c4a71e700506357
> >   
> >   
> >   lun 6
> >   scsi_id cfe1000c4a71e700507157
> >   
> >   
> >   lun 7
> >   scsi_id cfe1000c4a71e70050da7a
> >   
> >   
> >   lun 8
> >   scsi_id cfe1000c4a71e70050bac0
> >   
> >   
> >
> >
> >
> > I don't have FIO numbers handy, but I have some oracle calibrate io
> > output.
> >
> > We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> > which use iSCSI to connect to the tgt service.  I only have a single
> > connection setup in ESXi for each LUN.  I tested using multipathing and
> > two tgt VMs presenting identical LUNs/RBD disks, but found that there
> > wasn't a significant performance gain by doing this, even with
> > round-robin path selecting in VMware.
> >
> >
> > These tests were run from two RAC VMs, each on a different host, with
> > both hosts connected to the same tgt instance.  The way we have oracle
> > configured, it would have been using two of the LUNs heavily during this
> > calibrate IO test.
> >
> >
> > This output is with 128 threads in tgtd and rbd client cache enabled:
> >
> > START_TIME   END_TIME   MAX_IOPS   MAX_MBPS
> MAX_PMBPS   LATENCY   DISKS
> > ---

Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Jake Young
We use all Cisco UCS servers (C240 M3 and M4s) with the PCIE VIC 1385 40G
NIC.  The drivers were included in Ubuntu 14.04.  I've had no issues with
the NICs or my network whatsoever.

We have two Cisco Nexus 5624Q that the OSD servers connect to.  The
switches are just switching two VLANs (ceph client and cluster networks),
no layer 3 routing.  Those switches connect directly to two pairs of 6248
Fabric Interconnects, which are like TOR switches for UCS Blade Server
Chassis.

On Wed, Jul 13, 2016 at 11:08 AM,  wrote:

> I am using these for other stuff:
> http://www.supermicro.com/products/accessories/addon/AOC-STG-b4S.cfm
>
> If you want NIC, also think of the "network side" : SFP+ switch are very
> common, 40G is less common, 25G is really new (= really few products)
>
>
>
> On 13/07/2016 16:50, Warren Wang - ISD wrote:
> > I've run the Mellanox 40 gig card. Connectx 3-Pro, but that's old now.
> > Back when I ran it, the  drivers were kind of a pain to deal with in
> > Ubuntu, primarily during PXE. It should be better now though.
> >
> > If you have the network to support it, 25Gbe is quite a bit cheaper per
> > port, and won't be so hard to drive. 40Gbe is very hard to fill. I
> > personally probably would not do 40 again.
> >
> > Warren Wang
> >
> >
> >
> > On 7/13/16, 9:10 AM, "ceph-users on behalf of Götz Reinicke - IT
> > Koordinator"  > goetz.reini...@filmakademie.de> wrote:
> >
> >> Am 13.07.16 um 14:59 schrieb Joe Landman:
> >>>
> >>>
> >>> On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:
>  40Gbps can be used as 4*10Gbps
> 
>  I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
>  ports", but extended to "usage of more than a single 10Gbps port, eg
>  20Gbps etc too"
> 
>  Is there people here that are using more than 10G on an ceph server ?
> >>>
> >>> We have built, and are building Ceph units for some of our customers
> >>> with dual 100Gb links.  The storage box was one of our all flash
> >>> Unison units for OSDs.  Similarly, we have several customers actively
> >>> using multiple 40GbE links on our 60 bay Unison spinning rust disk
> >>> (SRD) box.
> >>>
> >> Now we get closer. Can you tell me which 40G Nic you use?
> >>
> >>/götz
> >>
> >
> > This email and any files transmitted with it are confidential and
> intended solely for the individual or entity to whom they are addressed. If
> you have received this email in error destroy it immediately. *** Walmart
> Confidential ***
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Jake Young
My OSDs have dual 40G NICs.  I typically don't use more than 1Gbps on
either network. During heavy recovery activity (like if I lose a whole
server), I've seen up to 12Gbps on the cluster network.

For reference my cluster is 9 OSD nodes with 9x 7200RPM 2TB OSDs. They all
have RAID cards with 4GB of RAM and a BBU. The disks are in single disk
RAID 1 to make use of the card's WB cache.

I can imagine with more servers, the peak recovery BW usage may go up even
more, to the max write rate to the RAID card's cache.

Jake



On Wednesday, July 13, 2016,  wrote:

> 40Gbps can be used as 4*10Gbps
>
> I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
> ports", but extended to "usage of more than a single 10Gbps port, eg
> 20Gbps etc too"
>
> Is there people here that are using more than 10G on an ceph server ?
>
> On 13/07/2016 14:27, Wido den Hollander wrote:
> >
> >> Op 13 juli 2016 om 12:00 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de >:
> >>
> >>
> >> Am 13.07.16 um 11:47 schrieb Wido den Hollander:
>  Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de >:
> 
> 
>  Hi,
> 
>  can anybody give some realworld feedback on what hardware
>  (CPU/Cores/NIC) you use for a 40Gb (file)server (smb and nfs)? The
> Ceph
>  Cluster will be mostly rbd images. S3 in the future, CephFS we will
> see :)
> 
>  Thanks for some feedback and hints! Regadrs . Götz
> 
> >>> Why do you think you need 40Gb? That's some serious traffic to the
> OSDs and I doubt it's really needed.
> >>>
> >>> Latency-wise 40Gb isn't much better than 10Gb, so why not stick with
> that?
> >>>
> >>> It's also better to have more smaller nodes than a few big nodes with
> Ceph.
> >>>
> >>> Wido
> >>>
> >> Hi Wido,
> >>
> >> maybe my post was misleading. The OSD nodes do have 10G; the fileserver
> >> in front for the clients/desktops should have 40G.
> >>
> >
> > Ah, the fileserver will re-export RBD/Samba? Any Xeon E5 CPU will do
> just fine I think.
> >
> > Still, 40GbE is a lot of bandwidth!
> >
> > Wido
> >
> >>
> >> OSD NODEs/Cluster 2*10Gb Bond  40G Fileserver 40G  1G/10G
> Clients
> >>
> >> /Götz
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com 
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Jake Young
I'm using this setup with ESXi 5.1 and I get very good performance.  I
suspect you have other issues.  Reliability is another story (see Nick's
posts on tgt and HA to get an idea of the awful problems you can have), but
for my test labs the risk is acceptable.


One change I found helpful is to run tgtd with 128 threads.  I'm running
Ubuntu 14.04, so I edited my /etc/init/tgt.conf file and changed the line
that read:

exec tgtd

to

exec tgtd --nr_iothreads=128


If you're not concerned with reliability, you can enhance throughput even
more by enabling rbd client write-back cache in your tgt VM's ceph.conf
file (you'll need to restart tgtd for this to take effect):

[client]
rbd_cache = true
rbd_cache_size = 67108864 # (64MB)
rbd_cache_max_dirty = 50331648 # (48MB)
rbd_cache_target_dirty = 33554432 # (32MB)
rbd_cache_max_dirty_age = 2
rbd_cache_writethrough_until_flush = false
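
As a side note (a standard librbd facility, not something specific to this
setup): if you want to confirm the running tgtd actually picked these settings
up, you can give the client an admin socket and query it:

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

# then, after restarting tgtd:
ceph --admin-daemon /var/run/ceph/<generated socket>.asok config show | grep rbd_cache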




Here's a sample targets.conf:

  
  initiator-address ALL
  scsi_sn Charter
  #vendor_id CEPH
  #controller_tid 1
  write-cache on
  read-cache on
  driver iscsi
  bs-type rbd
  
  lun 5
  scsi_id cfe1000c4a71e700506357
  
  
  lun 6
  scsi_id cfe1000c4a71e700507157
  
  
  lun 7
  scsi_id cfe1000c4a71e70050da7a
  
  
  lun 8
  scsi_id cfe1000c4a71e70050bac0
  
  
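
Note: the angle-bracket <target ...> / <backing-store ...> lines in the sample
above were stripped by the mail archive, which is why a few lines look empty.
A complete stanza normally looks something like the following -- the IQN and
the rbd image name here are placeholders, not the values from my real config:

<target iqn.2016-06.com.example:charter>
  initiator-address ALL
  scsi_sn Charter
  write-cache on
  read-cache on
  driver iscsi
  bs-type rbd
  <backing-store rbd/volume1>
    lun 5
    scsi_id cfe1000c4a71e700506357
  </backing-store>
</target>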



I don't have FIO numbers handy, but I have some oracle calibrate io output.


We're running Oracle RAC database servers in linux VMs on ESXi 5.1, which
use iSCSI to connect to the tgt service.  I only have a single connection
setup in ESXi for each LUN.  I tested using multipathing and two tgt VMs
presenting identical LUNs/RBD disks, but found that there wasn't a
significant performance gain by doing this, even with round-robin path
selecting in VMware.


These tests were run from two RAC VMs, each on a different host, with both
hosts connected to the same tgt instance.  The way we have oracle
configured, it would have been using two of the LUNs heavily during this
calibrate IO test.


This output is with 128 threads in tgtd and rbd client cache enabled:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 15:10:50  28-JUN-016 15:20:04     14153       658        412       14     75


This output is with the same configuration, but with rbd client cache
disabled:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 22:44:29  28-JUN-016 22:49:05      7449       161        219       20     75

This output is from a directly connected EMC VNX5100 FC SAN with 25 disks
using dual 8Gb FC links on a different lab system:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 22:11:25  28-JUN-016 22:18:48      6487       299        224       19     75


One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
accomplished this performance-wise; the next step is to get a plausible
iSCSI HA solution working.  I'm very interested in what Mike Christie is
putting together.  I'm in the process of vetting the SUSE solution now.

BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
HDs, across 9 OSD hosts.  We have no SSD journals; instead we have all the
disks set up as single disk RAID1 disk groups with WB cache and BBU.  All
OSD hosts have 40Gb networking and the ESXi hosts have 10G.

Jake


On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic 
wrote:

> Hi Mike,
>
> i was trying:
>
> https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
>
> ONE target, from different OSD servers directly, to multiple vmware esxi
> servers.
>
> A config looked like:
>
> #cat iqn.ceph-cluster_netzlaboranten-storage.conf
>
> 
> driver iscsi
> bs-type rbd
> backing-store rbd/vmware-storage
> initiator-address 10.0.0.9
> initiator-address 10.0.0.10
> incominguser vmwaren-storage RPb18P0xAqkAw4M1
> 
>
>
> We had 4 OSD servers. Everyone had this config running.
> We had 2 vmware servers ( esxi ).
>
> So we had 4 paths to this vmware-storage RBD object.
>
> VMware, in the very end, had 8 paths ( 4 path's directly connected to
> the specific vmware server ) + 4 paths this specific vmware servers saw
> via the other vmware server ).
>
> There were very big problems with performance. I am talking about < 10
> MB/s. So the customer was not able to use it, so good old nfs is serving.
>
> At that time we used ceph hammer, and i think esxi 5.5 the customer was
> using, or maybe esxi 6, was somewhere last year the testing.
>
> 
>
> We will make a new attempt now with ceph jewel and esxi 6 and this time
> we will manage the vmware servers.
>
> As soon as we fixed this
>
> "ceph mon 

Re: [ceph-users] Mounting Ceph RBD image to XenServer 7 as SR

2016-06-30 Thread Jake Young
See https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17112.html


On Thursday, June 30, 2016, Mike Jacobacci  wrote:

> So after adding the ceph repo and enabling the centos-7 repo… It fails
> trying to install ceph-common:
>
> Loaded plugins: fastestmirror
> Loading mirror speeds from cached hostfile
>  * base: mirror.web-ster.com
> Resolving Dependencies
> --> Running transaction check
> ---> Package ceph-common.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: python-cephfs = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: python-rados = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librbd1 = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libcephfs1 = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: python-rbd = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librados2 = 1:10.2.2-0.el7 for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: python-requests for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libboost_program_options-mt.so.1.53.0()(64bit)
> for package: 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librgw.so.2()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libradosstriper.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libbabeltrace-ctf.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libboost_regex-mt.so.1.53.0()(64bit) for
> package: 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libboost_iostreams-mt.so.1.53.0()(64bit) for
> package: 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librbd.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libtcmalloc.so.4()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: librados.so.2()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libbabeltrace.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Running transaction check
> ---> Package boost-iostreams.x86_64 0:1.53.0-25.el7 will be installed
> ---> Package boost-program-options.x86_64 0:1.53.0-25.el7 will be installed
> ---> Package boost-regex.x86_64 0:1.53.0-25.el7 will be installed
> ---> Package ceph-common.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: libbabeltrace-ctf.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libbabeltrace.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> ---> Package gperftools-libs.x86_64 0:2.4-7.el7 will be installed
> --> Processing Dependency: libunwind.so.8()(64bit) for package:
> gperftools-libs-2.4-7.el7.x86_64
> ---> Package libcephfs1.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: libboost_random-mt.so.1.53.0()(64bit) for
> package: 1:libcephfs1-10.2.2-0.el7.x86_64
> ---> Package librados2.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: liblttng-ust.so.0()(64bit) for package:
> 1:librados2-10.2.2-0.el7.x86_64
> ---> Package libradosstriper1.x86_64 1:10.2.2-0.el7 will be installed
> ---> Package librbd1.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: liblttng-ust.so.0()(64bit) for package:
> 1:librbd1-10.2.2-0.el7.x86_64
> ---> Package librgw2.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: libfcgi.so.0()(64bit) for package:
> 1:librgw2-10.2.2-0.el7.x86_64
> ---> Package python-cephfs.x86_64 1:10.2.2-0.el7 will be installed
> ---> Package python-rados.x86_64 1:10.2.2-0.el7 will be installed
> ---> Package python-rbd.x86_64 1:10.2.2-0.el7 will be installed
> ---> Package python-requests.noarch 0:2.6.0-1.el7_1 will be installed
> --> Processing Dependency: python-urllib3 >= 1.10.2-1 for package:
> python-requests-2.6.0-1.el7_1.noarch
> --> Processing Dependency: python-chardet >= 2.2.1-1 for package:
> python-requests-2.6.0-1.el7_1.noarch
> --> Running transaction check
> ---> Package boost-random.x86_64 0:1.53.0-25.el7 will be installed
> ---> Package ceph-common.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: libbabeltrace-ctf.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> --> Processing Dependency: libbabeltrace.so.1()(64bit) for package:
> 1:ceph-common-10.2.2-0.el7.x86_64
> ---> Package librados2.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: liblttng-ust.so.0()(64bit) for package:
> 1:librados2-10.2.2-0.el7.x86_64
> ---> Package librbd1.x86_64 1:10.2.2-0.el7 will be installed
> --> Processing Dependency: liblttng-ust.so.0()(64bit) for package:
> 1:librbd1-10.2.2-0.el7.x86_64
> ---> Package librgw2.x86_64 1:10.2.2-0.el7 

Re: [ceph-users] Mounting Ceph RBD image to XenServer 7 as SR

2016-06-30 Thread Jake Young
Can you install the ceph client tools on your server?  They may give you a
more obvious error. Try to install the package and config/keys manually
instead of with ceph-deploy.
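
Once ceph-common and /etc/ceph/ceph.conf plus the keyring are on the XenServer
host, something along these lines is usually more informative than echoing into
/sys/bus/rbd/add directly (pool and image names are placeholders):

ceph -s --id admin                  # confirm the host can actually reach the mons
rbd map rbd/testimage --id admin    # rbd drives the sysfs interface and reports errors
dmesg | tail                        # kernel-side detail if the map still fails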

Also see this:
http://xenserver.org/blog/entry/tech-preview-of-xenserver-libvirt-ceph.html

Jake


On Thursday, June 30, 2016, Mike Jacobacci <mi...@flowjo.com> wrote:

> Hi Jake,
>
> Interesting… XenServer 7 does have rbd installed, but trying to map the rbd
> image with this command:
>
> # echo {ceph_monitor_ip} name={ceph_admin},secret={ceph_key} {ceph_pool}
> {ceph_image} >/sys/bus/rbd/add
>
>  It fails with just an i/o error… I am looking into it now.  My cluster
> health is OK, so I am hoping I didn’t miss a configuration or something.
>
>
> On Jun 29, 2016, at 3:28 PM, Jake Young <jak3...@gmail.com
> <javascript:_e(%7B%7D,'cvml','jak3...@gmail.com');>> wrote:
>
>
>
> On Wednesday, June 29, 2016, Mike Jacobacci <mi...@flowjo.com
> <javascript:_e(%7B%7D,'cvml','mi...@flowjo.com');>> wrote:
>
>> Hi all,
>>
>> Is there anyone using rbd for xenserver vm storage?  I have XenServer 7
>> and the latest Ceph, I am looking for the best way to mount the rbd
>> volume under XenServer.  There is not much recent info out there I have
>> found except for this:
>>
>> http://www.mad-hacking.net/documentation/linux/ha-cluster/storage-area-network/ceph-xen-domu.xml
>>
>> and this plugin (which looks nice):
>> https://github.com/mstarikov/rbdsr
>>
>> I am looking for a way that doesn’t involve too much command line so
>> other admins that don’t know Ceph or XenServer very well can work with it.
>> I am just curious what others are doing… Any help is greatly appreciated!
>>
>> Cheers,
>> Mike
>>
>
> I'm not a XenServer user, so I can't help you there; but I feel your pain
> using Ceph for VMware storage.
>
> I'm surprised that no major Linux distribution has considered
> enabling rbd modules in initrd.
>
> I can see having a tiny OS image containing not much more than grub and
> the boot kernel. The trick would be to find a way to manage the boot string
> in the grub conf on a large scale.
>
> Jake
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mounting Ceph RBD image to XenServer 7 as SR

2016-06-29 Thread Jake Young
On Wednesday, June 29, 2016, Mike Jacobacci  wrote:

> Hi all,
>
> Is there anyone using rbd for xenserver vm storage?  I have XenServer 7
> and the latest Ceph, I am looking for the best way to mount the rbd
> volume under XenServer.  There is not much recent info out there I have
> found except for this:
>
> http://www.mad-hacking.net/documentation/linux/ha-cluster/storage-area-network/ceph-xen-domu.xml
>
> and this plugin (which looks nice):
> https://github.com/mstarikov/rbdsr
>
> I am looking for a way that doesn’t involve too much command line so other
> admins that don’t know Ceph or XenServer very well can work with it.  I am
> just curious what others are doing… Any help is greatly appreciated!
>
> Cheers,
> Mike
>

I'm not a XenServer user, so I can't help you there; but I feel your pain
using Ceph for VMware storage.

I'm surprised that no major Linux distribution has considered
enabling rbd modules in initrd.

I can see having a tiny OS image containing not much more than grub and the
boot kernel. The trick would be to find a way to manage the boot string in
the grub conf on a large scale.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD with iSCSI

2015-09-10 Thread Jake Young
On Wed, Sep 9, 2015 at 8:13 AM, Daleep Bais  wrote:

> Hi,
>
> I am following steps from URL 
> *http://www.sebastien-han.fr/blog/2014/07/07/start-with-the-rbd-support-for-tgt/
> *
>   to create a RBD pool  and share to another initiator.
>
> I am not able to get rbd in the backstore list. Please suggest.
>
> below is the output of tgtadm command:
>
> tgtadm --lld iscsi --op show --mode system
> System:
> State: ready
> debug: off
> LLDs:
> iscsi: ready
> iser: error
> Backing stores:
> sheepdog
> bsg
> sg
> null
> ssc
> smc (bsoflags sync:direct)
> mmc (bsoflags sync:direct)
> rdwr (bsoflags sync:direct)
> Device types:
> disk
> cd/dvd
> osd
> controller
> changer
> tape
> passthrough
> iSNS:
> iSNS=Off
> iSNSServerIP=
> iSNSServerPort=3205
> iSNSAccessControl=Off
>
>
> I have installed tgt and tgt-rbd packages till now. Working on Debian
> GNU/Linux 8.1 (jessie)
>
> Thanks.
>
> Daleep Singh Bais
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
Hey Daleep,

The tgt you have installed does not support Ceph rbd.  See the output from
my system using a more recent tgt that supports rbd.

tgtadm --lld iscsi --mode system --op show
System:
State: ready
debug: off
LLDs:
iscsi: ready
iser: error
Backing stores:
*rbd (bsoflags sync:direct)*
sheepdog
bsg
sg
null
ssc
rdwr (bsoflags sync:direct)
Device types:
disk
cd/dvd
osd
controller
changer
tape
passthrough
iSNS:
iSNS=Off
iSNSServerIP=
iSNSServerPort=3205
iSNSAccessControl=Off


You will need a new version of tgt.  I think the earliest version that
supports rbd is 1.0.42

https://github.com/fujita/tgt
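
After upgrading, the same command as above is a quick way to confirm the rbd
backstore is actually compiled in:

tgtadm --lld iscsi --mode system --op show | grep rbd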
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-29 Thread Jake Young
On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic frederic.sch...@cea.fr
wrote:

 Hi again,

 So I have tried
 - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
 - changing the memory configuration, from advanced ecc mode to
performance mode, boosting the memory bandwidth from 35GB/s to 40GB/s
 - plugged a second 10GB/s link and setup a ceph internal network
 - tried various tuned-adm profile such as throughput-performance

 This changed about nothing.

 If
 - the CPUs are not maxed out, and lowering the frequency doesn't change a
thing
 - the network is not maxed out
 - the memory doesn't seem to have an impact
 - network interrupts are spread across all 8 cpu cores and receive queues
are OK
 - disks are not used at their maximum potential (iostat shows my dd
commands produce much more tps than the 4MB ceph transfers...)

 Where can I possibly find a bottleneck ?

 I'm /(almost) out of ideas/ ... :'(

 Regards


Frederic,

I was trying to optimize my ceph cluster as well and I looked at all of the
same things you described, which didn't help my performance noticeably.

The following network kernel tuning settings did help me significantly.

This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph osds
and any client that connects to my ceph cluster.

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for
10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
#net.core.rmem_max = 56623104
#net.core.wmem_max = 56623104
# Use 128M buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
# Increase the length of the processor input queue
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 3
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Recommended when jumbo frames are enabled
net.ipv4.tcp_mtu_probing = 1
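
(To apply these to a running host without a reboot, something like
"sysctl -p /etc/sysctl.conf" should do it.)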

I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else.

Let me know if that helps.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-29 Thread Jake Young
On Wed, Jul 29, 2015 at 11:23 AM, Mark Nelson mnel...@redhat.com wrote:

 On 07/29/2015 10:13 AM, Jake Young wrote:

 On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic
 frederic.sch...@cea.fr mailto:frederic.sch...@cea.fr wrote:
  
   Hi again,
  
   So I have tried
   - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
   - changing the memory configuration, from advanced ecc mode to
 performance mode, boosting the memory bandwidth from 35GB/s to 40GB/s
   - plugged a second 10GB/s link and setup a ceph internal network
   - tried various tuned-adm profile such as throughput-performance
  
   This changed about nothing.
  
   If
   - the CPUs are not maxed out, and lowering the frequency doesn't
 change a thing
   - the network is not maxed out
   - the memory doesn't seem to have an impact
   - network interrupts are spread across all 8 cpu cores and receive
 queues are OK
   - disks are not used at their maximum potential (iostat shows my dd
 commands produce much more tps than the 4MB ceph transfers...)
  
   Where can I possibly find a bottleneck ?
  
   I'm /(almost) out of ideas/ ... :'(
  
   Regards
  
  
 Frederic,

 I was trying to optimize my ceph cluster as well and I looked at all of
 the same things you described, which didn't help my performance
 noticeably.

 The following network kernel tuning settings did help me significantly.

 This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph
 osds and any client that connects to my ceph cluster.

  # Increase Linux autotuning TCP buffer limits
  # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104)
 for 10GE
  # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
  #net.core.rmem_max = 56623104
  #net.core.wmem_max = 56623104
  # Use 128M buffers
  net.core.rmem_max = 134217728
  net.core.wmem_max = 134217728
  net.core.rmem_default = 67108864
  net.core.wmem_default = 67108864
  net.core.optmem_max = 134217728
  net.ipv4.tcp_rmem = 4096 87380 67108864
  net.ipv4.tcp_wmem = 4096 65536 67108864

  # Make room for more TIME_WAIT sockets due to more clients,
  # and allow them to be reused if we run out of sockets
  # Also increase the max packet backlog
  net.core.somaxconn = 1024
  # Increase the length of the processor input queue
  net.core.netdev_max_backlog = 25
  net.ipv4.tcp_max_syn_backlog = 3
  net.ipv4.tcp_max_tw_buckets = 200
  net.ipv4.tcp_tw_reuse = 1
  net.ipv4.tcp_tw_recycle = 1
  net.ipv4.tcp_fin_timeout = 10

  # Disable TCP slow start on idle connections
  net.ipv4.tcp_slow_start_after_idle = 0

  # If your servers talk UDP, also up these limits
  net.ipv4.udp_rmem_min = 8192
  net.ipv4.udp_wmem_min = 8192

  # Disable source routing and redirects
  net.ipv4.conf.all.send_redirects = 0
  net.ipv4.conf.all.accept_redirects = 0
  net.ipv4.conf.all.accept_source_route = 0

  # Recommended when jumbo frames are enabled
  net.ipv4.tcp_mtu_probing = 1

 I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything
 else.

 Let me know if that helps.


 Hi Jake,

 Could you talk a little bit about what scenarios you've seen tuning this
 help?  I noticed improvement in RGW performance in some cases with similar
 TCP tunings, but it would be good to understand what other folks are seeing
 and in what situations.


 Jake


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

  ___
 ceph-users mailing list

 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Hey Mark,

I'm only using RBD.  My clients are all VMware, so I have a few iSCSI proxy
VMs (using rbd enabled tgt).  My workload is typically light random
read/write, except for the periodic eager zeroing of multi terabyte
volumes.  Since there is no VAAI in tgt, this turns into heavy sequential
writing.

I found the network tuning above helped to open up the connection from a
single iSCSI proxy VM to the cluster.

Note that my osd nodes have both a public network interface as well as a
dedicated private network interface, which are both 40G.  I believe the
network tuning also has another effect of improving the performance of the
cluster network (where the replication data is sent across), because
initially I had only applied the kernel tuning to the osd nodes and saw a
performance improvement before I implemented it on the iSCSI proxy VMs.

I should mention that I did all of my testing back in firefly (about 1 year
ago) and I haven't tried to remove these parameters from my cluster to see
if there is a performance degradation now that I'm running Hammer

Re: [ceph-users] Cisco UCS Blades as MONs? Pros cons ...?

2015-05-14 Thread Jake Young
I have 42 OSDs on 6 servers. I'm planning to double that this quarter by
adding 6 more servers to get to 84 OSDs.

I have 3 monitor VMs. Two of them are running on two different blades in
the same chassis, but their networking is on different fabrics. The third
one is on a blade in a different chassis.

My monitor VM cpu, memory and disk io load is very small, as in nearly
idle. The VM images are on local 10k disks on the blade. They share the
disks with a few other low IO VMs.

I've read that the monitors can get busy and need a lot of IO, enough to
justify using SSDs. I imagine those must be very large clusters with at
least hundreds of OSDs.

Jake

On Wednesday, May 13, 2015, Götz Reinicke - IT Koordinator 
goetz.reini...@filmakademie.de wrote:

 Hi Jake,

 we have the fabric interconnects.

 MONs as VM? What setup do you have? and what cluster size?

 Regards . Götz


 Am 13.05.15 um 15:20 schrieb Jake Young:
  I run my mons as VMs inside of UCS blade compute nodes.
 
  Do you use the fabric interconnects or the standalone blade chassis?
 
  Jake
 
  On Wednesday, May 13, 2015, Götz Reinicke - IT Koordinator
  goetz.reini...@filmakademie.de javascript:; mailto:
 goetz.reini...@filmakademie.de javascript:;
  wrote:
 
  Hi Christian,
 
  currently we do get good discounts as an University and the bundles
 were
  worth it.
 
  The chassis do have multiple PSUs and n 10Gb Ports (40Gb is
 possible).
  The switch connection is redundant.
 
  Currently we think of 10 SATA OSD nodes + x SSD Cache Pool Nodes
 and 5
  MONs. For a start.
 
  The main focus with the blades would be space saving in the rack. Till
  now I don't have any price, but that would count too in our decision
 :)
 
  Thanks and regards . Götz
 
 ...


 --
 Götz Reinicke
 IT-Koordinator

 Tel. +49 7141 969 82 420
 E-Mail goetz.reini...@filmakademie.de javascript:;

 Filmakademie Baden-Württemberg GmbH
 Akademiehof 10
 71638 Ludwigsburg
 www.filmakademie.de

 Eintragung Amtsgericht Stuttgart HRB 205016

 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
 Staatssekretär im Ministerium für Wissenschaft,
 Forschung und Kunst Baden-Württemberg

 Geschäftsführer: Prof. Thomas Schadt


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cisco UCS Blades as MONs? Pros cons ...?

2015-05-13 Thread Jake Young
I run my mons as VMs inside of UCS blade compute nodes.

Do you use the fabric interconnects or the standalone blade chassis?

Jake

On Wednesday, May 13, 2015, Götz Reinicke - IT Koordinator 
goetz.reini...@filmakademie.de wrote:

 Hi Christian,

 currently we do get good discounts as an University and the bundles were
 worth it.

 The chassis do have multiple PSUs and n 10Gb Ports (40Gb is possible).
 The switch connection is redundant.

 Currently we think of 10 SATA OSD nodes + x SSD Cache Pool Nodes and 5
 MONs. For a start.

 The main focus with the blades would be space saving in the rack. Till
 now I don't have any price, but that would count too in our decision :)

 Thanks and regards . Götz

 Am 12.05.15 um 14:50 schrieb Christian Balzer:
 
  Hello,
 
  I'm not familiar with Cisco UCS gear (can you cite exact models?),
  but somehow the thought of buying compute gear from Cisco makes me think
 of
  having too much money or very steep discounts. ^o^
 
  That said, I presume the chassis those blades are in have redundancy in
  terms of PSUs (we always have at least 2 independent power circuits per
  rack) and outside connectivity.
  So from where I'm standing (I have deployed plenty of SuperMicro
  MicroCloud chassis/blades) I'd consider the blade a PoF and be done.
 
  What I would do (remembering the scale of your planned deployment) is to
  go with one dedicated MON that will be the primary (lowest IP) 99.8% of
  the time and 4 OSDs with MONs on them. If you want to feel extra good
  about this, give those OSDs a bit more CPU/RAM and most of all fast SSDs
  for the OS (/var/lib/ceph).
 
  Christian
 
  On Tue, 12 May 2015 14:30:58 +0200 Götz Reinicke - IT Koordinator wrote:
 
  Hi,
 
  we have some space in our two blade chassis, so I was thinking of the
  pros and cons of using some blades as MONs. I thought about five MONs.
 
  Pro: space saving in our rack
  Con: just two blade centers. Two points of failures.
 
  From the redundancy POV I'd go with standalone servers, but space
  could be a bit of a problem currently 
 
  Waht do you think?
 
   Regards . Götz
 
 


 --
 Götz Reinicke
 IT-Koordinator

 Tel. +49 7141 969 82 420
 E-Mail goetz.reini...@filmakademie.de javascript:;

 Filmakademie Baden-Württemberg GmbH
 Akademiehof 10
 71638 Ludwigsburg
 www.filmakademie.de

 Eintragung Amtsgericht Stuttgart HRB 205016

 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
 Staatssekretär im Ministerium für Wissenschaft,
 Forschung und Kunst Baden-Württemberg

 Geschäftsführer: Prof. Thomas Schadt


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using RAID Controller for OSD and JNL disks in Ceph Nodes

2015-05-04 Thread Jake Young
On Monday, May 4, 2015, Christian Balzer ch...@gol.com wrote:

 On Mon, 13 Apr 2015 10:39:57 +0530 Sanjoy Dasgupta wrote:

  Hi!
 
  This is an often discussed and clarified topic, but Reason why I am
  asking is because
 
  If We use a RAID controller with Lot of Cache (FBWC) and Configure each
  Drive as Single Drive RAID0, then  Write to disks will benefit by using
  FBWC and accelerate I/O performance. Is this correct assumption ?
 
 In the case of Ceph, the journal writes (assuming journals on the HDDs,
 not separate SSDs) will benefit from this indeed.


Each of my OSD nodes has 7 2TB disks on one RAID card with 1GB of FBWC with
BBU.

I get pretty good performance. With 6 of the above osd nodes and no
replication, using rados bench, I can get around 7500 4k write iops and it
can peak to over 10k iops.

The write throughput at 4MB is around 1000 MB/s, and it will peak to around
1309 MB/s.
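
For reference, numbers like these come from runs along the lines of the
following (pool name, duration and thread counts are placeholders):

rados bench -p testpool 60 write -b 4096 -t 64       # 4k writes
rados bench -p testpool 60 write -b 4194304 -t 16    # 4M writes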




  Also, it indeed helps, what are the downside of using RAID Controller
  with Cache in above manner (Apart from Cost of RAID controller) ?
 
 If you have too much money or existing HW, knock yourself out. ^.^

 Aside from the cost, the cache should be battery backed, additional costs
 and maintenance issues.
 Setting up RAID0 drives can be painful (megacli must die), and adds
 another step to OSD replacement/deployment.
 The resulting drive may or may not have all SMART features exposed and
 (not applicable in your case) won't support TRIM.

 Lastly the cache tends to be small when shared with many HDDs and there is
 also competition over it by reads and writes, but that's something to keep
 in mind, not a disadvantage per se.


Yes, all of the above.

I think it is worth the money.

I thought I wouldn't need SSD journals, and the raid cards were cheaper
than a few good SSDs.

Due to my odd workload, I end up with all of my writes being below 256k in
size.  I have to use iscsi proxy nodes to connect to my VMware hosts. This
limits my VM's throughput to around 200MB/s which isn't fast enough for my
application.  I'm planning on building new osd nodes with the same raid
cards and SSD journals to help further coalesce the small writes.



 Christian
 --
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com javascript:;Global OnLine Japan/Fusion
 Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com javascript:;
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cost- and Powerefficient OSD-Nodes

2015-04-28 Thread Jake Young
On Tuesday, April 28, 2015, Dominik Hannen han...@xplace.de wrote:

 Hi ceph-users,

 I am currently planning a cluster and would like some input specifically
 about the storage-nodes.

 The non-osd systems will be running on more powerful system.

 Interconnect as currently planned:
 4 x 1Gbit LACP Bonds over a pair of MLAG-capable switches (planned: EX3300)


One problem with LACP is that it will only allow you to have 1Gbps between
any two IPs or MACs (depending on your switch config). This will most
likely limit the throughput of any client to 1Gbps, which is equivalent
to 125MBps storage throughput.  It is not really equivalent to a 4Gbps
interface or 2x 2Gbps interfaces (if you plan to have a client network and
cluster network).

So far I would go with Supermicros 5018A-MHN4 offering, rack-space is not
 really a concern, so only 4 OSDs per U is fine.
 (The cluster is planned to start with 8 osd-nodes.)

 osd-node:
 Avoton C2758 - 8 x 2.40GHz
 16 GB RAM ECC
 16 GB SSD - OS - SATA-DOM
 250GB SSD - Journal (MX200 250GB with extreme over-provisioning, staggered
 deployment, monitored for TBW-Value)
 4 x 3 TB OSD - Seagate Surveillance HDD (ST3000VX000) 7200rpm 24/7
 4 x 1 Gbit

 per-osd breakdown:
 3 TB HDD
 2 x 2.40GHz (Avoton-Cores)
 4 GB RAM
 8 GB SSD-Journal (~125 MB/s r/w)
 1 Gbit

 The main question is, will the Avoton CPU suffice? (I reckon the common
 1GHz/OSD suggestion is in regards to much more powerful CPUs.)

 I don't have any experience with this CPU, but 8x 2.4GHz cores for 4 OSDs
seems like plenty of CPU.

I have 32GB of RAM for 7 osds, which has been enough for me.

Are there any cost-effective suggestions to improve this configuration?


I have implemented a small cluster with no SSD journals, and the
performance is pretty good.

42 osds, 3x replication, 40Gb NICs; rados bench shows me 2000 iops at 4k
writes and 500MBps at 4M writes.

I would trade your SSD journals for 10Gb NICs and switches.  I started out
with the same 4x 1Gb LACP config and things like rebalancing/recovery were
terribly slow, as well as the throughput limit I mentioned above.

When you get more funding next quarter/year, you can choose to add the SSD
journals or more OSD nodes. Moving to 10Gb networking after you get the
cluster up and running will be much harder.


 Will erasure coding be a feasible possibility?

 Does it hurt to run OSD-nodes CPU-capped, if you have enough of them?

 ___
 Dominik Hannen
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com javascript:;
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cost- and Powerefficient OSD-Nodes

2015-04-28 Thread Jake Young
On Tuesday, April 28, 2015, Nick Fisk n...@fisk.me.uk wrote:





  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
 javascript:;] On Behalf Of
  Dominik Hannen
  Sent: 28 April 2015 15:30
  To: Jake Young
  Cc: ceph-users@lists.ceph.com javascript:;
  Subject: Re: [ceph-users] Cost- and Powerefficient OSD-Nodes
 
   Interconnect as currently planned:
   4 x 1Gbit LACP Bonds over a pair of MLAG-capable switches (planned:
   EX3300)
 
   One problem with LACP is that it will only allow you to have 1Gbps
   between any two IPs or MACs (depending on your switch config). This
   will most likely limit the throughput of any client to 1Gbps, which is
   equivalent to 125MBps storage throughput.  It is not really equivalent
   to a 4Gbps interface or 2x 2Gbps interfaces (if you plan to have a
   client network and cluster network).
 
  2 x (2 x 1Gbit) was on my mind with cluster/public separated, if the
  performance of 4 x 1Gbit LACP would not deliver.
  Regarding source-IP/dest-IP hashing with LACP. Wouldn't it be sufficient
 to
  give each osd-process its own IP for cluster/public then?


I'm not sure this is supported. It would probably require a custom CRUSH
map.  I don't know if a host bucket can support multiple IPs. It is a good
idea though; I wish I had thought of it last year!


  I am not sure if 4-link LACP will be problematic with enough systems in
 the
  cluster. Maybe 8 osd-nodes will not be enough to balance it out.
  It is not important if every client is able to get peak performance out
 of
 it.
 
   I have implemented a small cluster with no SSD journals, and the
   performance is pretty good.
  
   42 osds, 3x replication, 40Gb NICs rados bench shows me 2000 iops at
   4k writes and 500MBps at 4M writes.
  
   I would trade your SSD journals for 10Gb NICs and switches.  I started
   out with the same 4x 1Gb LACP config and things like
   rebalancing/recovery were terribly slow, as well as the throughput
 limit
 I
  mentioned above.
 
  The SSDs are about ~100USD a piece. I tried to find cost-efficient 10G-
  switches. There it also the power-efficiency in question, a 10G-T Port
 burns
  about 3~5 Watt on its own. Which would put SFP+ Ports (0.7W/Port) on the
  table.

 I think the latest switches/Nic's reduce this slightly more if you enable
 the power saving options and keep the cable length short.

 
  Can you recommend a 'cheap' 10G-switch/NICs?

 I using the Dell N4032's. they seem to do the job and aren't too expensive.
 For the server side, we got servers with 10GB-T built in for almost the
 same
 cost at the 4x1GB models.


I'm using a pair of Cisco Nexus 5672UP switches. There are other Nexus
5000 models that are less expensive, but it's pretty affordable for 48 10Gb
ports and 6 40Gb uplinks.

I have Cisco UCS servers that have the Cisco VICs.


 
   When you get more funding next quarter/year, you can choose to add the
   SSD journals or more OSD nodes. Moving to 10Gb networking after you
   get the cluster up and running will be much harder.
 
  My thinking was that the switches (EX3300) with their 10G uplinks would
  deliver in the case that I would like to add in some 10G switches and
 hosts
  later.
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com javascript:;
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on Solaris / Illumos

2015-04-17 Thread Jake Young
On Friday, April 17, 2015, Michal Kozanecki mkozane...@evertz.com wrote:

 Performance on ZFS on Linux (ZoL) seems to be fine, as long as you use the
 CEPH generic filesystem implementation (writeahead) and not the specific
 CEPH ZFS implementation, CoW snapshoting that CEPH does with ZFS support
 compiled in absolutely kills performance. I suspect the same would go with
 CEPH on Illumos on ZFS. Otherwise it is comparable to XFS in my own testing
 once tweaked.

 There are a few oddities/quirks with ZFS performance that need to be
 tweaked when using it with CEPH, and yea enabling SA on xattr is one of
 them.

 1. ZFS recordsize - The ZFS sector size, known as within ZFS as the
 recordsize is technically dynamic. It only enforces the maximum size,
 however the way CEPH writes and reads from objects (when working with
 smaller blocks, let's say 4k or 8k via rbd) with default settings seems to
 be affected by the recordsize. With the default 128K I've found lower IOPS
 and higher latency. Setting the recordsize too low will inflate various ZFS
 metadata, so it needs to be balanced against how your CEPH pool will be
 used.

 For rbd pools(where small block performance may be important) a recordsize
 of 32K seems to be a good balance. For pure large object based use (rados,
 etc) the 128K default is fine, throughput is high(small block performance
 isn't important here). See following links for more info about recordsize:
 https://blogs.oracle.com/roch/entry/tuning_zfs_recordsize and
 https://www.joyent.com/blog/bruning-questions-zfs-record-size

 2. XATTR - I didn't do much testing here, I've read that if you do not set
 xattr = sa on ZFS you will get poor performance. There were also stability
 issues in the past with xattr = sa on ZFS though it seems all resolved now
 and I have not encountered any issues myself. I'm unsure what the default
 setting is here, I always enable it.

 Make sure you enable and set xattr = sa on ZFS.

 3. ZIL(ZFS Intent Log, also known as the slog) is a MUST (even with a
 separate ceph journal) - It appears that while the ceph journal
 offloads/absorbs writes nicely and boosts performance, it does not
 consolidate writes enough for ZFS. Without a ZIL/SLOG your performance will
 be very sawtooth like (jumpy, stutter, aka fast then slow, fast than slow
 over a period of 10-15 seconds).

 In theory tweaking the various ZFS TXG sync settings might work, but it is
 overly complicated to maintain and likely would only apply to the specific
 underlying disk model. Disabling sync also resolves this, though you'll
 lose the last TXG on a power failure - this might be okay with CEPH, but
 since I'm unsure I'll just assume it is not. IMHO avoid too much evil
 tuning, just add a ZIL/SLOG.

 4. ZIL/SLOG + on-device ceph journal vs ZIL/SLOG + separate ceph journal -
 Performance is very similar, if you have a ZIL/SLOG you could easily get
 away without a separate ceph journal and leave it on the device/ZFS
 dataset. HOWEVER this causes HUGE amounts of fragmentation due to the CoW
 nature. After only a few days usage, performance tanked with the ceph
 journal on the same device.

 I did find that if you partition and share device/SSD between both
 ZIL/SLOG and a separate ceph journal, the resulting performance is about
 the same in pure throughput/iops, though latency is slightly higher. This
 is what I do in my test cluster.

 5. Fragmentation - once you hit around 80-90% disk usage your performance
 will start to slow down due to fragmentation. This isn't due to CEPH, it’s
 a known ZFS quirk due to its CoW nature. Unfortunately there is no defrag
 in ZFS, and likely never will be (the mythical block point rewrite unicorn
 you'll find people talking about).

 There is one way to delay it and possibly avoid it however, enable
 metaslab_debug, this will put the ZFS spacemaps in memory, allowing ZFS to
 make better placements during CoW operations, but it does use more memory.
 See the following links for more detail about spacemaps and fragmentation:
 http://blog.delphix.com/uday/2013/02/19/78/ and
 http://serverfault.com/a/556892 and
 http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html

 There's alot more to ZFS and things-to-know than that (L2ARC uses ARC
 metadata space, dedupe uses ARC metadata space, etc), but as far as CEPH is
 cocearned the above is a good place to start. ZFS IMHO is a great solution,
 but it requires some time and effort to do it right.

 Cheers,

 Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com
 javascript:;


Thank you for taking the time to share that, Michal!
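
For anyone skimming, the ZFS-side settings discussed above boil down to
something like this (dataset, pool and device names are placeholders):

zfs set recordsize=32K tank/ceph-osd0    # smaller records for rbd-heavy pools
zfs set xattr=sa tank/ceph-osd0          # keep xattrs in the dnode
zpool add tank log /dev/disk/by-id/nvme-slog-part1    # dedicated ZIL/SLOG device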

Jake



 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com javascript:;]
 On Behalf Of Mark Nelson
 Sent: April-15-15 12:22 PM
 To: Jake Young
 Cc: ceph-users@lists.ceph.com javascript:;
 Subject: Re: [ceph-users] Ceph on Solaris / Illumos

 On 04/15/2015 10:36 AM, Jake Young wrote:
 
 
  On Wednesday, April 15, 2015, Mark Nelson mnel

[ceph-users] Ceph on Solaris / Illumos

2015-04-15 Thread Jake Young
Has anyone compiled ceph (either osd or client) on a Solaris based OS?

The thread on ZFS support for osd got me thinking about using solaris as an
osd server. It would have much better ZFS performance and I wonder if the
osd performance without a journal would be 2x better.

A second thought I had was using the Comstar iscsi / fcoe target software
that is part of Solaris. Has anyone done anything with a ceph rbd client
for Solaris based OSs?

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on Solaris / Illumos

2015-04-15 Thread Jake Young
On Wednesday, April 15, 2015, Mark Nelson mnel...@redhat.com wrote:



 On 04/15/2015 08:16 AM, Jake Young wrote:

 Has anyone compiled ceph (either osd or client) on a Solaris based OS?

 The thread on ZFS support for osd got me thinking about using solaris as
 an osd server. It would have much better ZFS performance and I wonder if
 the osd performance without a journal would be 2x better.


 Doubt it.  You may be able to do a little better, but you have to pay the
 piper some how.  If you clone from journal you will introduce
 fragmentation.  If you throw the journal away you'll suffer for everything
 but very large writes unless you throw safety away.  I think if we are
 going to generally beat filestore (not just for optimal benchmarking
 tests!) it's going to take some very careful cleverness. Thankfully Sage is
 very clever and is working on it in newstore. Even there, filestore has
 been proving difficult to beat for writes.


That's interesting. I've been under the impression that the ideal
osd config was using a stable and fast BTRFS (which doesn't exist yet) with
no journal.

In my specific case, I don't want to use an external journal. I've gone
down the path of using RAID controllers with write-back cache and BBUs with
each disk in its own RAID0 group, instead of SSD journals. (Thanks for your
performance articles BTW, they were very helpful!)

My take on your results indicates that IO throughput performance on XFS
with same disk journal and WB cache on the RAID card was basically the same
or better than BTRFS with no journal.  In addition, BTRFS typically used
much more CPU.

Has BTRFS performance gotten any better since you wrote the performance
articles?

Have you compared ZFS (ZoL) performance to BTRFS?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on Solaris / Illumos

2015-04-15 Thread Jake Young
On Wednesday, April 15, 2015, Alexandre Marangone amara...@redhat.com
wrote:

 The LX branded zones might be a way to run OSDs on Illumos:
 https://wiki.smartos.org/display/DOC/LX+Branded+Zones

 For fun, I tried a month or so ago, managed to have a quorum. OSDs
 wouldn't start, I didn't look further as far as debugging. I'll give
 it a go when I have more time.


Hmm. That is a great idea.

I'll give LX branded zones a shot for both server and client use cases.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cores/Memory/GHz recommendation for SSD based OSD servers

2015-04-02 Thread Jake Young
On Thursday, April 2, 2015, Nick Fisk n...@fisk.me.uk wrote:

 I'm probably going to get shot down for saying this...but here goes.

 As a very rough guide, think of it more as you need around 10Mhz for every
 IO, whether that IO is 4k or 4MB it uses roughly the same amount of CPU, as
 most of the CPU usage is around ceph data placement rather than the actual
 read/writes to disk.


That piece of information is, by far, one of the most helpful things I've
ever read on this list regarding hardware configuration. Thanks for sharing
that!

That calculation came close to my cluster's max iops.  I've seen just over
11k iops (under ideal conditions with short bursts of io); the 10MHz
calculation says 12k iops.

For the record, my cluster is 6 osd nodes, each node has:
2x 4 core, 2.5GHz CPUs
32GB RAM
7x 3.5" 7.2k rpm 2TB disks (one for each osd)
RAID card with 1GB write-back cache w/ BBU
2x 40Gb NIC
No ssd journals
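
As a rough sanity check of the 10MHz rule against the hardware above:

6 nodes x 2 CPUs x 4 cores x 2500 MHz = 120,000 MHz of CPU
120,000 MHz / 10 MHz per IO ~= 12,000 iops ceiling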

What effect does replication have on the 10MHz/IO number, in your
experience?  My 11k iops was achieved with 2x replication.  I've seen over
10k iops with 3x replication. Typically, I can get 2k - 3k iops with long
sequential io patterns.

I'm getting my budget ready for next quarter, so I've been trying to decide
how to spend money to best improve Ceph performance.

To improve long sequential write io, I've been debating adding a PCI flash
accelerator card to each osd node vs just adding another 6 osd nodes. The
cost is about the same.


 I can nearly saturate 12x2.1ghz cores with a single SSD, doing 4k ios at
 high queue depths.

  Which brings us back to your original question: rather than asking how
  much CPU for x amount of SSDs, how many IOs do you require out of your
  cluster?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt and krbd

2015-03-06 Thread Jake Young
My initiator is also VMware software iscsi.  I had my tgt iscsi targets'
write-cache setting off.

I turned write and read cache on in the middle of creating a large eager
zeroed disk (tgt has no VAAI support, so this is all regular synchronous
IO) and it did give me a clear performance boost.

Not orders of magnitude, but maybe 15% faster.

If the image makes it to the list, the yellow line is write KBps.  It went
from about 85MBps to about 100MBps.  What was more noticeable was that the
latency (grey line) went from around 250 ms to 130ms.
[image: Inline image 1]

I'm pretty sure this IO (zeroing) is always 1MB writes, so I don't think
this caused my write size to change.  Maybe it did something to the iSCSI
packets?

Jake

On Fri, Mar 6, 2015 at 9:04 AM, Nick Fisk n...@fisk.me.uk wrote:





 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jake Young
 Sent: 06 March 2015 12:52
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] tgt and krbd



 On Thursday, March 5, 2015, Nick Fisk n...@fisk.me.uk wrote:
 Hi All,

 Just a heads up after a day’s experimentation.

 I believe tgt with its default settings has a small write cache when
 exporting a kernel mapped RBD. Doing some write tests I saw 4 times the
 write throughput when using tgt aio + krbd compared to tgt with the builtin
 librbd.

 After running the following command against the LUN, which apparently
 disables write cache, Performance dropped back to what I am seeing using
 tgt+librbd and also the same as fio.

 tgtadm --op update --mode logicalunit --tid 2 --lun 3 -P
 mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0

 From that I can only deduce that using tgt + krbd in its default state is
 not 100% safe to use, especially in an HA environment.

 Nick




 Hey Nick,

 tgt actually does not have any caches. No read, no write.  tgt's design is
 to passthrough all commands to the backend as efficiently as possible.

 http://lists.wpkg.org/pipermail/stgt/2013-May/005788.html

 The configuration parameters just inform the initiators whether the
 backend storage has a cache. Clearly this makes a big difference for you.
 What initiator are you using with this test?

 Maybe the kernel is doing the caching.  What tuning parameters do you have
 on the krbd disk?

 It could be that using aio is much more efficient. Maybe built in lib rbd
 isn't doing aio?

 Jake


 Hi Jake,

 Hmm that's interesting, it's definitely affecting write behaviour though.

 I was running iometer doing single io depth writes in a windows VM on ESXi
 using its software initiator, which as far as I’m aware should be sending
 sync writes for each request.

 I saw in iostat on the tgt server that my 128kb writes were being
 coalesced into ~1024kb writes, which would explain the performance
 increase. So something somewhere is doing caching, albeit on a small scale.

  The krbd disk was all using default settings. I know the RBD support for
 tgt is using the librbd sync writes which I suppose might explain the
 default difference, but this should be the expected behaviour.

 Nick





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt and krbd

2015-03-06 Thread Jake Young
On Thursday, March 5, 2015, Nick Fisk n...@fisk.me.uk wrote:

 Hi All,



 Just a heads up after a day’s experimentation.



 I believe tgt with its default settings has a small write cache when
 exporting a kernel mapped RBD. Doing some write tests I saw 4 times the
 write throughput when using tgt aio + krbd compared to tgt with the builtin
 librbd.



 After running the following command against the LUN, which apparently
 disables write cache, Performance dropped back to what I am seeing using
 tgt+librbd and also the same as fio.



 tgtadm --op update --mode logicalunit --tid 2 --lun 3 -P
 mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0



 From that I can only deduce that using tgt + krbd in its default state is
 not 100% safe to use, especially in an HA environment.



 Nick



Hey Nick,

tgt actually does not have any caches. No read, no write.  tgt's design is
to passthrough all commands to the backend as efficiently as possible.

http://lists.wpkg.org/pipermail/stgt/2013-May/005788.html

The configuration parameters just inform the initiators whether the backend
storage has a cache. Clearly this makes a big difference for you.  What
initiator are you using with this test?

Maybe the kernel is doing the caching.  What tuning parameters do you have
on the krbd disk?

It could be that using aio is much more efficient. Maybe built in lib rbd
isn't doing aio?

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt and krbd

2015-03-06 Thread Jake Young
On Fri, Mar 6, 2015 at 10:18 AM, Nick Fisk n...@fisk.me.uk wrote:

 On Fri, Mar 6, 2015 at 9:04 AM, Nick Fisk n...@fisk.me.uk wrote:





 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Jake Young
 Sent: 06 March 2015 12:52
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] tgt and krbd



 On Thursday, March 5, 2015, Nick Fisk n...@fisk.me.uk wrote:
 Hi All,

 Just a heads up after a day’s experimentation.

 I believe tgt with its default settings has a small write cache when
 exporting a kernel mapped RBD. Doing some write tests I saw 4 times the
 write throughput when using tgt aio + krbd compared to tgt with the builtin
 librbd.

 After running the following command against the LUN, which apparently
 disables write cache, Performance dropped back to what I am seeing using
 tgt+librbd and also the same as fio.

 tgtadm --op update --mode logicalunit --tid 2 --lun 3 -P
 mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0

 From that I can only deduce that using tgt + krbd in its default state is
 not 100% safe to use, especially in an HA environment.

 Nick




 Hey Nick,

 tgt actually does not have any caches. No read, no write.  tgt's design is
 to passthrough all commands to the backend as efficiently as possible.


  http://lists.wpkg.org/pipermail/stgt/2013-May/005788.html

 The configuration parameters just inform the initiators whether the
 backend storage has a cache. Clearly this makes a big difference for you.
 What initiator are you using with this test?

 Maybe the kernel is doing the caching.  What tuning parameters do you have
 on the krbd disk?

 It could be that using aio is much more efficient. Maybe built in lib rbd
 isn't doing aio?

 Jake


 Hi Jake,

  Hmm that's interesting, it's definitely affecting write behaviour though.

 I was running iometer doing single io depth writes in a windows VM on ESXi
 using its software initiator, which as far as I’m aware should be sending
 sync writes for each request.

 I saw in iostat on the tgt server that my 128kb writes were being
 coalesced into ~1024kb writes, which would explain the performance
 increase. So something somewhere is doing caching, albeit on a small scale.

  The krbd disk was all using default settings. I know the RBD support for
 tgt is using the librbd sync writes which I suppose might explain the
 default difference, but this should be the expected behaviour.

 Nick




 *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
 Of *Jake Young
 *Sent:* 06 March 2015 15:07

 *To:* Nick Fisk
 *Cc:* ceph-users@lists.ceph.com
 *Subject:* Re: [ceph-users] tgt and krbd



 My initator is also VMware software iscsi.  I had my tgt iscsi targets'
 write-cache setting off.

 I turned write and read cache on in the middle of creating a large eager
 zeroed disk (tgt has no VAAI support, so this is all regular synchronous
 IO) and it did give me a clear performance boost.

 Not orders of magnitude, but maybe 15% faster.

 If the image makes it to the list, the yellow line is write KBps.  It went
 from about 85MBps to about 100MBps.  What was more noticeable was that the
 latency (grey line) went from around 250 ms to 130ms.

 [image: Inline image 1]

 I'm pretty sure this IO (zeroing) is always 1MB writes, so I don't think
 this caused my write size to change.  Maybe it did something to the iSCSI
 packets?



 Jake




 Hi Jake,



 Good to see it’s not just me.



 I’m guessing that the fact you are doing 1MB writes means that the latency
 difference is having a less noticeable impact on the overall write
 bandwidth. What I have been discovering with Ceph + iSCSi is that due to
 all the extra hops (client -> iscsi proxy -> pri OSD -> sec OSD) is that you get
 a lot of latency serialisation which dramatically impacts single threaded
 iops at small IO sizes.


That makes sense.  I don't really understand how latency is going down if
tgt is not really doing caching.




 A few days back I tested adding a tiny SSD write cache on the iscsi proxy
 and this had a dramatic effect in “hiding” the latency behind it from the
 client.



 Nick


After seeing your results, I've been considering experimenting with that.
Currently, my iSCSI proxy nodes are VMs.

I would like to build a few dedicated servers with fast SSDs or fusion-io
devices.  It depends on my budget, it's hard to justify getting a card that
costs 10x the rest of the server...  I would run all my tgt instances in
containers pointing to the rbd disk+cache device.  A fusion-io device could
support many tgt containers.

I don't really want to go back to krbd.  I have a few rbd's that are format
2 with striping, there aren't any stable kernels that support that (or any
kernels at all yet for fancy striping).  I wish there was a way to
incorporate a local cache device into tgt with librbd backends.

Jake

Re: [ceph-users] tgt and krbd

2015-03-06 Thread Jake Young
On Friday, March 6, 2015, Steffen W Sørensen ste...@me.com wrote:


 On 06/03/2015, at 16.50, Jake Young jak3...@gmail.com wrote:
 
  After seeing your results, I've been considering experimenting with
 that.  Currently, my iSCSI proxy nodes are VMs.
 
  I would like to build a few dedicated servers with fast SSDs or
 fusion-io devices.  It depends on my budget, it's hard to justify getting a
 card that costs 10x the rest of the server...  I would run all my tgt
 instances in containers pointing to the rbd disk+cache device.  A fusion-io
 device could support many tgt containers.
 
  I don't really want to go back to krbd.  I have a few rbd's that are
 format 2 with striping, there aren't any stable kernels that support that
 (or any kernels at all yet for fancy striping).

  I wish there was a way to incorporate a local cache device into tgt with
 librbd backends.
 What about a ram disk device like rapid disk+cache in front of your rbd
 block device

 http://www.rapiddisk.org/?page_id=15#rapiddisk

 /Steffen


I could try that in my VM to prototype the solution before I buy hardware.

RAM based cache is pretty dangerous for this application. If I reboot the
VM and don't disconnect the initiators, there would most likely be data
corruption, or at the very least data loss.

Thanks for the suggestion,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph, LIO, VMWARE anyone?

2015-01-23 Thread Jake Young
Thanks for the feedback Nick and Zoltan,

I have been seeing periodic kernel panics when I used LIO.  It was either
due to LIO or the kernel rbd mapping.  I have seen this on Ubuntu precise
with kernel 3.14.14 and again in Ubunty trusty with the utopic kernel
(currently 3.16.0-28).  Ironically, this is the primary reason I started
exploring a redundancy solution for my iSCSI proxy node.  So, yes, these
crashes have nothing to do with running the Active/Active setup.

I am moving my entire setup from LIO to rbd enabled tgt, which I've found
to be much more stable and gives equivalent performance.

I've been testing active/active LIO since July of 2014 with VMWare and I've
never seen any vmfs corruption.  I am now convinced (thanks Nick) that it
is possible.  The reason I have not seen any corruption may have to do with
how VMWare happens to be configured.

Originally, I had made a point to use round robin path selection in the
VMware hosts; but as I did performance testing, I found that it actually
didn't help performance.  When the host switches iSCSI targets there is a
short spin up time for LIO to get to 100% IO capability.  Since round
robin switches targets every 30 seconds (60 seconds? I forget), this seemed
to be significant.  A secondary goal for me was to end up with a config
that required minimal tuning from VMWare and the target software; so the
obvious choice is to leave VMWare's path selection at the default which is
Fixed and picks the first target in ASCII-betical order.  That means I am
actually functioning in Active/Passive mode.

Jake




On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy 
zol...@linux.vnet.ibm.com wrote:

  Just to chime in: it will look fine, feel fine, but underneath it's quite
 easy to get VMFS corruption. Happened in our tests.
 Also if you're running LIO, from time to time expect a kernel panic
 (haven't tried with the latest upstream, as I've been using
 Ubuntu 14.04 on my export hosts for the test, so might have improved...).

 As of now I would not recommend this setup without being aware of the
 risks involved.

 There have been a few upstream patches getting the LIO code in better
 cluster-aware shape, but no idea if they have been merged
 yet. I know RedHat has a guy on this.

 On 01/21/2015 02:40 PM, Nick Fisk wrote:

  Hi Jake,



 Thanks for this, I have been going through this and have a pretty good
 idea on what you are doing now; however I may be missing something looking
 through your scripts, but I’m still not quite understanding how you are
 managing to make sure locking is happening with the ESXi ATS SCSI command.



 From this slide




  https://wiki.ceph.com/@api/deki/files/38/hammer-ceph-devel-summit-scsi-target-clustering.pdf
 (Page 8)



 It seems to indicate that for a true active/active setup the two targets
 need to be aware of each other and exchange locking information for it to
 work reliably, I’ve also watched the video from the Ceph developer summit
 where this is discussed and it seems that Ceph+Kernel need changes to allow
 this locking to be pushed back to the RBD layer so it can be shared, from
 what I can see browsing through the Linux Git Repo, these patches haven’t
 made the mainline kernel yet.



 Can you shed any light on this? As tempting as having active/active is,
 I’m wary about using the configuration until I understand how the locking
 is working and if fringe cases involving multiple ESXi hosts writing to the
 same LUN on different targets could spell disaster.



 Many thanks,

 Nick



 *From:* Jake Young [mailto:jak3...@gmail.com jak3...@gmail.com]
 *Sent:* 14 January 2015 16:54

 *To:* Nick Fisk
 *Cc:* Giuseppe Civitella; ceph-users
 *Subject:* Re: [ceph-users] Ceph, LIO, VMWARE anyone?



 Yes, it's active/active and I found that VMWare can switch from path to
 path with no issues or service impact.





 I posted some config files here: github.com/jak3kaj/misc
  http://github.com/jak3kaj/misc



 One set is from my LIO nodes, both the primary and secondary configs so
 you can see what I needed to make unique.  The other set (targets.conf) are
 from my tgt nodes.  They are both 4 LUN configs.



 Like I said in my previous email, there is no performance difference
 between LIO and tgt.  The only service I'm running on these nodes is a
 single iscsi target instance (either LIO or tgt).



 Jake



 On Wed, Jan 14, 2015 at 8:41 AM, Nick Fisk n...@fisk.me.uk wrote:

  Hi Jake,



 I can’t remember the exact details, but it was something to do with a
 potential problem when using the pacemaker resource agents. I think it was
 to do with a potential hanging issue when one LUN on a shared target failed
 and then it tried to kill all the other LUNS to fail the target over to
 another host. This then leaves the TCM part of LIO locking the RBD which
 also can’t

Re: [ceph-users] Ceph, LIO, VMWARE anyone?

2015-01-23 Thread Jake Young
I would go with tgt regardless of your HA solution. I tried to use LIO for
a long time and am glad I finally seriously tested tgt. Two big reasons are

1) latest rbd code will be in tgt
2) two less reasons for a kernel panic in the proxy node (rbd and iscsi)

For me, I'm comfortable with how my system is configured with the
Active/Passive config. This only because of the network architecture and
the fact that I administer the ESXi hosts. I also have separate rbd disks
for each environment, so if I do get VMFS corruption, it is isolated to one
system.

Another thing I forgot is that I disabled all the VAAI acceleration based
on this advice when using tgt:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039670.html
I was having poor performance with VAAI turned on and tgt. LIO performed
the same with or without VAAI for my workload.  I'm not sure if that
changes the way VMFS locking works enough to sidestep the issue. I think
that I'm falling back to just persistent SCSI reservations instead of ATS.
I think I'm still open to corruption for the same reason.  See here if you
haven't already for more details on VMFS locking:
http://blogs.vmware.com/vsphere/2012/05/vmfs-locking-uncovered.html

Jake

On Friday, January 23, 2015, Nick Fisk n...@fisk.me.uk wrote:

 Thanks for your responses guys,



 I’ve been spending a lot of time looking at this recently and I think I’m
 even more confused than when I started.



 I've been looking at trying to adapt a resource agent made by tiger computing
 (https://github.com/tigercomputing/ocf-lio) to create an HA LIO failover
 target. Instead of going with the virtual IP failover method it manipulates
 the ALUA states to present active/standby paths. It's very complicated and I
 am close to giving up.



 What do you reckon: accept defeat and go with a much simpler tgt and
 virtual IP failover solution for the time being until the Red Hat patches
 make their way into the kernel?



  *From:* Jake Young [mailto:jak3...@gmail.com]
 *Sent:* 23 January 2015 16:46
 *To:* Zoltan Arnold Nagy
 *Cc:* Nick Fisk; ceph-users
 *Subject:* Re: [ceph-users] Ceph, LIO, VMWARE anyone?



 Thanks for the feedback Nick and Zoltan,



 I have been seeing periodic kernel panics when I used LIO.  It was either
 due to LIO or the kernel rbd mapping.  I have seen this on Ubuntu precise
 with kernel 3.14.14 and again in Ubunty trusty with the utopic kernel
 (currently 3.16.0-28).  Ironically, this is the primary reason I started
 exploring a redundancy solution for my iSCSI proxy node.  So, yes, these
 crashes have nothing to do with running the Active/Active setup.



 I am moving my entire setup from LIO to rbd enabled tgt, which I've found
 to be much more stable and gives equivalent performance.



 I've been testing active/active LIO since July of 2014 with VMWare and
 I've never seen any vmfs corruption.  I am now convinced (thanks Nick) that
 it is possible.  The reason I have not seen any corruption may have to do
 with how VMWare happens to be configured.



 Originally, I had made a point to use round robin path selection in the
 VMware hosts; but as I did performance testing, I found that it actually
 didn't help performance.  When the host switches iSCSI targets there is a
 short spin up time for LIO to get to 100% IO capability.  Since round
 robin switches targets every 30 seconds (60 seconds? I forget), this seemed
 to be significant.  A secondary goal for me was to end up with a config
 that required minimal tuning from VMWare and the target software; so the
 obvious choice is to leave VMWare's path selection at the default which is
 Fixed and picks the first target in ASCII-betical order.  That means I am
 actually functioning in Active/Passive mode.



 Jake









 On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy 
  zol...@linux.vnet.ibm.com wrote:

 Just to chime in: it will look fine, feel fine, but underneath it's quite
 easy to get VMFS corruption. Happened in our tests.
 Also if you're running LIO, from time to time expect a kernel panic
 (haven't tried with the latest upstream, as I've been using
 Ubuntu 14.04 on my export hosts for the test, so might have improved...).

 As of now I would not recommend this setup without being aware of the
 risks involved.

 There have been a few upstream patches getting the LIO code in better
 cluster-aware shape, but no idea if they have been merged
 yet. I know RedHat has a guy on this.

 On 01/21/2015 02:40 PM, Nick Fisk wrote:

 Hi Jake,



 Thanks for this, I have been going through this and have a pretty good
 idea on what you are doing now; however I may be missing something looking
 through your scripts, but I’m still not quite understanding how you are
 managing to make sure locking is happening with the ESXi ATS SCSI command.



 From

Re: [ceph-users] Ceph, LIO, VMWARE anyone?

2015-01-16 Thread Jake Young
Yes, it's active/active and I found that VMWare can switch from path to
path with no issues or service impact.


I posted some config files here: github.com/jak3kaj/misc

One set is from my LIO nodes, both the primary and secondary configs so you
can see what I needed to make unique.  The other set (targets.conf) are
from my tgt nodes.  They are both 4 LUN configs.
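
For anyone who doesn't want to dig through the repo, a minimal rbd-backed
stanza in targets.conf looks roughly like this (the IQN, pool and image
names here are placeholders, not my actual config):

<target iqn.2015-01.com.example:rbd-lun0>
    driver iscsi
    bs-type rbd
    backing-store rbd/vm-disk-0    # pool/image
    initiator-address ALL
</target>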

Like I said in my previous email, there is no performance difference
between LIO and tgt.  The only service I'm running on these nodes is a
single iscsi target instance (either LIO or tgt).

Jake

On Wed, Jan 14, 2015 at 8:41 AM, Nick Fisk n...@fisk.me.uk wrote:

 Hi Jake,



 I can’t remember the exact details, but it was something to do with a
 potential problem when using the pacemaker resource agents. I think it was
 to do with a potential hanging issue when one LUN on a shared target failed
 and then it tried to kill all the other LUNS to fail the target over to
 another host. This then leaves the TCM part of LIO locking the RBD which
 also can’t fail over.



 That said I did try multiple LUNS on one target as a test and didn’t
 experience any problems.



 I’m interested in the way you have your setup configured though. Are you
 saying you effectively have an active/active configuration with a path
 going to either host, or are you failing the iSCSI IP between hosts? If
 it’s the former, have you had any problems with scsi
 locking/reservations…etc between the two targets?



 I can see the advantage to that configuration as you reduce/eliminate a
 lot of the troubles I have had with resources failing over.



 Nick



 *From:* Jake Young [mailto:jak3...@gmail.com]
 *Sent:* 14 January 2015 12:50
 *To:* Nick Fisk
 *Cc:* Giuseppe Civitella; ceph-users
 *Subject:* Re: [ceph-users] Ceph, LIO, VMWARE anyone?



 Nick,



 Where did you read that having more than 1 LUN per target causes stability
 problems?



 I am running 4 LUNs per target.



 For HA I'm running two linux iscsi target servers that map the same 4 rbd
 images. The two targets have the same serial numbers, T10 address, etc.  I
 copy the primary's config to the backup and change IPs. This way VMWare
 thinks they are different target IPs on the same host. This has worked very
 well for me.



 One suggestion I have is to try using rbd enabled tgt. The performance is
 equivalent to LIO, but I found it is much better at recovering from a
 cluster outage. I've had LIO lock up the kernel or simply not recognize
 that the rbd images are available; where tgt will eventually present the
 rbd images again.



 I have been slowly adding servers and am expanding my test setup to a
 production setup (nice thing about ceph). I now have 6 OSD hosts with 7
 disks on each. I'm using the LSI Nytro cache raid controller, so I don't
 have a separate journal and have 40Gb networking. I plan to add another 6
 OSD hosts in another rack in the next 6 months (and then another 6 next
 year). I'm doing 3x replication, so I want to end up with 3 racks.



 Jake

 On Wednesday, January 14, 2015, Nick Fisk n...@fisk.me.uk wrote:

 Hi Giuseppe,



 I am working on something very similar at the moment. I currently have it
 working on some test hardware but seems to be working reasonably well.



 I say reasonably as I have had a few instabilities, but these are on the HA
 side, the LIO and RBD side of things have been rock solid so far. The main
 problems I have had seem to be around recovering from failure with
 resources ending up in a unmanaged state. I’m not currently using fencing
 so this may be part of the cause.



 As a brief description of my configuration.



 4 Hosts each having 2 OSD’s also running the monitor role

 3 additional hosts in an HA cluster which act as iSCSI proxy nodes.



 I’m using the IP, RBD, iSCSITarget and iSCSILUN resource agents to provide
 HA iSCSI LUN which maps back to a RBD. All the agents for each RBD are in a
 group so they follow each other between hosts.



 I’m using 1 LUN per target as I read somewhere there are stability
 problems using more than 1 LUN per target.



 Performance seems ok, I can get about 1.2k random IO's out of the iSCSI LUN.
 This seems to be about right for the Ceph cluster size, so I don't think
 the LIO part is causing any significant overhead.



 We should be getting our production hardware shortly which will have 40
 OSD’s with journals and a SSD caching tier, so within the next month or so
 I will have a better idea of running it in a production environment and the
 performance of the system.



 Hope that helps, if you have any questions, please let me know.



 Nick



 *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
 Of *Giuseppe Civitella
 *Sent:* 13 January 2015 11:23
 *To:* ceph-users
 *Subject:* [ceph-users] Ceph, LIO, VMWARE anyone?



 Hi all,



 I'm working on a lab setup regarding Ceph serving rbd images as ISCSI
 datastores to VMWARE via a LIO box. Is there someone that already did
 something similar

Re: [ceph-users] Ceph, LIO, VMWARE anyone?

2015-01-14 Thread Jake Young
Nick,

Where did you read that having more than 1 LUN per target causes stability
problems?

I am running 4 LUNs per target.

For HA I'm running two linux iscsi target servers that map the same 4 rbd
images. The two targets have the same serial numbers, T10 address, etc.  I
copy the primary's config to the backup and change IPs. This way VMWare
thinks they are different target IPs on the same host. This has worked very
well for me.

One suggestion I have is to try using rbd enabled tgt. The performance is
equivalent to LIO, but I found it is much better at recovering from a
cluster outage. I've had LIO lock up the kernel or simply not recognize
that the rbd images are available; where tgt will eventually present the
rbd images again.

I have been slowly adding servers and am expanding my test setup to a
production setup (nice thing about ceph). I now have 6 OSD hosts with 7
disks on each. I'm using the LSI Nytro cache raid controller, so I don't
have a separate journal and have 40Gb networking. I plan to add another 6
OSD hosts in another rack in the next 6 months (and then another 6 next
year). I'm doing 3x replication, so I want to end up with 3 racks.

Jake

On Wednesday, January 14, 2015, Nick Fisk n...@fisk.me.uk wrote:

 Hi Giuseppe,



 I am working on something very similar at the moment. I currently have it
 working on some test hardware but seems to be working reasonably well.



 I say reasonably as I have had a few instabilities, but these are on the HA
 side, the LIO and RBD side of things have been rock solid so far. The main
 problems I have had seem to be around recovering from failure with
 resources ending up in a unmanaged state. I’m not currently using fencing
 so this may be part of the cause.



 As a brief description of my configuration.



 4 Hosts each having 2 OSD’s also running the monitor role

 3 additional hosts in an HA cluster which act as iSCSI proxy nodes.



 I’m using the IP, RBD, iSCSITarget and iSCSILUN resource agents to provide
 HA iSCSI LUN which maps back to a RBD. All the agents for each RBD are in a
 group so they follow each other between hosts.



 I’m using 1 LUN per target as I read somewhere there are stability
 problems using more than 1 LUN per target.



 Performance seems ok, I can get about 1.2k random IO's out of the iSCSI LUN.
 This seems to be about right for the Ceph cluster size, so I don't think
 the LIO part is causing any significant overhead.



 We should be getting our production hardware shortly which will have 40
 OSD’s with journals and a SSD caching tier, so within the next month or so
 I will have a better idea of running it in a production environment and the
 performance of the system.



 Hope that helps, if you have any questions, please let me know.



 Nick



  *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
  Behalf Of *Giuseppe Civitella
 *Sent:* 13 January 2015 11:23
 *To:* ceph-users
 *Subject:* [ceph-users] Ceph, LIO, VMWARE anyone?



 Hi all,



 I'm working on a lab setup regarding Ceph serving rbd images as ISCSI
 datastores to VMWARE via a LIO box. Is there someone that already did
 something similar wanting to share some knowledge? Any production
 deployments? What about LIO's HA and luns' performances?



 Thanks

 Giuseppe


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-06 Thread Jake Young
On Monday, January 5, 2015, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

  When you shrink the RBD, most of the time is spent in
 librbd/internal.cc::trim_image(); in this function the client will iterate
 over all unnecessary objects (no matter whether they exist) and delete them.



 So in this case, when Edwin shrinks his RBD from 650PB to 650GB,
 there are [ (650PB * 1024GB/PB - 650GB) * 1024MB/GB ] / 4MB/Object =
 170,227,200 objects that need to be deleted. That will definitely take a long
 time, since the rbd client needs to send a delete request per object and the
 OSD needs to find the object context and delete it (or discover it doesn't
 exist at all). The time needed to trim an image is proportional to the size
 being trimmed.



  Making another image of the correct size and copying your VM's file system to
  the new image, then deleting the old one, will NOT help in general, just
  because deleting the old volume will take exactly the same time as shrinking;
  they both need to call trim_image().



  The solution in my mind may be that we can provide a "--skip-trimming" flag
  to skip the trimming. When the administrator is absolutely sure that no
  writes have taken place in the area being shrunk away (that means no
  objects were created in that area), they can use this flag to skip the
  time-consuming trimming.



 How do you think?


That sounds like a good solution, like doing an "undo grow image".




  *From:* Jake Young [mailto:jak3...@gmail.com]
 *Sent:* Monday, January 5, 2015 9:45 PM
 *To:* Chen, Xiaoxi
  *Cc:* Edwin Peer; ceph-users@lists.ceph.com
 *Subject:* Re: [ceph-users] rbd resize (shrink) taking forever and a day





  On Sunday, January 4, 2015, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

 You could use rbd info volume_name  to see the block_name_prefix, the
 object name is formed as block_name_prefix.sequence_number,  so for
 example, rb.0.ff53.3d1b58ba.e6ad should be the e6adth object  of
 the volume with block_name_prefix rb.0.ff53.3d1b58ba.

  $ rbd info huge
 rbd image 'huge':
  size 1024 TB in 268435456 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.8a14.2ae8944a
  format: 1

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Edwin Peer
 Sent: Monday, January 5, 2015 3:55 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

 Also, which rbd objects are of interest?

 snip
 ganymede ~ # rados -p client-disk-img0 ls | wc -l
 1672636
 /snip

 And, all of them have cryptic names like:

 rb.0.ff53.3d1b58ba.e6ad
 rb.0.6d386.1d545c4d.00011461
 rb.0.50703.3804823e.1c28
 rb.0.1073e.3d1b58ba.b715
 rb.0.1d76.2ae8944a.022d

 which seem to bear no resemblance to the actual image names that the rbd
 command line tools understands?

 Regards,
 Edwin Peer

 On 01/04/2015 08:48 PM, Jake Young wrote:
 
 
  On Sunday, January 4, 2015, Dyweni - Ceph-Users
  6exbab4fy...@dyweni.com mailto:6exbab4fy...@dyweni.com wrote:
 
  Hi,
 
  If its the only think in your pool, you could try deleting the
  pool instead.
 
  I found that to be faster in my testing; I had created 500TB when
  I meant to create 500GB.
 
  Note for the Devs: I would be nice if rbd create/resize would
  accept sizes with units (i.e. MB GB TB PB, etc).
 
 
 
 
  On 2015-01-04 08:45, Edwin Peer wrote:
 
  Hi there,
 
  I did something stupid while growing an rbd image. I accidentally
  mistook the units of the resize command for bytes instead of
  megabytes
  and grew an rbd image to 650PB instead of 650GB. This all
 happened
  instantaneously enough, but trying to rectify the mistake is
  not going
  nearly as well.
 
  snip
  ganymede ~ # rbd resize --size 665600 --allow-shrink
  client-disk-img0/vol-x318644f-0
  Resizing image: 1% complete...
  /snip
 
  It took a couple days before it started showing 1% complete
  and has
  been stuck on 1% for a couple more. At this rate, I should be
  able to
  shrink the image back to the intended size in about 2016.
 
  Any ideas?
 
  Regards,
  Edwin Peer
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
  You can just delete the rbd header. See Sebastien's excellent blog:
 
  http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your
  -ceph-cluster/
 
  Jake

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-05 Thread Jake Young
On Sunday, January 4, 2015, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

 You could use rbd info volume_name  to see the block_name_prefix, the
 object name is formed as block_name_prefix.sequence_number,  so for
 example, rb.0.ff53.3d1b58ba.e6ad should be the e6adth object  of
 the volume with block_name_prefix rb.0.ff53.3d1b58ba.

  $ rbd info huge
 rbd image 'huge':
  size 1024 TB in 268435456 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.8a14.2ae8944a
  format: 1

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of Edwin Peer
 Sent: Monday, January 5, 2015 3:55 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

 Also, which rbd objects are of interest?

 snip
 ganymede ~ # rados -p client-disk-img0 ls | wc -l
 1672636
 /snip

 And, all of them have cryptic names like:

 rb.0.ff53.3d1b58ba.e6ad
 rb.0.6d386.1d545c4d.00011461
 rb.0.50703.3804823e.1c28
 rb.0.1073e.3d1b58ba.b715
 rb.0.1d76.2ae8944a.022d

 which seem to bear no resemblance to the actual image names that the rbd
 command line tools understands?

 Regards,
 Edwin Peer

 On 01/04/2015 08:48 PM, Jake Young wrote:
 
 
  On Sunday, January 4, 2015, Dyweni - Ceph-Users
  6exbab4fy...@dyweni.com mailto:6exbab4fy...@dyweni.com wrote:
 
  Hi,
 
  If its the only think in your pool, you could try deleting the
  pool instead.
 
  I found that to be faster in my testing; I had created 500TB when
  I meant to create 500GB.
 
  Note for the Devs: I would be nice if rbd create/resize would
  accept sizes with units (i.e. MB GB TB PB, etc).
 
 
 
 
  On 2015-01-04 08:45, Edwin Peer wrote:
 
  Hi there,
 
  I did something stupid while growing an rbd image. I accidentally
  mistook the units of the resize command for bytes instead of
  megabytes
  and grew an rbd image to 650PB instead of 650GB. This all
 happened
  instantaneously enough, but trying to rectify the mistake is
  not going
  nearly as well.
 
  snip
  ganymede ~ # rbd resize --size 665600 --allow-shrink
  client-disk-img0/vol-x318644f-0
  Resizing image: 1% complete...
  /snip
 
  It took a couple days before it started showing 1% complete
  and has
  been stuck on 1% for a couple more. At this rate, I should be
  able to
  shrink the image back to the intended size in about 2016.
 
  Any ideas?
 
  Regards,
  Edwin Peer
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
  You can just delete the rbd header. See Sebastien's excellent blog:
 
  http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your
  -ceph-cluster/
 
  Jake
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Sorry, I misunderstood.

The simplest approach to me is to make another image of the correct size
and copy your VM's file system to the new image, then delete the old one.

The safest thing to do would be to mount the new file system from the VM
and do all the formatting / copying from there (the same way you'd move a
physical server's root disk to a new physical disk)
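
A minimal outline of that approach (pool/image names and the size below are
placeholders; note that deleting the oversized image will itself spend a long
time trimming objects):

rbd create client-disk-img0/vol-new --size 665600   # 650GB, size given in MB
# attach both images to the VM, mkfs the new one, copy the data across
# (e.g. rsync -aHAX /old/ /new/), repoint the VM, then:
rbd rm client-disk-img0/vol-x318644f-0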

I would not attempt to hack the rbd header. You open yourself up to some
unforeseen problems.

Unless one of the ceph developers can confirm that there is a safe way to
shrink an image, assuming we know that the file system has not grown since
the disk was grown.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-04 Thread Jake Young
On Sunday, January 4, 2015, Dyweni - Ceph-Users 6exbab4fy...@dyweni.com
wrote:

 Hi,

 If it's the only thing in your pool, you could try deleting the pool
 instead.

 I found that to be faster in my testing; I had created 500TB when I meant
 to create 500GB.

 Note for the Devs: It would be nice if rbd create/resize would accept sizes
 with units (i.e. MB, GB, TB, PB, etc).




 On 2015-01-04 08:45, Edwin Peer wrote:

 Hi there,

 I did something stupid while growing an rbd image. I accidentally
 mistook the units of the resize command for bytes instead of megabytes
 and grew an rbd image to 650PB instead of 650GB. This all happened
 instantaneously enough, but trying to rectify the mistake is not going
 nearly as well.

 snip
 ganymede ~ # rbd resize --size 665600 --allow-shrink
 client-disk-img0/vol-x318644f-0
 Resizing image: 1% complete...
 /snip

 It took a couple days before it started showing 1% complete and has
 been stuck on 1% for a couple more. At this rate, I should be able to
 shrink the image back to the intended size in about 2016.

 Any ideas?

 Regards,
 Edwin Peer
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


You can just delete the rbd header. See Sebastien's excellent blog:

http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Double-mounting of RBD

2014-12-17 Thread Jake Young
On Wednesday, December 17, 2014, Josh Durgin josh.dur...@inktank.com
wrote:

 On 12/17/2014 03:49 PM, Gregory Farnum wrote:

 On Wed, Dec 17, 2014 at 2:31 PM, McNamara, Bradley
 bradley.mcnam...@seattle.gov wrote:

 I have a somewhat interesting scenario.  I have an RBD of 17TB formatted
 using XFS.  I would like it accessible from two different hosts, one
 mapped/mounted read-only, and one mapped/mounted as read-write.  Both are
 shared using Samba 4.x.  One Samba server gives read-only access to the
 world for the data.  The other gives read-write access to a very limited
 set
 of users who occasionally need to add data.


 However, when testing this, when changes are made to the read-write Samba
 server the changes don’t seem to be seen by the read-only Samba server.
 Is
 there some file system caching going on that will eventually be flushed?



 Am I living dangerously doing what I have set up?  I thought I would
 avoid
 most/all potential file system corruption by making sure there is only
 one
 read-write access method.  Thanks for any answers.


 Well, you'll avoid corruption by only having one writer, but the other
 reader is still caching data in-memory that will prevent it from
 seeing the writes on the disk.
 Plus I have no idea if mounting xfs read-only actually prevents it
 from making any writes to the disk; I think some FSes will do stuff
 like defragment internal data structures in that mode, maybe?
 -Greg


 FSes mounted read-only still do tend to do things like journal replay,
 but since the block device is mapped read-only that won't be a problem
 in this case.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Someone commented that the OS with the read-only mount will still do
something potentially damaging to the filesystem at mount time, something
along the lines of replaying the xfs journal with the read-write OS being
unaware of it.
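
If you do go ahead with it anyway, one hedge (assuming your kernel's rbd and
XFS versions support these options) is to make sure the read-only host can
never touch the log at all:

rbd map --read-only rbd/share-image            # image name is a placeholder
mount -o ro,norecovery /dev/rbd0 /mnt/share    # norecovery skips XFS log replay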

Dig through the ceph mailing list archives.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tgt / rbd performance

2014-12-13 Thread Jake Young
On Friday, December 12, 2014, Mike Christie mchri...@redhat.com wrote:

 On 12/11/2014 11:39 AM, ano nym wrote:
 
  there is a ceph pool on a hp dl360g5 with 25 sas 10k (sda-sdy) on a
  msa70 which gives me about 600 MB/s continuous write speed with rados
  write bench. tgt on the server with rbd backend uses this pool. mounting
  local(host) with iscsiadm, sdz is the virtual iscsi device. As you can
  see, sdz max out with 100%util at ~55MB/s when writing to it.
 
  I know that tgt-rbd is more a proof-of-concept than production-ready.
 
  Anyway, is someone using it and/or are there any hints to speed it up?
 

 Increasing the tgt nr_threads setting helps. Try 64 or 128.


Do you just add this to the targets.conf?

nr_threads 128



I have seen my tgt implementation give me very good performance. Much
better than lio and kernel rbd for certain workloads.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant osd problems - loss of IO

2014-12-06 Thread Jake Young
Forgot to copy the list.

I basically cobbled together the settings from examples on the internet.

 I basically modified this sysctl.conf file with his suggestion for 10gb
 nics
 http://www.nateware.com/linux-network-tuning-for-2013.html#.VIG_44eLTII

 I found these sites helpful as well:

 http://fasterdata.es.net/host-tuning/linux/

 This may be of interest to you, it has suggestions for your Mellanox
 hardware:
 https://fasterdata.es.net/host-tuning/nic-tuning/mellanox-connectx-3/

 Fermilab website, link to university research paper

 https://indico.fnal.gov/getFile.py/access?contribId=30sessionId=19resId=0materialId=paperconfId=3377

 This has a great answer that explains different configurations for servers
 vs clients.  It seems to me that osds are both servers and clients, so
 maybe some of the client tuning would benefit osds as well.  This is where
 I got the somaxconn setting from.

 http://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux


 I forgot to mention, I'm also setting the txqueuelen for my ceph public
 nic and ceph private nic in the /etc/rc.local file:
 /sbin/ifconfig eth0 txqueuelen 1
 /sbin/ifconfig eth1 txqueuelen 1



 I do push the same sysctl.conf and rc.local to all of my clients as well.
 The clients are iSCSI servers which serve vmware hosts.  My ceph cluster is
 rbd only and I currently only have the iSCSI proxy server clients.  We'll
 be adding some KVM hypervisors soon, I'm interested to see how they perform
 vs my vmware -- iSCSI Server -- Ceph setup.


 Regarding your sysctl.conf file:

 I've read on a few different sites that net.ipv4.tcp_mem should not be
 tuned, since the defaults are good.  I have not set it, and I can't speak
 to the benefit/problems with setting it.

 You're configured to only use a 4MB TCP buffer, which is very small.  It
 is actually smaller than the defaults for tcp_wmem, which is 6MB.  The link
 above suggests up to a 128MB TCP buffer for the 40gb Mellanox and/or 10gb
 over a WAN (not sure how to read that).  I'm using a 54MB buffer, but I may
 increase mine to 128MB to see if there is any benefit.  That 4MB buffer may
 be your problem.

 Your net.core.netdev_max_backlog is 5x bigger than mine.  I think I'll
 increase my setting to 25 as well.
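
 For what it's worth, a 128MB ceiling along the lines of those guides would
 look something like this in sysctl.conf (values illustrative only):

 net.core.rmem_max = 134217728
 net.core.wmem_max = 134217728
 net.ipv4.tcp_rmem = 4096 87380 134217728
 net.ipv4.tcp_wmem = 4096 65536 134217728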

 Our issue looks like http://tracker.ceph.com/issues/9844 and my crash
 looks like http://tracker.ceph.com/issues/9788



 On Fri, Dec 5, 2014 at 5:35 AM, Andrei Mikhailovsky and...@arhont.com wrote:

 Jake,

 very usefull indeed.

 It looks like I had a similar problem regarding the heartbeat and, as you
 have mentioned, I've not seen such issues on Firefly. However, I've not
 seen any osd crashes.



 Could you please let me know where you got the sysctl.conf tunings from?
 Were they recommended by the network vendor?

 Also, did you make similar sysctl.conf changes to your host servers?

 A while ago I read the tuning guide for IP over InfiniBand, and
 Mellanox recommends setting something like this:

 net.ipv4.tcp_timestamps = 0
 net.ipv4.tcp_sack = 1
 net.core.netdev_max_backlog = 25
 net.core.rmem_max = 4194304
 net.core.wmem_max = 4194304
 net.core.rmem_default = 4194304
 net.core.wmem_default = 4194304
 net.core.optmem_max = 4194304
 net.ipv4.tcp_rmem = 4096 87380 4194304
 net.ipv4.tcp_wmem = 4096 65536 4194304
 net.ipv4.tcp_mem =4194304 4194304 4194304
 net.ipv4.tcp_low_latency=1


 which is what I have. Not sure if these are optimal.

 I can see that the values are pretty conservative compared to yours. I
 guess my values should be different as I am running a 40gbit/s network with
 ipoib. The actual throughput on ipoib is about 20gbit/s according to iperf
 and the like.

 Andrei


 --

 *From: *Jake Young jak3...@gmail.com
 *To: *Andrei Mikhailovsky and...@arhont.com
 *Cc: *ceph-users@lists.ceph.com
 *Sent: *Thursday, 4 December, 2014 4:57:47 PM
 *Subject: *Re: [ceph-users] Giant osd problems - loss of IO



 On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky and...@arhont.com wrote:
 
  Any other suggestions why several osds are going down on Giant and
 causing IO to stall? This was not happening on Firefly.
 
  Thanks
 
 

 I had a very similar problem to yours which started after upgrading from
 Firefly to Giant, and then later I added two new osd nodes, with 7 osds on
 each.

 My cluster originally had 4 nodes, with 7 osds on each node, 28 osds
 total, running Giant.  I did not have any problems at this time.

 My problems started after adding two new nodes, so I had 6 nodes and 42
 total osds.  It would run fine on low load, but when the request load
 increased, osds started to fall over.


 I was able to set the debug_ms to 10 and capture

Re: [ceph-users] running as non-root

2014-12-06 Thread Jake Young
On Saturday, December 6, 2014, Sage Weil sw...@redhat.com wrote:

 While we are on the subject of init systems and packaging, I would *love*
 to fix things up for hammer to

  - create a ceph user and group
  - add various users to ceph group (like qemu or kvm user and
 apache/www-data?)


Maybe a calamari user too

 - fix permissions on /var/log/ceph and /var/run/ceph (770?) so that qemu
 and rgw can write logs and asok files there


Yes

 - make daemons run as ceph user instead of root


I think this is the right approach



 The main hangup is with that last one.  As I understand it, when packages
 create users, they get a semi-random UID assigned.  That means that all
 the data on a ceph-osd disk would have a semi-random UID.  If it were
 hot-swapped into another host, the uid would be wrong.  Is there a way
 to use a fixed uid?


There's no guarantee that any given uid will be available across any two
unix systems. You could pick 6789 or something uncommon, but I'm sure
someone somewhere is using any given uid.

I would take the approach that the uid shouldn't matter. Add a standard
tool to assist with osd hot swaps that would change the file permissions on
the new osd disk.  I think the osd hot swap process requires some manual
intervention anyway. The only downside is the tool would need to be run
with root permissions.
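
Something along these lines is what I have in mind; the device, osd id and
paths are purely illustrative:

# re-own a hot-swapped osd disk to the local ceph user before starting it
mount /dev/sdX1 /var/lib/ceph/osd/ceph-42
chown -R ceph:ceph /var/lib/ceph/osd/ceph-42
service ceph start osd.42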

I haven't tried moving an osd disk from one node to another. Can someone
describe the process?


 Also on the roadmap is defining proper selinux policies so that these
 daemons are confined into the appropriate directories etc., but I imagine
 running as non-root is a big help (or even prerequisite?) to making that
 happen?

 Suggestions or comments?  Or volunteers?  We haven't had time to look at
 this yet but I think it's important!

 sage

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant osd problems - loss of IO

2014-12-04 Thread Jake Young
On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky and...@arhont.com
wrote:

 Any other suggestions why several osds are going down on Giant and
causing IO to stall? This was not happening on Firefly.

 Thanks



I had a very similar problem to yours which started after upgrading from
Firefly to Giant, and then later I added two new osd nodes, with 7 osds on
each.

My cluster originally had 4 nodes, with 7 osds on each node, 28 osds total,
running Giant.  I did not have any problems at this time.

My problems started after adding two new nodes, so I had 6 nodes and 42
total osds.  It would run fine on low load, but when the request load
increased, osds started to fall over.


I was able to set the debug_ms to 10 and capture the logs from a failed
OSD.  There were a few different reasons the osds were going down.  This
example shows it terminating normally for an unspecified reason a minute
after it notices it is marked down in the map.

Osd 25 actually marks this osd (osd 35) down.  For some reason many osds
cannot communicate with each other.

There are other examples where I see the heartbeat_check: no reply from
osd.blah message for long periods of time (hours) and neither osd crashes
or terminates.

2014-12-01 16:27:06.772616 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:06.056972 (cutoff 2014-12-01 16:26:46.772608)
2014-12-01 16:27:07.772767 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:06.056972 (cutoff 2014-12-01 16:26:47.772759)
2014-12-01 16:27:08.772990 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:06.056972 (cutoff 2014-12-01 16:26:48.772982)
2014-12-01 16:27:09.559894 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:06.056972 (cutoff 2014-12-01 16:26:49.559891)
2014-12-01 16:27:09.773177 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:09.559087 (cutoff 2014-12-01 16:26:49.773173)
2014-12-01 16:27:10.773307 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:09.559087 (cutoff 2014-12-01 16:26:50.773299)
2014-12-01 16:27:11.261557 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:09.559087 (cutoff 2014-12-01 16:26:51.261554)
2014-12-01 16:27:11.773512 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:11.260129 (cutoff 2014-12-01 16:26:51.773504)
2014-12-01 16:27:12.773741 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:11.260129 (cutoff 2014-12-01 16:26:52.773733)
2014-12-01 16:27:13.773884 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:11.260129 (cutoff 2014-12-01 16:26:53.773876)
2014-12-01 16:27:14.163369 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no
reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
16:27:11.260129 (cutoff 2014-12-01 16:26:54.163366)
2014-12-01 16:27:14.507632 7f8b4fb7f700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.5:6802/2755 pipe(0x2af06940 sd=57 :51521 s=2 pgs=384 cs=1 l=0
c=0x2af094a0).fault with nothing to send, going to standby
2014-12-01 16:27:14.511704 7f8b37af1700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.2:6812/34015988 pipe(0x2af06c00 sd=69 :41512 s=2 pgs=38842 cs=1 l=0
c=0x2af09600).fault with nothing to send, going to standby
2014-12-01 16:27:14.511966 7f8b5030c700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.4:6802/40022302 pipe(0x30cbcdc0 sd=93 :6802 s=2 pgs=66722 cs=3 l=0
c=0x2af091e0).fault with nothing to send, going to standby
2014-12-01 16:27:14.514744 7f8b548a5700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.2:6800/9016639 pipe(0x2af04dc0 sd=38 :60965 s=2 pgs=11747 cs=1 l=0
c=0x2af086e0).fault with nothing to send, going to standby
2014-12-01 16:27:14.516712 7f8b349c7700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.2:6802/25277 pipe(0x2b04cc00 sd=166 :6802 s=2 pgs=62 cs=1 l=0
c=0x2b043080).fault with nothing to send, going to standby
2014-12-01 16:27:14.516814 7f8b2bd3b700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.4:6804/16770 pipe(0x30cbd600 sd=79 :6802 s=2 pgs=607 cs=3 l=0
c=0x2af08c60).fault with nothing to send, going to standby
2014-12-01 16:27:14.518439 7f8b2a422700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.5:6806/31172 pipe(0x30cbc840 sd=28 :6802 s=2 pgs=22 cs=1 l=0
c=0x3041f5a0).fault with nothing to send, going to standby
2014-12-01 16:27:14.518883 7f8b589ba700  0 -- 172.1.2.6:6802/5210 >>
172.1.2.1:6803/4031631 pipe(0x2af042c0 sd=32 :58296 s=2 pgs=35500 cs=3 l=0
c=0x2af08160).fault with nothing to 

Re: [ceph-users] Admin Node Best Practices

2014-10-31 Thread Jake Young
On Friday, October 31, 2014, Massimiliano Cuttini m...@phoenixweb.it wrote:

  Any hint?


 On 30/10/2014 15:22, Massimiliano Cuttini wrote:

 Dear Ceph users,

 I just received 2 fresh new servers and I'm starting to build my Ceph
 cluster.
 The first step is to create the admin node in order to control the whole
 cluster remotely.
 I have a big cluster of XEN servers and I'll set up a new VM there just for
 this.
 I need some info:
 1) As far as I know the admin node is only needed for deployment; it
 doesn't run any kind of service. Is that so, or did I miss something?
 2) All my servers for the OSD nodes will be CentOS 7. Do I need to set up
 the admin node with the same OS, or can I mix?
 3) Can I delete the admin node in the future and recreate it whenever I
 need it, or is there some unique information (such as keys) that I
 always need to preserve?
 4) Is it useful to have more than one admin node, or is that completely
 unnecessary?
 5) Do you have some best practices to share? :)

 Thanks,
 Max


 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



I have one VM with minimal CPU and memory provisioned that I used to deploy
my ceph cluster. It doesn't run any ceph services, but I do use it to
monitor and troubleshoot the cluster.

It is connected to the ceph public network (which is a non-routable
network) and to my corporate network.

Most of the ceph cluster is not on the corporate network, so I use this VM
as a jumpbox to get into the rest of the network.

I use the same OS (Ubuntu 12.04) on the deploy VM as all my
other ceph servers/VMs.   This is nice so I can test new packages and
kernels on this VM first. That way I don't take down the ceph cluster when
a kernel upgrade goes bad. I just tested 3.18rc2 and found it made my VM
unbootable.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PERC H710 raid card

2014-07-17 Thread Jake Young
There are two command line tools for Linux for LSI cards: megacli and
storcli

You can do pretty much everything from those tools.
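
For example, listing the drives and carving a single-disk RAID-0 virtual
drive looks roughly like the following (controller 0 and enclosure:slot
32:4 are placeholders, and option spellings vary a bit between tool
versions, so treat this as a sketch rather than copy/paste):

    # storcli
    storcli /c0 show                                  # controller, enclosures, drives
    storcli /c0 add vd type=raid0 drives=32:4 wb ra cached

    # megacli equivalent
    MegaCli64 -PDList -aALL                           # list physical drives
    MegaCli64 -CfgLdAdd -r0 [32:4] WB RA Cached -a0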

Jake

On Thursday, July 17, 2014, Dennis Kramer (DT) den...@holmes.nl wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Hi,

 What do you recommend in case of a disk failure in this kind of
 configuration? Are you bringing down the host when you replace the
 disk and re-create the raid-0 for the replaced disk? I reckon that
 linux doesn't automatically get the disk replacement either...

 Dennis

 On 07/16/2014 11:02 PM, Shain Miley wrote:
  Robert, We use those cards here in our Dell R-720 servers.
 
  We just ended up creating a bunch of single disk RAID-0 units,
  since there was no jbod option available.
 
  Shain
 
 
  On 07/16/2014 04:55 PM, Robert Fantini wrote:
  I've 2 dell systems with PERC H710 raid cards. Those are very
  good end cards , but do not support jbod .
 
  They support raid 0, 1, 5, 6, 10, 50, 60 .
 
  lspci shows them as:  LSI Logic / Symbios Logic MegaRAID SAS 2208
   [Thunderbolt] (rev 05)
 
  The firmware Dell uses on the card does not support jbod.
 
  My question is how can this be best used for Ceph? Or should it
  not be used?
 
 
 
 
 
  ___ ceph-users
  mailing list ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
  -- Shain Miley | Manager of Systems and Infrastructure, Digital
  Media | smi...@npr.org | 202.513.3649
 


 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.15 (GNU/Linux)
 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

 iEYEARECAAYFAlPHZ1MACgkQiJDTKUBxIRusogCeJ+jnADW/KBoQAxnDSz62yT3P
 FNoAnin3A52AqiA+KlFJQoc5bdQRoyYe
 =/MPE
 -END PGP SIGNATURE-
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-24 Thread Jake Young
On Mon, Jun 23, 2014 at 3:03 PM, Mark Nelson mark.nel...@inktank.com
wrote:

 Well, for random IO you often can't do much coalescing.  You have to bite
 the bullet and either parallelize things or reduce per-op latency.  Ceph
 already handles parallelism very well.  You just throw more disks at the
 problem and so long as there are enough client requests it more or less
 just scales (limited by things like network bisection bandwidth or other
 complications).  On the latency side, spinning disks aren't fast enough for
 Ceph's extra latency overhead to matter much, but with SSDs the story is
 different.  That's why we are very interested in reducing latency.

 Regarding journals:  Journal writes are always sequential (even for random
 IO!), but are O_DIRECT so they'll skip linux buffer cache.  If you have
 hardware that is fast at writing sequential small IO (say a controller with
 WB cache or an SSD), you can do journal writes very quickly.  For bursts of
 small random IO, performance can be quite good.  The downsides is that you
 can hit journal limits very quickly, meaning you have to flush and wait for
 the underlying filestore to catch up. This results in performance that
 starts out super fast, then stalls once the journal limits are hit, back to
 super fast again for a bit, then another stall, etc.  This is less than
 ideal given the way crush distributes data across OSDs.  The alternative is
 setting a soft limit on how much data is in the journal and flushing
 smaller amounts of data more quickly to limit the spikey behaviour.  On the
 whole, that can be good but limits the burst potential and also limits the
 amount of data that could potentially be coalesced in the journal.


Mark,

What settings are you suggesting for setting a soft limit on journal size
and flushing smaller amounts of data?

Something like this?
filestore_queue_max_bytes: 10485760
filestore_queue_committing_max_bytes: 10485760
journal_max_write_bytes: 10485760
journal_queue_max_bytes: 10485760
ms_dispatch_throttle_bytes: 10485760
objecter_inflight_op_bytes: 10485760

(see Small bytes in
http://ceph.com/community/ceph-bobtail-jbod-performance-tuning)



 Luckily with RBD you can (when applicable) coalesce on the client with RBD
 cache instead, which is arguably better anyway since you can send bigger
 IOs to the OSDs earlier in the write path.  So long as you are ok with what
 RBD cache does and does not guarantee, it's definitely worth enabling imho.
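
For what it's worth, enabling the RBD cache is just a client-side ceph.conf
setting; roughly the following (the sizes shown are illustrative and close
to the defaults):

    [client]
        rbd cache = true
        rbd cache size = 33554432                  # 32 MB
        rbd cache max dirty = 25165824             # 24 MB
        rbd cache writethrough until flush = true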


Thanks,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving Ceph cluster to different network segment

2014-06-13 Thread Jake Young
I recently changed IP and hostname of an osd node running dumpling and had
no problems.

You do need to have your ceph.conf file built correctly or your osds won't
start. Make sure the new IPs and new hostname are in there before you
change the IP.
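
A minimal sketch of the kind of entries I mean, for a cluster whose
ceph.conf carries per-daemon sections (hostnames and addresses below are
made up, and clusters deployed differently may not need per-OSD sections
at all):

    [osd.12]
        host = cephnode1-n
        public addr = 10.1.1.21
        cluster addr = 10.2.1.21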

The crushmap showed a new bucket (host name) containing the osds that were
moved and the original bucket remained in the crushmap, but with no
children. I was able to unlink the original bucket with no problem.
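
For reference, that clean-up is a one-liner (the hostname is just the
example from this thread):

    ceph osd crush unlink cephnode1
    # or, once the bucket is empty and you want it out of the map entirely:
    ceph osd crush remove cephnode1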

Jake

On Friday, June 13, 2014, Fred Yang frederic.y...@gmail.com wrote:

 Wido,
 So does the cluster reference an osd based on the hostname, or on the
 GUID (hopefully)? Note that, as I mentioned in the original email, the
 hostname associated with the IP will also change; will it still be as
 simple as changing the IP and restarting the osd? I remember I tested this
 in Dumpling a while ago and it didn't work; this cluster is running on
 Emperor and I'm not sure whether that will make any difference.

 Fred
 On Jun 13, 2014 7:51 AM, Wido den Hollander w...@42on.com wrote:

 On 06/13/2014 01:41 PM, Fred Yang wrote:

 Thanks, John.

 That seems like it will take care of the monitors, but how about the osds?
 Any idea how to change IP addresses without triggering a resync?


 The IPs of OSDs are dynamic. Their IP is not part of the data distribution.
 Simply renumber them and restart the daemon.

 I suggest:

 1. Stop OSD(s)
 2. Renumber machine
 3. Start OSD(s)

 That should be all. There will be some recovery due to I/Os which
 occurred between 1 and 3.
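
 In practice, steps 1 and 3 are just the usual init scripts; for example
 (the osd id and init system are assumptions, sysvinit shown first and
 upstart second):

     service ceph stop osd.3        # or: stop ceph-osd id=3
     # ... renumber the host's interfaces, update DNS/ceph.conf ...
     service ceph start osd.3       # or: start ceph-osd id=3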

 Wido

  Fred

 Sent from my Samsung Galaxy S3

 On Jun 12, 2014 1:21 PM, John Wilkins john.wilk...@inktank.com wrote:

 Fred,

 I'm not sure it will completely answer your question, but I would
 definitely have a look at:
 http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address

 There are some important steps in there for monitors.


 On Wed, Jun 11, 2014 at 12:08 PM, Fred Yang frederic.y...@gmail.com wrote:

 We need to move the Ceph cluster to a different network segment for
 interconnectivity between the mons and osds. Does anybody have a
 procedure for how that can be done? Note that the hostname references
 will change as well: an osd host originally referenced as cephnode1
 will be referenced as cephnode1-n in the new segment.

 Thanks,
 Fred

 Sent from my Samsung Galaxy S3


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --
 John Wilkins
 Senior Technical Writer
 Inktank
 john.wilk...@inktank.com
 (415) 425-9599
 http://inktank.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with VMWare / XenServer

2014-05-12 Thread Jake Young
Hello Andrei,

I'm trying to accomplish the same thing with VMWare. So far I'm still doing
lab testing, but we've gotten as far as simulating a production workload.
 Forgive the lengthy reply, I happen to be sitting on an airplane.

My existing solution is using NFS servers running in ESXi VMs. Each VM
serves one or two large (2-4 TB) rbd images. These images are for vmdk
storage as well as oracle RAC disks.

I tested using multiple NFS servers serving a single rbd, but kept on
seeing xfs corruption (which was recoverable with xfs_repair). I initially
blamed ceph, but eventually realized that the problem is actually with xfs;
well in fact, the problem was with my configuration. It is generally a very
bad idea to write to the same xfs file system from two separate computers,
whether it is to a ceph rbd or to a physical disk in a shared disk array.
What would be required would be a way to synchronize writes between the
servers mounting the rbd. There are protocols available to do this, but all
of them would introduce more latency, which I'm already struggling to
control.

My environment is all Cisco UCS hardware. C240 rack mount servers for OSDs
and B200 blade servers for VMWare ESXi. The entire network is 10Gb or
better.  After carefully examining my nfs servers (which are VMs running in
ESXi on local storage), I found that I had a tremendous amount of kernel
IO. This was because of the high volume of TCP packets it had to constantly
process for BOTH the NFS traffic and the ceph traffic.

One thing that helped was to enable jumbo frames on every device in the
path from ESXi to the OSDs. This is not as simple as it sounds. In ESXi,
the vmk port and the vSwitch the vmk is on must have the mtu set to 9000.
In the switches, the VLANs and the interfaces need to have the mtu set to
9128 (don't forget about vlan tagging overhead). In the UCSM (Cisco GUI for
configuring the Blades and networking), all the vnics and the qos policies
must be set to 9000. The Linux interfaces in the nfs servers, mons, and
osds all needed to be set to 9000 as well.
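
On the Linux side that is the usual MTU change, for example (the interface
name is a placeholder, and on Ubuntu 12.04 it also has to go into
/etc/network/interfaces to survive a reboot):

    ip link set dev eth1 mtu 9000
    ping -M do -s 8972 <peer-ip>       # verify jumbo frames end to end

    # /etc/network/interfaces
    iface eth1 inet static
        ...
        mtu 9000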

My kernel io was still high, so I just gave the NFS VM more vCPUs (8
vCPUs, 8 GB RAM).  This helped as well.

With that all in place, my lab environment is doing a sustained 200 iops
bursting up to 500 iops (from VMWare's perspective) on one NFS server VM.
The IO is mostly small writes. My lab cluster just has 11 osds in a single
node.  I have 3x replication as well, so the cluster is actually doing more
like 600 - 1400 iops. The osds have an LSI 2208 controller (2GB cache) with
each disk in separate single disk RAID1 virtual drives (necessary to take
advantage of the write back cache). The OSDs have no separate journal;
which means the disks are actually writing at 1200 - 2800 iops (journal +
data). Not bad for one node with 11x 7k disks.

I still have high latency (though it is much better than before enabling
jumbo frames). VMWare shows between 10,000 microseconds and 200,000
microseconds of latency.  That is acceptable for this application.  IO is
mostly asynchronous: alarming/logging writes, database updates. I don't
notice the latency on the VMs running in the ceph-NFS datastore.

I believe the latency is actually from the osd node being pretty much maxed
out. I have 4 more osd servers on order to hopefully smooth out the latency
spikes.


One huge problem with the NFS server gateway approach is that you have many
layers of file systems that are introduced in each OS. My current
solution's file system stack looks like this:

ext4 - VMs file systems
VMFS - ESXi
NFS - between ESXi and nfs server
XFS - NFS server to mounted rbd disk
Rados - NFS server ceph kernel client to OSDs
XFS - OSDs to local file system

Yuck!  Four journaling file systems to write through: VMFS, XFS, OSD, XFS.


Clearly the best approach would be for the VMs to directly access the ceph
cluster:

ext4 - VMs file systems
Rados - VM ceph kernel client to OSDs
XFS - OSDs to local file system

Due to the packaging/deployment procedure of my application (and the
ancient RHEL 5 kernel), that won't be possible any time soon. The
application will be migrated to openstack, off of VMWare, first.

Since I'm using UCS hardware, there is native FCoE built in (with FC frame
offload and I can even boot off of FCoE); I am going to build a pair
of fiber channel gateways to replace the NFS server. The filesystem
stack will look like this:

ext4 - VMs file systems
VMFS - ESXi
FC - between UCS vHBA and FC Target
Rados - FC target via LIO, ceph kernel client to OSDs
XFS - OSDs to local file system
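
The gateway half of that stack boils down to mapping the image with the
kernel client and handing the block device to LIO. A bare-bones sketch
(pool, image and backstore names are placeholders, and the FC fabric
configuration is omitted entirely):

    rbd map vmware/datastore1
    # the device shows up as /dev/rbd0 (or /dev/rbd/vmware/datastore1)
    targetcli /backstores/block create name=datastore1 dev=/dev/rbd0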

I had some issues with getting a B200 blade to work in FC target mode (it
was only designed to be an initiator), so I'll have to use a C240 in
independent mode connected to a nexus 5k switch.

As an alternative (while I wait for my new osd nodes and nexus switches to
arrive), I was interested in trying tgt with fcoe. I've seen some negative
performance reports due to using userland ceph client vs kernel client.
More 

Re: [ceph-users] Manually mucked up pg, need help fixing

2014-05-05 Thread Jake Young
I was in a similar situation where I could see the PGs data on an osd, but
there was nothing I could do to force the pg to use that osd's copy.

I ended up using the rbd_restore tool to create my rbd on disk and then I
reimported it into the pool.

See this thread for info on rbd_restore:
http://www.spinics.net/lists/ceph-devel/msg11552.html

Of course, you have to copy all of the pieces of the rbd image on one file
system somewhere (thank goodness for thin provisioning!) for the tool to
work.
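
The re-import step at the end is just rbd import (file, pool and image
names below are placeholders):

    rbd import /mnt/recovery/myimage.img rbd/myimage
    rbd info rbd/myimage       # sanity-check size and format afterwards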

There really should be a better way.

Jake

On Monday, May 5, 2014, Jeff Bachtel jbach...@bericotechnologies.com
wrote:

 Well, that'd be the ideal solution. Please check out the github gist I
 posted, though. It seems that despite osd.4 having nothing good for pg
 0.2f, the cluster does not acknowledge any other osd has a copy of the pg.
 I've tried downing osd.4 and manually deleting the pg directory in question
 with the hope that the cluster would roll back epochs for 0.2f, but all it
 does is recreate the pg directory (empty) on osd.4.

 Jeff

 On 05/05/2014 04:33 PM, Gregory Farnum wrote:

 What's your cluster look like? I wonder if you can just remove the bad
 PG from osd.4 and let it recover from the existing osd.1
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Sat, May 3, 2014 at 9:17 AM, Jeff Bachtel
 jbach...@bericotechnologies.com wrote:

 This is all on firefly rc1 on CentOS 6

 I had an osd getting overfull, and misinterpreting directions I downed it
 then manually removed pg directories from the osd mount. On restart and
 after a good deal of rebalancing (setting osd weights as I should've
 originally), I'm now at

  cluster de10594a-0737-4f34-a926-58dc9254f95f
   health HEALTH_WARN 2 pgs backfill; 1 pgs incomplete; 1 pgs stuck
 inactive; 308 pgs stuck unclean; recovery 1/2420563 objects degraded
 (0.000%); noout flag(s) set
   monmap e7: 3 mons at {controller1=10.100.2.1:6789/0,
 controller2=10.100.2.2:6789/0,controller3=10.100.2.3:6789/0},
 election epoch 556, quorum 0,1,2 controller1,controller2,controller3
   mdsmap e268: 1/1/1 up {0=controller1=up:active}
   osdmap e3492: 5 osds: 5 up, 5 in
  flags noout
    pgmap v4167420: 320 pgs, 15 pools, 4811 GB data, 1181 kobjects
      9770 GB used, 5884 GB / 15654 GB avail
      1/2420563 objects degraded (0.000%)
             3 active
            12 active+clean
             2 active+remapped+wait_backfill
             1 incomplete
           302 active+remapped
   client io 364 B/s wr, 0 op/s

 # ceph pg dump | grep 0.2f
 dumped all in format plain
 0.2f    0   0   0   0   0   0   0 incomplete
 2014-05-03 11:38:01.526832 0'0  3492:23 [4] 4   [4] 4
 2254'20053  2014-04-28 00:24:36.504086  2100'18109 2014-04-26
 22:26:23.699330

 # ceph pg map 0.2f
 osdmap e3492 pg 0.2f (0.2f) - up [4] acting [4]

 The pg query for the downed pg is at
 https://gist.github.com/jeffb-bt/c8730899ff002070b325

 Of course, the osd I manually mucked with is the only one the cluster is
 picking up as up/acting. Now, I can query the pg and find epochs where
 other
 osds (that I didn't jack up) were acting. And in fact, the latest of
 those
 entries (osd.1) has the pg directory in its osd mount, and it's a good
 healthy 59gb.

 I've tried manually rsync'ing (and preserving attributes) that set of
 directories from osd.1 to osd.4 without success. Likewise I've tried
 copying
 the directories over without attributes set. I've done many, many deep
 scrubs but the pg query does not show the scrub timestamps being
 affected.

 I'm seeking ideas for either fixing metadata on the directory on osd.4 to
 cause this pg to be seen/recognized, or ideas on forcing the cluster's pg
 map to point to osd.1 for the incomplete pg (basically wiping out the
 cluster's memory that osd.4 ever had 0.2f). Or any other solution :) It's
 only 59g, so worst case I'll mark it lost and recreate the pg, but I'd
 prefer to learn enough of the innards to understand what is going on, and
 possible means of fixing it.

 Thanks for any help,

 Jeff

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot create a file system on the RBD

2014-04-08 Thread Jake Young
Maybe different kernel versions between the box that can format and the box
that can't.

When you created the rbd image, was it format 1 or 2?
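
If you're not sure, rbd info will tell you; for example, using the pool and
image names from your device path:

    rbd info pool/server1      # look for the "format:" line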

Jake

On Thursday, April 3, 2014, Thorvald Hallvardsson 
thorvald.hallvards...@gmail.com wrote:

 Hi,

 I have found that the problem is somewhere within the pool itself. I
 created another pool and created an RBD within the new pool and it worked
 fine.

 Can anyone point me to how I can find the problem with the pool, and why
 any RBD assigned to it fails to be formatted?

 Thank you.


 On 3 April 2014 13:51, Thorvald Hallvardsson 
 thorvald.hallvards...@gmail.com wrote:

 Hi guys,

 I have got a problem. I created a new 1TB RBD device and mapped it on the
 box. I tried to create a file system on that device, but it failed:

 root@export01:~# mkfs.ext4 /dev/rbd/pool/server1
 mke2fs 1.42 (29-Nov-2011)
 Filesystem label=
 OS type: Linux
 Block size=4096 (log=2)
 Fragment size=4096 (log=2)
 Stride=1024 blocks, Stripe width=1024 blocks
 64004096 inodes, 25600 blocks
 1280 blocks (5.00%) reserved for the super user
 First data block=0
 Maximum filesystem blocks=4294967296
 7813 block groups
 32768 blocks per group, 32768 fragments per group
 8192 inodes per group
 Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
 10240, 214990848

 Allocating group tables: done
 Writing inode tables: done
 Creating journal (32768 blocks): done
 Writing superblocks and filesystem accounting information:3/7813
 Warning, had trouble writing out superblocks.

 I tried XFS and it also failed:
 root@export01:~# mkfs.xfs /dev/rbd/pool/server1

 log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
 log stripe unit adjusted to 32KiB
 meta-data=/dev/rbd/pool/server1 isize=256    agcount=17, agsize=1599488 blks
          =                      sectsz=512   attr=2, projid32bit=0
 data     =                      bsize=4096   blocks=2560, imaxpct=25
          =                      sunit=1024   swidth=1024 blks
 naming   =version 2             bsize=4096   ascii-ci=0
 log      =internal log          bsize=4096   blocks=12504, version=2
          =                      sectsz=512   sunit=8 blks, lazy-count=1
 realtime =none                  extsz=4096   blocks=0, rtextents=0
 mkfs.xfs: pwrite64 failed: Input/output error

 No errors in any logs. Dmesg is shouting:
 [514937.022686] rbd: rbd22:   result -1 xferred 1000
 [514937.022742] rbd: rbd22: write 1000 at e6 (0)
 [514937.022744] rbd: rbd22:   result -1 xferred 1000
 [514937.034529] rbd: rbd22: write 1000 at f2 (0)
 [514937.034533] rbd: rbd22:   result -1 xferred 1000
 [514937.417367] rbd: rbd22: write 1000 at ca8000 (0)
 [514937.417373] rbd: rbd22:   result -1 xferred 1000
 [514937.417460] r


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No more Journals ?

2014-03-14 Thread Jake Young
You should take a look at this blog post:

http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/

The test results shows that using a RAID card with a write-back cache
without journal disks can perform better or equivalent to using journal
disks with XFS.

*As to whether or not it's better to buy expensive controllers and use all
of your drive bays for spinning disks or cheap controllers and use some
portion of your bays for SSDs/Journals, there are trade-offs.  If built
right, systems with SSD journals provide higher large block write
throughput, while putting journals on the data disks provides higher
storage density.  Without any tuning both solutions currently provide
similar IOP throughput*.

Jake


On Friday, March 14, 2014, Markus Goldberg goldb...@uni-hildesheim.de
wrote:

 Sorry,
 I should have asked a little more clearly:
 Can ceph (or OSDs) be used without journals now?
 The journal parameter seems to be optional (because of the '[...]')

 Markus
 Am 14.03.2014 12:19, schrieb John Spray:

 Journals have not gone anywhere, and ceph-deploy still supports
 specifying them with exactly the same syntax as before.

 The page you're looking at is the simplified quick start, the detail
 on osd creation including journals is here:
 http://eu.ceph.com/docs/v0.77/rados/deployment/ceph-deploy-osd/

 Cheers,
 John

 On Fri, Mar 14, 2014 at 9:47 AM, Markus Goldberg
 goldb...@uni-hildesheim.de wrote:

 Hi,
 I'm a little bit surprised. I read through the new manuals for 0.77
 (http://eu.ceph.com/docs/v0.77/start/quick-ceph-deploy/)
 In the section of creating the osd the manual says:

 Then, from your admin node, use ceph-deploy to prepare the OSDs.

 ceph-deploy osd prepare {ceph-node}:/path/to/directory

 For example:

 ceph-deploy osd prepare node2:/var/local/osd0 node3:/var/local/osd1

 Finally, activate the OSDs.

 ceph-deploy osd activate {ceph-node}:/path/to/directory

 For example:

 ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1


 In former versions the osd was created like:

 ceph-deploy -v --overwrite-conf osd --fs-type btrfs prepare
 bd-0:/dev/sdb:/dev/sda5

 ^^ Journal
 As I remember, defining and creating a journal for each osd was a must.

 So the question is: are journals obsolete now?

 --
 MfG,
Markus Goldberg

 
 --
 Markus Goldberg   Universität Hildesheim
Rechenzentrum
 Tel +49 5121 88392822 Marienburger Platz 22, D-31141 Hildesheim, Germany
 Fax +49 5121 88392823 email goldb...@uni-hildesheim.de
 
 --


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 MfG,
   Markus Goldberg

 --
 Markus Goldberg   Universität Hildesheim
   Rechenzentrum
 Tel +49 5121 88392822 Marienburger Platz 22, D-31141 Hildesheim, Germany
 Fax +49 5121 88392823 email goldb...@uni-hildesheim.de
 --


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Running a mon on a USB stick

2014-03-08 Thread Jake Young
I was planning to setup a small Ceph cluster with 5 nodes. Each node will
have 12 disks and run 12 osds.

I want to run 3 mons on 3 of the nodes. The servers have an internal SD
card that I'll use for the OS and an internal 16GB USB port that I want to
mount the mon files to.

From what I understand, the mons don't need much space.

Is there an issue with IO performance?

Thanks,

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com