Re: [ceph-users] RBD journaling benchmarks
On Thu, Jul 13, 2017 at 10:58 AM, Maged Mokhtar wrote:
> The case also applies to active/passive iSCSI.. you still have many
> initiators/hypervisors writing concurrently to the same rbd image using a
> clustered file system (csv/vmfs).

Except from that point-of-view, there is only a single RBD client -- the
active iSCSI target.

--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD journaling benchmarks
--
From: "Jason Dillaman" <jdill...@redhat.com>
Sent: Thursday, July 13, 2017 4:45 AM
To: "Maged Mokhtar" <mmokh...@petasan.org>
Cc: "Mohamad Gebai" <mge...@suse.com>; "ceph-users" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] RBD journaling benchmarks

> On Mon, Jul 10, 2017 at 3:41 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>> On 2017-07-10 20:06, Mohamad Gebai wrote:
>>
>> On 07/10/2017 01:51 PM, Jason Dillaman wrote:
>>
>> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>>
>> These are significant differences, to the point where it may not make sense
>> to use rbd journaling / mirroring unless there is only 1 active client.
>>
>> I interpreted the results as the same RBD image was being concurrently
>> used by two fio jobs -- which we strongly recommend against since it
>> will result in the exclusive-lock ping-ponging back and forth between
>> the two clients / jobs. Each fio RBD job should utilize its own
>> backing image to avoid such a scenario.
>>
>> That is correct. The single job runs are more representative of the
>> overhead of journaling only, and it is worth noting the (expected)
>> inefficiency of multiple clients for the same RBD image, as explained by
>> Jason.
>>
>> Mohamad
>>
>> Yes, I expected a penalty, but not one this large. There are some use cases
>> that would benefit from concurrent access to the same block device: in
>> VMware and Hyper-V, several hypervisors can share the same device,
>> formatted with a clustered file system like MS CSV (Cluster Shared Volumes)
>> or VMFS, which creates a volume/datastore that houses many VMs.
>
> Both of these use-cases would first need support for active/active
> iSCSI. While A/A iSCSI via MPIO is trivial to enable, getting it to
> properly handle failure conditions without the possibility of data
> corruption is not, since it relies heavily on arbitrary initiator and
> target-based timers. The only realistic and safe solution is to rely
> on an MCS-based active/active implementation.

The case also applies to active/passive iSCSI: you still have many
initiators/hypervisors writing concurrently to the same rbd image using a
clustered file system (CSV/VMFS).

>> I was wondering if such a setup could be supported in the future, and maybe
>> there could be a way to minimize the overhead of the exclusive lock -- for
>> example, by handing out a distributed sequence number to the different
>> active client writers and having each writer maintain its own journal. I
>> doubt the overhead would reach the values you showed.
>
> The journal used by the librbd mirroring feature was designed to
> support multiple concurrent writers. Of course, that original design
> was more in line with the goal of supporting multiple images within a
> consistency group.

Yes, but they would still suffer a performance penalty. My understanding is
that they would need the lock while writing the data to the journal entries,
and thus would be waiting turns -- or do they need the lock only for journal
metadata, such as generating a sequence number?

>> Maged
>
> --
> Jason
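The per-writer-journal scheme being discussed can be sketched roughly as
follows. This is a hypothetical illustration only (the class names and
structure are invented for the sketch, not taken from librbd): writers
contend only on a shared sequence counter, while entry data is appended to
each writer's own journal, and a global replay order is recovered afterwards
by merging on sequence number.

```python
import itertools
import threading

class SequenceAllocator:
    """Shared counter -- the only point of contention between writers."""
    def __init__(self):
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def next_seq(self):
        with self._lock:  # held only long enough to hand out a number
            return next(self._counter)

class JournalWriter:
    """One journal per active client; no shared lock held for entry data."""
    def __init__(self, writer_id, allocator):
        self.writer_id = writer_id
        self.allocator = allocator
        self.entries = []  # stand-in for this writer's journal objects

    def append(self, data):
        seq = self.allocator.next_seq()   # brief, shared critical section
        self.entries.append((seq, data))  # data written without a shared lock

def replay_order(writers):
    """Merge per-writer journals into one totally ordered event stream."""
    merged = [entry for w in writers for entry in w.entries]
    return sorted(merged)  # sequence numbers give the global order

alloc = SequenceAllocator()
a, b = JournalWriter("a", alloc), JournalWriter("b", alloc)
a.append("write-1"); b.append("write-2"); a.append("write-3")
print([seq for seq, _ in replay_order([a, b])])  # prints [0, 1, 2]
```

Whether this actually avoids the measured penalty would depend on how cheap
the sequence allocation can be made in a distributed setting, which is
exactly the open question in the exchange above.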
Re: [ceph-users] RBD journaling benchmarks
On Mon, Jul 10, 2017 at 3:41 PM, Maged Mokhtar wrote:
> On 2017-07-10 20:06, Mohamad Gebai wrote:
>
> On 07/10/2017 01:51 PM, Jason Dillaman wrote:
>
> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar wrote:
>
> These are significant differences, to the point where it may not make sense
> to use rbd journaling / mirroring unless there is only 1 active client.
>
> I interpreted the results as the same RBD image was being concurrently
> used by two fio jobs -- which we strongly recommend against since it
> will result in the exclusive-lock ping-ponging back and forth between
> the two clients / jobs. Each fio RBD job should utilize its own
> backing image to avoid such a scenario.
>
> That is correct. The single job runs are more representative of the
> overhead of journaling only, and it is worth noting the (expected)
> inefficiency of multiple clients for the same RBD image, as explained by
> Jason.
>
> Mohamad
>
> Yes, I expected a penalty, but not one this large. There are some use cases
> that would benefit from concurrent access to the same block device: in
> VMware and Hyper-V, several hypervisors can share the same device,
> formatted with a clustered file system like MS CSV (Cluster Shared Volumes)
> or VMFS, which creates a volume/datastore that houses many VMs.

Both of these use-cases would first need support for active/active
iSCSI. While A/A iSCSI via MPIO is trivial to enable, getting it to
properly handle failure conditions without the possibility of data
corruption is not, since it relies heavily on arbitrary initiator and
target-based timers. The only realistic and safe solution is to rely
on an MCS-based active/active implementation.
> I was wondering if such a setup could be supported in the future, and maybe
> there could be a way to minimize the overhead of the exclusive lock -- for
> example, by handing out a distributed sequence number to the different
> active client writers and having each writer maintain its own journal. I
> doubt the overhead would reach the values you showed.

The journal used by the librbd mirroring feature was designed to
support multiple concurrent writers. Of course, that original design
was more in line with the goal of supporting multiple images within a
consistency group.

> Maged

--
Jason
Re: [ceph-users] RBD journaling benchmarks
On 2017-07-10 20:06, Mohamad Gebai wrote:
> On 07/10/2017 01:51 PM, Jason Dillaman wrote:
>
> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar wrote:
>
> These are significant differences, to the point where it may not make sense
> to use rbd journaling / mirroring unless there is only 1 active client.
>
> I interpreted the results as the same RBD image was being concurrently
> used by two fio jobs -- which we strongly recommend against since it
> will result in the exclusive-lock ping-ponging back and forth between
> the two clients / jobs. Each fio RBD job should utilize its own
> backing image to avoid such a scenario.
>
> That is correct. The single job runs are more representative of the
> overhead of journaling only, and it is worth noting the (expected)
> inefficiency of multiple clients for the same RBD image, as explained by
> Jason.
>
> Mohamad

Yes, I expected a penalty, but not one this large. There are some use cases
that would benefit from concurrent access to the same block device: in
VMware and Hyper-V, several hypervisors can share the same device, formatted
with a clustered file system like MS CSV (Cluster Shared Volumes) or VMFS,
which creates a volume/datastore that houses many VMs.

I was wondering if such a setup could be supported in the future, and maybe
there could be a way to minimize the overhead of the exclusive lock -- for
example, by handing out a distributed sequence number to the different
active client writers and having each writer maintain its own journal. I
doubt the overhead would reach the values you showed.

Maged
Re: [ceph-users] RBD journaling benchmarks
On 07/10/2017 01:51 PM, Jason Dillaman wrote:
> On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar wrote:
>> These are significant differences, to the point where it may not make sense
>> to use rbd journaling / mirroring unless there is only 1 active client.
>
> I interpreted the results as the same RBD image was being concurrently
> used by two fio jobs -- which we strongly recommend against since it
> will result in the exclusive-lock ping-ponging back and forth between
> the two clients / jobs. Each fio RBD job should utilize its own
> backing image to avoid such a scenario.

That is correct. The single job runs are more representative of the
overhead of journaling only, and it is worth noting the (expected)
inefficiency of multiple clients for the same RBD image, as explained by
Jason.

Mohamad
Re: [ceph-users] RBD journaling benchmarks
On Mon, Jul 10, 2017 at 1:39 PM, Maged Mokhtar wrote:
> These are significant differences, to the point where it may not make sense
> to use rbd journaling / mirroring unless there is only 1 active client.

I interpreted the results as the same RBD image was being concurrently
used by two fio jobs -- which we strongly recommend against since it
will result in the exclusive-lock ping-ponging back and forth between
the two clients / jobs. Each fio RBD job should utilize its own
backing image to avoid such a scenario.

--
Jason
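As a rough illustration of why this ping-ponging is so costly, consider a toy
model in which the exclusive lock must be handed off between clients before
each batch of I/Os. The latency figures below are invented for illustration
only; they are not measured librbd numbers, and the real lock-transfer path
is more involved than a single fixed delay.

```python
# Toy model of exclusive-lock ping-pong. Assumed, illustrative latencies.
io_service_ms = 0.05    # time to service one I/O while holding the lock
lock_handoff_ms = 30.0  # assumed cost of transferring the lock between clients

def iops(ios_per_lock_hold):
    """Effective IOPS when the lock must move after every N I/Os."""
    batch_ms = ios_per_lock_hold * io_service_ms + lock_handoff_ms
    return ios_per_lock_hold / batch_ms * 1000.0

single_owner = 1000.0 / io_service_ms  # lock never moves: 20000 IOPS
ping_pong = iops(1)                    # lock moves on every I/O: ~33 IOPS
print(f"single owner: {single_owner:.0f} IOPS, ping-pong: {ping_pong:.0f} IOPS")
```

Even with these made-up numbers, a per-I/O handoff collapses throughput by
orders of magnitude, which is qualitatively what the two-job rows in the
benchmark show.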
Re: [ceph-users] RBD journaling benchmarks
On 2017-07-10 18:14, Mohamad Gebai wrote:
> Resending as my first try seems to have disappeared.
>
> Hi,
>
> We ran some benchmarks to assess the overhead caused by enabling
> client-side RBD journaling in Luminous. The tests consist of:
> - Create an image with journaling enabled (--image-feature journaling)
> - Run randread, randwrite and randrw workloads sequentially from a
>   single client using fio
> - Collect IOPS
>
> More info:
> - Feature exclusive-lock is enabled with journaling (required)
> - Queue depth of 128 for fio
> - With 1 and 2 threads
>
> Cluster 1
>
> - 5 OSD nodes
> - 6 OSDs per node
> - 3 monitors
> - All SSD
> - Bluestore + WAL
> - 10GbE NIC
> - Ceph version 12.0.3-1380-g6984d41b5d
>   (6984d41b5d142ce157216b6e757bcb547da2c7d2) luminous (dev)
>
> Results:
>
>        Default   Journaling          Jour width 32
> Jobs   IOPS      IOPS     Slowdown   IOPS     Slowdown
> RW
> 1      19521     9104     2.1x       16067    1.2x
> 2      30575     726      42.1x      488      62.6x
> Read
> 1      22775     22946    0.9x       23601    0.9x
> 2      35955     1078     33.3x      446      80.2x
> Write
> 1      18515     6054     3.0x       9765     1.9x
> 2      29586     1188     24.9x      534      55.4x
>
> - "Default" is the baseline (with journaling disabled)
> - "Journaling" is with journaling enabled
> - "Jour width 32" is with a journal data width of 32 objects
>   (--journal-splay-width 32)
> - The major slowdown for two jobs is due to locking
> - With a journal width of 32, the 0.9x slowdown (which is actually a
>   speedup) is due to the read-only workload, which doesn't exercise the
>   journaling code.
> - The randwrite workload exercises the journaling code the most, and is
>   expected to have the highest slowdown, which is 1.9x in this case.
> Cluster 2
>
> - 3 OSD nodes
> - 10 OSDs per node
> - 1 monitor
> - All HDD
> - Filestore
> - 10GbE NIC
> - Ceph version 12.1.0-289-g117b171715
>   (117b1717154e1236b2d37c405a86a9444cf7871d) luminous (dev)
>
> Results:
>
>        Default   Journaling          Jour width 32
> Jobs   IOPS      IOPS     Slowdown   IOPS     Slowdown
> RW
> 1      11869     3674     3.2x       4914     2.4x
> 2      13127     736      17.8x      432      30.4x
> Read
> 1      14500     14700    1.0x       14703    1.0x
> 2      16673     3893     4.3x       307      54.3x
> Write
> 1      8267      1925     4.3x       2591     3.2x
> 2      8283      1012     8.2x       417      19.9x
>
> - The number of IOPS for the write workload is quite low, which is due
>   to HDDs and filestore
>
> Mohamad

These are significant differences, to the point where it may not make sense
to use rbd journaling / mirroring unless there is only 1 active client.
Could there be a future enhancement that would make active/active possible?
Would it help if each active writer maintained its own queue and locked only
to obtain a sequence number / counter, to minimize the lock overhead of
writing to the same journal queue?

Maged
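For reference, the slowdown columns in both sets of results are simply the
ratio of baseline IOPS to IOPS with the feature enabled. A quick check
against a few randwrite and randrw rows from the numbers above:

```python
# Slowdown = baseline IOPS / IOPS with journaling enabled,
# using rows taken from the benchmark results quoted above.
def slowdown(baseline_iops, feature_iops):
    return baseline_iops / feature_iops

# Cluster 1 (SSD), randwrite, 2 jobs: 29586 -> 1188 IOPS (lock ping-pong)
print(f"{slowdown(29586, 1188):.1f}x")  # prints 24.9x
# Cluster 2 (HDD), randrw, 2 jobs: 13127 -> 736 IOPS
print(f"{slowdown(13127, 736):.1f}x")   # prints 17.8x
# Cluster 2 (HDD), randwrite, 2 jobs: 8283 -> 1012 IOPS
print(f"{slowdown(8283, 1012):.1f}x")   # prints 8.2x
```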