Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
I have been looking at the options for SSD caching for a bit now. Here is my take on the current options:

1) bcache - Seems to have lots of reliability issues mentioned on the mailing list, with little sign of improvement.

2) flashcache - Seems to be no longer (or only minimally) developed/maintained; instead folks are working on the fork enhanceio.

3) enhanceio - Fork of flashcache. Dropped the ability to skip caching on sequential writes, which many folks have claimed is important for Ceph OSD caching performance (see: https://github.com/stec-inc/EnhanceIO/issues/32).

4) LVM cache (dm-cache) - There is now a user-friendly way to use dm-cache, through LVM. It allows sequential writes to be skipped, but you need a pretty recent kernel (a rough command sketch follows below).

I am going to be trying out LVM cache on my own cluster in the next few weeks. I will share my results here on the mailing list. If anyone else has tried it out I would love to hear about it.

-Brendan
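The LVM plumbing for this is roughly as follows. This is a minimal sketch only, assuming an lvm2 build with cache support (around 2.02.106 or newer) and a dm-cache capable kernel; the device names, volume group name, and sizes are placeholders rather than tested values.

# /dev/sdb = spinning OSD disk, /dev/sdc1 = SSD partition reserved for the cache
pvcreate /dev/sdb /dev/sdc1
vgcreate vg-osd0 /dev/sdb /dev/sdc1

# Origin LV on the spinner, cache-data and cache-metadata LVs on the SSD
lvcreate -n osd0-data  -l 100%PVS vg-osd0 /dev/sdb
lvcreate -n osd0-cache -L 90G     vg-osd0 /dev/sdc1
lvcreate -n osd0-cmeta -L 1G      vg-osd0 /dev/sdc1

# Turn the two SSD LVs into a cache pool, then attach it to the origin
lvconvert --type cache-pool --cachemode writeback --poolmetadata vg-osd0/osd0-cmeta vg-osd0/osd0-cache
lvconvert --type cache --cachepool vg-osd0/osd0-cache vg-osd0/osd0-data

# The OSD filesystem then goes on /dev/vg-osd0/osd0-data as usual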
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
Just to add, the main reason it seems to make a difference is the metadata updates, which live on the actual OSD. When you are doing small block writes, these metadata updates seem to take almost as long as the actual data, so although the writes are getting coalesced, the actual performance isn't much better.

I did a blktrace a week ago, writing 500MB in 64k blocks to an OSD. You could see that the actual data was flushed to the OSD in a couple of seconds, while another 30 seconds was spent writing out metadata and doing EXT4/XFS journal writes.

Normally I have found flashcache to perform really poorly, as it does everything in 4kb blocks, meaning that when you start throwing larger blocks at it, it can actually slow things down. However, for the purpose of OSDs you can set the IO cutoff size limit to around 16-32kb, and then it should only cache the metadata updates.

I'm hoping to do some benchmarks before and after flashcache on an SSD-journaled OSD this week, so will post results when I have them.

Nick
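For reference, the knob being described here is flashcache's sequential-skip threshold. The sketch below is hypothetical: device names and the "osd0cache" cache name are placeholders, and the exact sysctl path depends on how the cache device is named on your build, so check `sysctl -a | grep flashcache` first.

# Create a writeback flashcache device over the OSD disk
# (/dev/sdc1 = SSD partition, /dev/sdb = OSD disk, "osd0cache" = cache name)
flashcache_create -p back osd0cache /dev/sdc1 /dev/sdb

# Don't cache sequential IO runs larger than 32kb, so big sequential writes
# go straight to the spinner while small (metadata-sized) IO still hits the SSD
sysctl -w dev.flashcache.osd0cache.skip_seq_thresh_kb=32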
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
We deployed with just putting the journal on an SSD directly; why would this not work for you? Just wondering really :)

Thanks!

~Noah
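For anyone following along, the plain journal-on-SSD layout described here is just the standard ceph-disk flow of that era. A rough sketch, with placeholder device names (/dev/sdb = data disk, /dev/sdc = journal SSD):

ceph-disk prepare /dev/sdb /dev/sdc    # journal partition is carved out on the SSD
ceph-disk activate /dev/sdb1           # mount the data partition and start the OSD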
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
This would be in addition to having the journal on SSD. The journal doesn't help at all with small random reads and has a fairly limited ability to coalesce writes. In my case, the SSDs we are using for journals should have plenty of bandwidth/IOPS/space to spare, so I want to see if I can get a little more out of them.

-Brendan
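A quick way to see the read side of this for yourself is to benchmark small random reads directly. A hedged sketch using rados bench against a throwaway pool (the pool name is a placeholder); these reads are served from the OSD data disks, not the journals:

rados bench -p testpool 60 write -b 4096 --no-cleanup   # seed the pool with 4k objects
rados bench -p testpool 60 rand                         # small random reads of those objects
rados -p testpool cleanup                               # remove the benchmark objects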
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
Ah, I see now. Has anyone used CacheCade (http://www.lsi.com/products/raid-controllers/pages/megaraid-cachecade-pro-software.aspx) from LSI for both the read and write cache to SSD? I don't know if you can attach a CacheCade device to a JBOD, but if you could it would probably perform really well. I submit this because I really haven't seen an open-source read-and-write SSD cache that performs as well as ZFS, for instance. And for ZFS, I don't know if you can add an SSD cache to a single drive?

Thanks!

~Noah
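On the ZFS question: cache and log devices can be attached to a pool even if it only contains a single disk. A minimal sketch with placeholder device names (just answering the question, not a suggestion to run OSDs on ZFS):

zpool create tank /dev/sdb        # single-disk pool
zpool add tank cache /dev/sdc1    # L2ARC read cache on the SSD
zpool add tank log /dev/sdc2      # SLOG, absorbs synchronous writes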
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
In long-term use I also had some issues with flashcache and enhanceio. I've noticed frequent slow requests.

Andrei
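For anyone wanting to check whether they are hitting the same thing, slow requests show up in the cluster log and health output, and recent slow ops can be pulled from an OSD's admin socket (osd.12 below is a placeholder, and the daemon command has to be run on that OSD's host):

ceph -w | grep 'slow request'          # watch the cluster log for slow request warnings
ceph health detail                     # shows which OSDs currently have blocked requests
ceph daemon osd.12 dump_historic_ops   # recent slowest ops on that OSD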
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
Hi,

On 03/19/2015 10:41 PM, Nick Fisk wrote:
> I'm looking at trialling OSDs with a small flashcache device over them to hopefully reduce the impact of metadata updates when doing small block IO. Inspiration from here: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083
> One thing I suspect will happen is that when the OSD node starts up, udev could possibly mount the base OSD partition instead of the flashcache device, as the base disk will have the Ceph partition UUID type. This could result in quite nasty corruption.

I ran into this problem with an enhanceio-based cache for one of our database servers. I think you can prevent this problem by using bcache, which is also integrated into the official kernel tree. It does not act as a drop-in replacement, but creates a new device that is only available if the cache is initialized correctly. A GPT partition table on the bcache device should be enough to allow the standard udev rules to kick in.

I haven't used bcache in this scenario yet, and I cannot comment on its speed and reliability compared to other solutions. But from the operational point of view it is safer than enhanceio/flashcache.

Best regards,
Burkhard
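A hypothetical sketch of the layout described above; device names are placeholders, and the type GUID below is the one ceph-disk appears to use for OSD data partitions, so the stock udev rules would act on the bcache device rather than the raw disk:

# Build the bcache device (sdc1 = SSD cache set, sdb = backing OSD disk);
# requires bcache-tools and a bcache-enabled kernel
make-bcache -C /dev/sdc1 -B /dev/sdb

# Put a GPT on the resulting /dev/bcache0 and mark partition 1 with the
# Ceph OSD data type GUID, so udev/ceph-disk only ever sees the cached device
sgdisk --new=1:0:0 --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/bcache0

# The raw /dev/sdb now only carries a bcache superblock, not a Ceph-typed GPT,
# so nothing should try to mount it directly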
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
We tested bcache and abandoned it for two reasons:

1. It didn't give us any better performance than journals on SSD.
2. We had lots of corruption of the OSDs and were rebuilding them frequently.

Since removing them, the OSDs have been much more stable.

Robert
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
I did look at bcache, but there are a lot of worrying messages on the mailing list about hangs and panics that have discouraged me slightly from it. I do think it is probably the best solution, but I'm not convinced about the stability.

Nick
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
On Thu, Mar 19, 2015 at 2:41 PM, Nick Fisk n...@fisk.me.uk wrote:
> I have had a look at the Ceph udev rules and can see that something similar has been done for encrypted OSDs. Am I correct in assuming that what I need to do is to create a new partition UUID type for flashcache-backed OSDs, and then create a udev rule to activate these newly typed OSDs once flashcache has finished assembling them?

I haven't worked with the udev rules in a while, but that sounds like the right way to go.
-Greg
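For what it's worth, a purely hypothetical sketch of what such a rule could look like, modelled on the existing ceph-disk rules that key off ID_PART_ENTRY_TYPE. The type GUID is a made-up placeholder for whatever new UUID gets assigned to flashcache-backed OSD partitions (stamped onto the base partition with sgdisk so the stock Ceph rules ignore the raw disk), and the activation script is something you would have to write yourself: assemble the flashcache device, then run ceph-disk activate against it.

cat > /etc/udev/rules.d/96-ceph-flashcache.rules <<'EOF'
# Placeholder type GUID meaning "Ceph OSD behind flashcache"
ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="partition", ENV{ID_PART_ENTRY_TYPE}=="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", RUN+="/usr/local/sbin/ceph-flashcache-activate /dev/$name"
EOF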