We currently have a Gluster array of three bare-metal servers in a Replicate 1x3 configuration. The volume holds about 1.1 TB of data and is configured for 3.7 TB of total space. The array mostly hosts mail in Maildir format, although we'd also like it to host some Proxmox VMs. The problem is that the Gluster array is so slow that booting VMs from it makes Proxmox time out! We've instead started experimenting with Gluster's NFS server to host the VMs, which is much faster, but there are obvious issues with stability. We're not really hosting anything important yet; this is still an experiment. Except for all our mail, of course.

The e-mail performance isn't spectacularly fast, but mostly bearable at the moment.

The real meat of this post, however, is "What do we do about this?" I figured that I had built a slow RAID configuration (disk utilization was very high), so I took down one of the Gluster nodes and rebuilt it as a RAID 0 array. This meant starting again with a completely empty disk, and once I had rebuilt the node and started the volume heal, the heal absolutely slaughtered performance. Our mail server got so slow that webmail was unusable. Healing the volume takes days to move 1.1 TB of data, and we couldn't just let it run with performance that bad, so I stopped the Gluster daemon during the day and only ran it at night. It took two whole weeks to completely heal the volume this way, even with the heal running over the weekend for two days straight.
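
To put a rough number on that heal, here is a back-of-the-envelope sketch in Python. It assumes 1.1 TB means decimal terabytes. The 12 hours per weeknight is a hypothetical figure I picked for illustration and not something I measured. Treat the result as an order-of-magnitude estimate only.

```python
def heal_throughput_mb_s(data_bytes: float, hours: float) -> float:
    """Average throughput in decimal MB/s for moving data_bytes in the given hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return data_bytes / (hours * 3600) / 1e6


# 1.1 TB of data to heal, assuming decimal terabytes (an assumption).
DATA_BYTES = 1.1e12

# The heal took two whole weeks of wall-clock time.
WALL_CLOCK_HOURS = 14 * 24

# Hypothetical: the heal ran 12 hours on each of 10 weeknights,
# plus two full 48-hour weekends. These hours are illustrative only.
ASSUMED_NIGHT_HOURS = 12
ACTIVE_HOURS = 10 * ASSUMED_NIGHT_HOURS + 2 * 48

wall_clock_rate = heal_throughput_mb_s(DATA_BYTES, WALL_CLOCK_HOURS)
active_rate = heal_throughput_mb_s(DATA_BYTES, ACTIVE_HOURS)

print(f"Average over wall-clock time: {wall_clock_rate:.2f} MB/s")
print(f"Average over assumed active hours: {active_rate:.2f} MB/s")
```

Averaged over the full two weeks, that works out to roughly 0.9 MB/s. Even under the assumed active hours, the rate stays in the low single digits of MB/s.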

So what happens when we add more Gluster nodes to this array? Or if we want to upgrade the hardware in the array in any way? Or if I want to make any other changes to the array? It seems that Gluster's promise of high availability amounts to "things will keep working, but they'll be so slow in the meantime that nobody wants to use the services built on top of it", and the same is true whenever a node has to be taken offline for an extended period and the array has to be healed again.

This is a serious issue with the performance of heal operations. What can I do to fix it?

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users
