Re: [ceph-users] XFS and nobarriers on Intel SSD

2016-03-03 Thread Maxime Guyot
Hello, It looks like this thread is one of the main Google hits on this issue, so let me bring an update. I experienced the same symptoms with Intel S3610 and LSI 2208. The logs reported “task abort!” messages on a daily basis since November: Write(10): 2a 00 0e 92 88 90 00 00 10 00 scsi
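For anyone landing here from the same search: the “task abort!” lines can be counted straight out of a saved kernel log to see whether a firmware or mount-option change helped. A minimal sketch — `sample.log` and its contents are made up here to mimic the mpt2sas output quoted in these reports:

```shell
# Count LSI/mpt2sas "task abort!" events in a saved kernel log.
# sample.log is a stand-in for /var/log/kern.log or `dmesg` output.
cat > sample.log <<'EOF'
sd 0:0:1:0: attempting task abort! scmd(ffff880fb8b38a00)
sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 0e 92 88 90 00 00 10 00
sd 0:0:1:0: task abort: SUCCESS scmd(ffff880fb8b38a00)
EOF
grep -c 'task abort!' sample.log
```

Piping real logs through the same `grep -c`, bucketed per day, makes it easy to compare before/after rates the way the posters below do.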

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Adam Heczko
fisk.me.uk> wrote: > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > >> Christian Balzer > >> Sent: 14 September 2015 09:43 > >> To: c

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Christian Balzer
-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Richard Bade Sent: 14 September 2015 01:31 > Cc: ceph-us...@ceph.com > Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD > > > > Hi Everyone, > > I updated the firmware on 3 S3710 drive

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Nick Fisk
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Christian Balzer > Sent: 14 September 2015 09:43 > To: ceph-us...@ceph.com > Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD > > > Hello, >

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Jan Schermer
>> That said, having to disable barriers to make Avago/LSI happy is not >> something that gives me the warm fuzzies. >> >> Christian >>> Maybe running with the NOOP scheduler and nobarriers may be safe, but >>> unless someon

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-14 Thread Nick Fisk
nobarriers with CFQ or Deadline. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Richard Bade Sent: 14 September 2015 01:31 Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD Hi Everyone, I updated the firmware on 3 S3710 drives (one host
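Whichever side of the NOOP vs. CFQ/Deadline debate above one lands on, the active scheduler can be read and changed per device at runtime. A sketch, with `/dev/sda` and the `noop` choice as assumptions (newer blk-mq kernels expose `none`/`mq-deadline` instead):

```shell
# Show the current scheduler; the bracketed entry is the active one.
cat /sys/block/sda/queue/scheduler      # e.g. "noop deadline [cfq]"
# Switch to noop (needs root). This is lost on reboot unless persisted
# via a udev rule or the elevator= kernel boot parameter.
echo noop > /sys/block/sda/queue/scheduler
```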

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-13 Thread Richard Bade
Hi Everyone, I updated the firmware on 3 S3710 drives (one host) last Tuesday and have not seen any ATA resets or Task Aborts on that host in the 5 days since. I also set nobarriers on another host on Wednesday and have only seen one Task Abort, and that was on an S3710. I have seen 18 ATA
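For reference, the "nobarriers" being set in these reports is the XFS `nobarrier` mount option. A sketch of applying it to a running OSD filesystem — the mount point and device are assumptions, and this is only sensible on drives with power-loss-protected caches such as the S3700/S3710:

```shell
# Remount an XFS filesystem without write barriers (needs root).
mount -o remount,nobarrier /var/lib/ceph/osd/ceph-0
# Confirm the option took effect:
mount | grep ceph-0
# To persist it, add nobarrier to the options column in /etc/fstab, e.g.:
# /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,nobarrier  0 0
```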

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Jan Schermer
Take a look at this: http://monolight.cc/2011/06/barriers-caches-filesystems/ LSI's answer just makes no sense to me... Jan > On 07 Sep 2015, at 11:07, Jan Schermer wrote: > > Are you absolutely sure there's

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Jan Schermer
ilto:j...@schermer.cz>> > Sent: 7 September 2015 12:07 > To: Richard Bade > Cc: ceph-us...@ceph.com <mailto:ceph-us...@ceph.com> > Subject: Re: [ceph-users] XFS and nobarriers on Intel SSD > > Are you absolutely sure there's nothing in dmesg before this? The

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Andrey Korolyov
On Mon, Sep 7, 2015 at 12:54 PM, Paul Mansfield wrote: > > > On 04/09/15 20:55, Richard Bade wrote: >> We have a Ceph pool that is entirely made up of Intel S3700/S3710 >> enterprise SSD's. >> >> We are seeing some significant I/O delays on the disks causing a

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Christian Balzer
Hello, Note that I see exactly your errors (in a non-Ceph environment) with both Samsung 845DC EVO and Intel DC S3610. Though I need to stress things quite a bit to make it happen. Also setting nobarrier did alleviate it, but didn't fix it 100%, so I guess something still issues flushes at some
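One way to check Christian's hunch that something still issues flushes even with nobarrier set is to watch for flush requests at the block layer with blktrace. A sketch — the device name is an assumption, both tools need root, and the exact RWBS column position can vary between blkparse versions:

```shell
# Trace the device live and keep only flush traffic: blkparse marks
# flush/FUA requests with an F in the RWBS field (e.g. "FWS" for a
# flush+write+sync). Here RWBS is assumed to be the 7th column.
blktrace -d /dev/sdb -o - | blkparse -i - | awk '$7 ~ /F/'
```

If lines still appear after mounting with nobarrier, some other layer (journal, fsync from the application, the drive stack itself) is issuing the flushes.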

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Richard Bade
Thanks guys for the pointers to this Intel thread: https://communities.intel.com/thread/77801 It looks promising. I intend to update the firmware on disks in one node tonight and will report back after a few days to a week on my findings. I've also posted to that forum and will update there
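For anyone following along, the drive model and firmware revision can be confirmed before and after flashing. A sketch — the device path and the `isdct` drive index are assumptions (`isdct` is Intel's SSD Data Center Tool referenced in that Intel forum thread):

```shell
# Report model and current firmware revision for an ATA drive.
smartctl -i /dev/sdb | grep -E 'Device Model|Firmware Version'
# List Intel SSDs and load the newer firmware with Intel's isdct tool:
isdct show -intelssd
isdct load -intelssd 0      # 0 = drive index from the show command
```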

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Richard Bade
Hi Christian, On 8 September 2015 at 14:02, Christian Balzer wrote: > > Indeed. But first a word about the setup where I'm seeing this. > These are 2 mailbox server clusters (2 nodes each), replicating via DRBD > over Infiniband (IPoIB at this time), LSI 3008 controller. One

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Richard Bade
Hi Christian, Thanks for the info. I'm just wondering, have you updated your S3610s with the new firmware that was released on 21/08 as referred to in the thread? We thought we weren't seeing the issue on the Intel controller at first too, but after further investigation it turned out we

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Christian Balzer
Hello, On Tue, 8 Sep 2015 13:40:36 +1200 Richard Bade wrote: > Hi Christian, > Thanks for the info. I'm just wondering, have you updated your S3610's > with the new firmware that was released on 21/08 as referred to in the > thread? I did so earlier today, see below. > We thought we weren't

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-07 Thread Paul Mansfield
On 04/09/15 20:55, Richard Bade wrote: > We have a Ceph pool that is entirely made up of Intel S3700/S3710 > enterprise SSD's. > > We are seeing some significant I/O delays on the disks causing a “SCSI > Task Abort” from the OS. This seems to be triggered by the drive > receiving a “Synchronize

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-04 Thread Jan Schermer
>> We are seeing some significant I/O delays on the disks causing a “SCSI Task >> Abort” from the OS. This seems to be triggered by the drive receiving a >> “Synchronize cache command”. >> >> How exactly do you know this is the cause? This is usually just an effect of something going wrong
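Jan's point about finding the real command behind the abort is easier when the quoted CDB is decoded. A sketch decoding the Write(10) CDB from the logs quoted earlier in the thread, using only shell arithmetic (no SCSI tooling assumed); in a WRITE(10) CDB, bytes 2-5 carry the LBA and bytes 7-8 the transfer length:

```shell
# Decode the WRITE(10) CDB seen in these reports.
cdb="2a 00 0e 92 88 90 00 00 10 00"
set -- $cdb
# $1 = opcode (0x2a = WRITE(10)); $3..$6 = 32-bit LBA; $8..$9 = length.
lba=$(( 0x$3 << 24 | 0x$4 << 16 | 0x$5 << 8 | 0x$6 ))
len=$(( 0x$8 << 8 | 0x$9 ))
echo "opcode=$1 lba=$lba blocks=$len"
```

So the aborted command was an ordinary 16-block write, which supports the view that the abort is error recovery rather than the root cause itself.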

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-04 Thread Richard Bade
Hi Jan, Thanks for your response. > How exactly do you know this is the cause? This is usually just an effect > of something going wrong and part of error recovery process. Preceding > this event should be the real error/root cause... We have been working with LSI/Avago to resolve this. We