Re: ata errors in dmesg/syslog - any pointers from the more ATA/AHCI literate?
On Thu, Jan 21, 2016 at 04:34:39AM +, Anthony wrote:
> Would I be right in thinking that this kind of smart failure could not
> be triggered by the controller, and rather it's a drive fault, because
> the tests are run wholly within the drive itself and all that goes
> between drive and computer is the request to start the test, and the
> test results?

yep, the smart tests are run on the drive itself. hope you've got a backup.

> My thoughts are that besides the bitching of the Marvell virtual ATA
> device (which perhaps was passing stuff through to the 1TB device),
> all the errors could be attributed to the now failed drive?

possibly. but IIRC you said it was whinging about the DVD drive for ages too.

craig

-- 
craig sanders

BOFH excuse #447: According to Microsoft, it's by design

___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main
Re: ata errors in dmesg/syslog - any pointers from the more ATA/AHCI literate?
It got worse.. writing this from webmail :)

> > It has been awhile since I got this machine (April '10), though the 3TB
> > drive is a lot newer than the rest of it.
> >
> > Today's blargh (turns out knowing basic SMTP is handy in these situations :)):
> >
> > [200860.130029] [ cut here ]
> > [200860.130035] WARNING: CPU: 3 PID: 26390 at /build/linux-AFqQDb/linux-4.2.0/fs/buffer.c:1160 mark_buffer_dirty+0xf3/0x100()
> > [200860.130036] Modules linked in: nls_utf8 btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs
> > [200860.130044] Buffer I/O error on dev sdb1, logical block 0, lost sync page write
> > [200860.130045] xfs libcrc32c cpuid binfmt_misc nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache bnep rfcomm bluetooth uas usb_storage pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nvidia(POE) coretemp kvm_intel mxm_wmi snd_hda_codec_realtek snd_hda_codec_generic i7core_edac kvm snd_hda_intel snd_hda_codec snd_hda_core gpio_ich snd_hwdep snd_pcm drm edac_core input_leds snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq serio_raw snd_seq_device snd_timer wmi snd 8250_fintek shpchp soundcore lpc_ich mac_hid sunrpc parport_pc ppdev lp parport autofs4 pata_acpi hid_generic usbhid hid firewire_ohci firewire_core r8169 pata_it8213 crc_itu_t mii ahci libahci
> > [200860.130079] CPU: 3 PID: 26390 Comm: Cache2 I/O Tainted: P OE 4.2.0-23-generic #28-Ubuntu
> > [200860.130081] Hardware name: Gigabyte Technology Co., Ltd. P55A-UD4/P55A-UD4, BIOS F15 09/16/2010
> > [200860.130082] 7621f8ae 8801486a7b48 817e94c9
> > [200860.130084] 8801486a7b88 8107b3d6
> > [200860.130086] 81ac2d38 81d2a8a0 880211ff80d0 012e8320
> > [200860.130087] Call Trace:
> > ...
>
> not good. there's something really messed up with your system, and my best
> guess is that it's the motherboard or, at least, the sata controllers on it.
My computer auto-boots if it's shut down, so that if I go to work and I
turned it off overnight, it'll be online by the time I'm in the office.
This morning, when I woke up, I heard the GPU fan running high (which only
happens when the GPU driver hasn't been loaded yet) and was greeted by a
disk error explosion on the 1TB drive.

Would I be right in thinking that this kind of smart failure could not
be triggered by the controller, and rather it's a drive fault, because
the tests are run wholly within the drive itself and all that goes
between drive and computer is the request to start the test, and the
test results?

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black
Device Model:     WDC WD1002FAEX-00Z3A0
Serial Number:    WD-WCATR056
LU WWN Device Id: 5 0014ee 2044331c2
Firmware Version: 05.01D05
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 6.0 Gb/s
Local Time is:    Thu Jan 21 04:23:23 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
200 Multi_Zone_Error_Rate   0x0008   200   197   000   Old_age   Offline   -   8
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure  90%        43110            486750687
# 2  Conveyance offline  Completed: read failure  90%        43110            486750687
# 3  Extended offline    Completed: read failure  90%        43110            486750687

My thoughts are that besides the bitching of the Marvell virtual ATA
device (which perhaps was passing stuff through to the 1TB device), all
the errors could be attributed to the now failed drive?
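[For reference, the self-test log above can be reproduced with smartmontools. A minimal sketch, assuming the failing 1TB drive is /dev/sdb (substitute your own device) and that smartctl is installed; the arithmetic at the end converts the logged LBA_of_first_error to a byte offset, using the 512-byte logical sectors shown in the INFORMATION SECTION:]

```shell
dev=/dev/sdb   # assumption: adjust to the failing 1TB WD Black

# The three self-test types shown in the log above (all need root):
#   smartctl -t short      "$dev"   # ~2 minutes, electrical + small read scan
#   smartctl -t conveyance "$dev"   # checks for transport/handling damage
#   smartctl -t long       "$dev"   # full surface read
#   smartctl -l selftest   "$dev"   # read the self-test log afterwards

# All three tests stopped at the same LBA, so the byte offset of the
# first unreadable sector (512-byte logical sectors) is:
echo $((486750687 * 512))
```

[Useful if you want to poke at the bad region directly with dd or hdparm --read-sector.]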
debugging disk performance problems
http://etbe.coker.com.au/2016/01/21/storage-performance-problems/

The above blog post might be useful for some people.

-- 
My Main Blog       http://etbe.coker.com.au/
My Documents Blog  http://doc.coker.com.au/
Re: Mail Server Really Slow
James Harper via luv-main wrote:
>> On Wed, Jan 20, 2016 at 07:28:38AM +, James Harper wrote:
>>> As long as I remember to replace the To: with luv-main each time I
>>> reply, I guess it's workable.
>>
>> that happens even on just plain Replies, too - not just Reply-All?
>>
>> that's weird because the list munges the From: address, so a reply
>> should go to the list.
>
> Yep. Reply and Reply-All from Outlook 2016. Not sure who to blame for
> standards violation here...

Well, from SeaMonkey-mail "Reply" would just go to the "Reply-To:"
address (James Harper in this case); "Reply All" goes to "Reply-To" (as
above) AND the "To" address (luv-main@luv.asn.au in this case). I am
assuming James doesn't want two emails, so I am deleting his direct
address!

As far as I can recall SeaMonkey-mail has always behaved this way, and
nothing Russel has done has changed this!

Does this add anything to the discussion?

regards
Rohan McLeod
Re: Mail Server Really Slow
On Wed, Jan 20, 2016 at 07:28:38AM +, James Harper wrote:
> As long as I remember to replace the To: with luv-main each time I
> reply, I guess it's workable.

that happens even on just plain Replies, too - not just Reply-All?

that's weird because the list munges the From: address, so a reply
should go to the list.

> > 233 Remaining_Lifetime_Perc 0x   067   067   000   Old_age   Offline   -   67
>
> 233 is reported as Media Wearout Indicator on the drives I just
> checked on a BSD box, so I guess it's the same thing but with a
> different description for whatever reason.

i dunno if that name comes from the drive itself or from the smartctl
software. that could be the difference.

> > I assume that means I´ve used up about 1/3rd of its expected life. Not
> > bad, considering i've been running it for 500 days total so far:
> >
> > 9 Power_On_Hours 0x   100   100   000   Old_age   Offline   -   12005
> >
> > 12005 hours is 500 days. or 1.3 years.

> I just checked the server that burned out the disks pretty quick last
> time (RAID1 zfs cache, so both went around the same time), and it

i suppose read performance is doubled, but there's not really any point
in RAIDing L2ARC. it's transient data that gets wiped on boot anyway.
better to have two l2arc cache partitions and two ZIL partitions.

and not raiding the l2arc should spread the write load over the 2 SSDs
and probably increase longevity.

my pair of OCZ drives have mdadm RAID-1 (xfs) for the OS + /home and
another 1GB RAID1 (ext4) for /boot, and just partitions for L2ARC and
ZIL. zfs mirrors the ZIL (essential for safety, don't want to lose the
ZIL if one drive dies!) if you give it two or more block devices anyway,
and it uses two or more block devices as independent L2ARCs (so double
the capacity).
$ zpool status export -v
  pool: export
 state: ONLINE
  scan: scrub repaired 0 in 4h50m with 0 errors on Sat Jan 16 06:03:30 2016
config:

	NAME        STATE     READ WRITE CKSUM
	export      ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sde     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	    sdg     ONLINE       0     0     0
	logs
	  sdh7      ONLINE       0     0     0
	  sdj7      ONLINE       0     0     0
	cache
	  sdh6      ONLINE       0     0     0
	  sdj6      ONLINE       0     0     0

errors: No known data errors

this pool is 4 x 1TB. i'll probably replace them later this year with
one or two mirrored pairs of 4TB drives. I've gone off RAID-5 and
RAID-Z. even with ZIL and L2ARC, performance isn't great, nowhere near
what RAID-10 (or two mirrored pairs in zfs-speak) is. like my backup
pool.

$ zpool status backup -v
  pool: backup
 state: ONLINE
  scan: scrub repaired 0 in 4h2m with 0 errors on Sat Jan 16 05:15:20 2016
config:

	NAME        STATE     READ WRITE CKSUM
	backup      ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdi     ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sdc     ONLINE       0     0     0

errors: No known data errors

this pool has the 4 x 4TB Seagate SSHDs i mentioned recently. it stores
backups for all machines on my home network.

> > and that's for an OCZ Vertex, one of the last decent drives OCZ made
> > before they started producing crap and went bust (and subsequently got

sorry, my mistake. i meant OCZ Vector.

sdh OCZ-VECTOR_OCZ-0974C023I4P2G1B8
sdj OCZ-VECTOR_OCZ-8RL5XW08536INH7R

> I've seen too many OCZ's fail within months of purchase recently, but
> not enough data points to draw conclusions from. Maybe a bad batch or
> something? They were all purchased within a month or so of each other,
> late last year. The failure mode was that the system just can't see
> the disk, except very occasionally, and then not for long enough to
> actually boot from.

i've read that the Toshiba-produced OCZs are pretty good now, so
possibly a bad batch. or sounds like you abuse the poor things with too
many writes.

even so, my next SSD will probably be a Samsung.

> Yep. I just got a 500GB 850 EVO for my laptop and it doesn't have
> any of the wearout indicators that I can see, but I doubt I'll get
> anywhere near close to wearing it out before it becomes obsolete.

that's not good. i wish disk vendors would stop crippling their SMART
implementations and treat it seriously.

craig

-- 
craig sanders
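[A sketch of how a pool with that log/cache layout is built, for anyone wanting to replicate it. Device and partition names are illustrative, following the listing above; note that with ZFS-on-Linux's zpool syntax the ZIL is only mirrored if you use the explicit `mirror` keyword - two bare log devices are striped:]

```shell
# Main raidz vdev over the four spinning disks:
zpool create export raidz sda sde sdf sdg

# Mirrored ZIL (the 'mirror' keyword matters - losing an unmirrored ZIL
# can lose recent synchronous writes):
zpool add export log mirror sdh7 sdj7

# L2ARC cache devices are always independent, so two partitions simply
# double the cache capacity; no point mirroring transient read cache:
zpool add export cache sdh6 sdj6
```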
Re: Mail Server Really Slow
On Wed, 20 Jan 2016 08:15:00 PM Craig Sanders via luv-main wrote:
> that's weird because the list munges the From: address, so a reply
> should go to the list.

On the other hand Reply-To: is set back to the original poster. So I
guess it depends whether your MUA prefers List-Post: over Reply-To:
(which this version of Kmail seems to) or vice-versa.

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
RE: Mail Server Really Slow
> On Wed, Jan 20, 2016 at 07:28:38AM +, James Harper wrote:
> > As long as I remember to replace the To: with luv-main each time I
> > reply, I guess it's workable.
>
> that happens even on just plain Replies, too - not just Reply-All?
>
> that's weird because the list munges the From: address, so a reply
> should go to the list.

Yep. Reply and Reply-All from Outlook 2016. Not sure who to blame for
standards violation here...

> > > 233 Remaining_Lifetime_Perc 0x   067   067   000   Old_age   Offline   -   67
> >
> > 233 is reported as Media Wearout Indicator on the drives I just
> > checked on a BSD box, so I guess it's the same thing but with a
> > different description for whatever reason.
>
> i dunno if that name comes from the drive itself or from the smartctl
> software. that could be the difference.

smartctl. It has a vendor database concerning what each of the values
are, so if the manufacturer of your drives says 233 = "Remaining
Lifetime Percent", and the manufacturer of my drive says 233 = "Media
Wearout Indicator", and the authors of smartmontools were aware of this,
then that's what goes in the database and that's what gets reported.

> > > I assume that means I´ve used up about 1/3rd of its expected life.
> > > Not bad, considering i've been running it for 500 days total so far:
> > >
> > > 9 Power_On_Hours 0x   100   100   000   Old_age   Offline   -   12005
> > >
> > > 12005 hours is 500 days. or 1.3 years.
> >
> > I just checked the server that burned out the disks pretty quick last
> > time (RAID1 zfs cache, so both went around the same time), and it
>
> i suppose read performance is doubled, but there's not really any point
> in RAIDing L2ARC. it's transient data that gets wiped on boot anyway.
> better to have two l2arc cache partitions and two ZIL partitions.
>
> and not raiding the l2arc should spread the write load over the 2 SSDs
> and probably increase longevity.
> > I've seen too many OCZ's fail within months of purchase recently, but
> > not enough data points to draw conclusions from. Maybe a bad batch or
> > something? They were all purchased within a month or so of each other,
> > late last year. The failure mode was that the system just can't see
> > the disk, except very occasionally, and then not for long enough to
> > actually boot from.
>
> i've read that the Toshiba-produced OCZs are pretty good now, so
> possibly a bad batch. or sounds like you abuse the poor things with too
> many writes.

Nah, these particular ones were just in PCs, and were definitely not
worn out (on the one occasion where I actually got one to read for a
bit, the SMART values were all fine). Servers get SSDs with supercaps :)

> even so, my next SSD will probably be a Samsung.

Despite initial reservations (funny how you can easily find bad reports
on any brand!) I have been impressed with the performance and longevity
of the Samsungs, but I still don't have enough datapoints.

James
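[The database behaviour described above can be checked from the command line; a hedged sketch, with /dev/sda standing in for whichever SSD you're querying:]

```shell
# Show which drivedb entry smartctl matched (the entry supplies the
# attribute labels):
#   smartctl -P show /dev/sda
# Or relabel attribute 233 yourself for one run:
#   smartctl -A -v 233,raw48,Media_Wearout_Indicator /dev/sda

# Whatever the label, a normalized value of 067 means about two thirds
# of rated life left, i.e. roughly a third consumed:
echo "$((100 - 67))% used"
```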
Re: Mail Server Really Slow
On Tue, 19 Jan 2016 10:12:12 AM Piers Rowan via luv-main wrote:
> The server is a VM on a host server that also provides http / mysql
> services. The host server runs cron jobs to poll the email server
> (importing data from mail boxes into the CRM) so - to clutch at straws -
> I am not sure if the host and guest are competing for the disk IO at the
> same time with these calls. Contrary to that is that the host server
> does not experience any slow downs.

Some ideas that I've not seen mentioned yet:

1) perf top - to see where the system is spending time as a whole (and
   if you need to drill down on a process you can do perf top -p $PID).

2) latencytop - as long as your kernel has CONFIG_LATENCYTOP

3) iotop - if your version is new enough then the -o option will hide
   idle processes, otherwise just press 'o' when you get the main
   display.

Best of luck!
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
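[A condensed sketch of the three suggestions above. All need root; the mysqld process name in the perf example is only a placeholder for whatever process you want to drill into. The last line shows where iotop's numbers come from when the tool itself isn't installed:]

```shell
# Interactive tools from the list above (run as root):
#   perf top                          # system-wide: which functions burn CPU
#   perf top -p "$(pgrep -o mysqld)"  # drill into one process (name is a placeholder)
#   latencytop                        # kernel needs CONFIG_LATENCYTOP=y
#   iotop -o                          # show only processes actually doing I/O

# iotop reads per-process counters from /proc/<pid>/io, which you can
# also inspect directly, e.g. for the current shell:
grep -E '^(read|write)_bytes' /proc/self/io
```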