Re: ata errors in dmesg/syslog - any pointers from the more ATA/AHCI literate?

2016-01-20 Thread Craig Sanders via luv-main
On Thu, Jan 21, 2016 at 04:34:39AM +, Anthony wrote:
> Would I be right in thinking that this kind of smart failure could not
> be triggered by the controller, and rather it's a drive fault, because
> the tests are run wholly within the drive itself and all that goes
> between drive and computer is the request to start the test, and the
> test results?

yep, the smart tests are run on the drive itself.
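
(for the record, the whole exchange with the drive is just kicking off a test
and reading back the log the drive keeps on-board -- e.g., assuming the disk
is /dev/sdb:

$ smartctl -t short /dev/sdb      # ask the drive to run its short self-test
$ smartctl -l selftest /dev/sdb   # read the drive's own self-test log
$ smartctl -H /dev/sdb            # overall health verdict

the controller only ferries the request and the results.)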

hope you've got a backup.

> My thoughts are that besides the bitching of the Marvell virtual ATA
> device (which perhaps was passing stuff through to the 1TB device),
> all the errors could be attributed to the now failed drive?

possibly. but IIRC you said it was whinging about the DVD drive for ages
too.

craig

-- 
craig sanders 

BOFH excuse #447:

According to Microsoft, it's by design
___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main


Re: ata errors in dmesg/syslog - any pointers from the more ATA/AHCI literate?

2016-01-20 Thread Anthony via luv-main
It got worse.. writing this from webmail :)

> > It has been a while since I got this machine (April '10), though the 3TB
> > drive is a lot newer than the rest of it.
> >
> > Today's blargh (turns out knowing basic SMTP is handy in these situations 
> > :)):
> > [200860.130029] [ cut here ]
> > [200860.130035] WARNING: CPU: 3 PID: 26390 at 
> > /build/linux-AFqQDb/linux-4.2.0/fs/buffer.c:1160 
> > mark_buffer_dirty+0xf3/0x100()
> > [200860.130036] Modules linked in: nls_utf8 btrfs xor raid6_pq ufs qnx4 
> > hfsplus hfs minix ntfs msdos jfs
> > [200860.130044] Buffer I/O error on dev sdb1, logical block 0, lost sync 
> > page write
> > [200860.130045]  xfs libcrc32c cpuid binfmt_misc nfsv3 nfs_acl 
> > rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache bnep rfcomm 
> > bluetooth uas usb_storage pci_stub vboxpci(OE) vboxnetadp(OE) 
> > vboxnetflt(OE) vboxdrv(OE) nvidia(POE) coretemp kvm_intel mxm_wmi 
> > snd_hda_codec_realtek snd_hda_codec_generic i7core_edac kvm snd_hda_intel 
> > snd_hda_codec snd_hda_core gpio_ich snd_hwdep snd_pcm drm edac_core 
> > input_leds snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq serio_raw 
> > snd_seq_device snd_timer wmi snd 8250_fintek shpchp soundcore lpc_ich 
> > mac_hid sunrpc parport_pc ppdev lp parport autofs4 pata_acpi hid_generic 
> > usbhid hid firewire_ohci firewire_core r8169 pata_it8213 crc_itu_t mii ahci 
> > libahci
> > [200860.130079] CPU: 3 PID: 26390 Comm: Cache2 I/O Tainted: P   OE  
> >  4.2.0-23-generic #28-Ubuntu
> > [200860.130081] Hardware name: Gigabyte Technology Co., Ltd. 
> > P55A-UD4/P55A-UD4, BIOS F15 09/16/2010
> > [200860.130082]   7621f8ae 8801486a7b48 
> > 817e94c9
> > [200860.130084]    8801486a7b88 
> > 8107b3d6
> > [200860.130086]  81ac2d38 81d2a8a0 880211ff80d0 
> > 012e8320
> > [200860.130087] Call Trace:
> > ...
>
> not good.  there's something really messed up with your system, and my best
> guess is that it's the motherboard... or, at least, the sata controllers
> on it.

My computer auto-boots after being shut down, so if I turn it off overnight
before going to work, it'll be back online by the time I'm in the office.

This morning, when I woke up, I heard the GPU fan running high (which
only happens when the GPU driver hasn't been loaded yet) and was greeted
by an explosion of disk errors on the 1TB drive.

Would I be right in thinking that this kind of smart failure could not
be triggered by the controller, and rather it's a drive fault, because
the tests are run wholly within the drive itself and all that goes
between drive and computer is the request to start the test, and the
test results?

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black
Device Model:     WDC WD1002FAEX-00Z3A0
Serial Number:    WD-WCATR056
LU WWN Device Id: 5 0014ee 2044331c2
Firmware Version: 05.01D05
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 6.0 Gb/s
Local Time is:    Thu Jan 21 04:23:23 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
200 Multi_Zone_Error_Rate   0x0008   200   197   000    Old_age   Offline      -       8
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      43110         486750687
# 2  Conveyance offline  Completed: read failure       90%      43110         486750687
# 3  Extended offline    Completed: read failure       90%      43110         486750687
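
As a quick sanity check one can try to read that LBA straight off the disk and
see whether the drive returns an I/O error -- just a sketch, assuming the
failing drive is /dev/sdb and the 512-byte sectors reported above:

$ sudo dd if=/dev/sdb of=/dev/null bs=512 skip=486750687 count=1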

My thoughts are that besides the bitching of the Marvell virtual ATA
device (which perhaps was passing stuff through to the 1TB device),
all the errors could be attributed to the now failed drive?
___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main


debugging disk performance problems

2016-01-20 Thread Russell Coker via luv-main
http://etbe.coker.com.au/2016/01/21/storage-performance-problems/

The above blog post might be useful for some people.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main


Re: Mail Server Really Slow

2016-01-20 Thread Rohan McLeod via luv-main

James Harper via luv-main wrote:

On Wed, Jan 20, 2016 at 07:28:38AM +, James Harper wrote:

As long as I remember to replace the To: with luv-main each time I
reply, I guess it's workable.

that happens even on just plain Replies, too - not just Reply-All?

that's weird because the list munges the From: address, so a reply should go
to the list.


Yep. Reply and Reply-All from Outlook 2016. Not sure who to blame for standards 
violation here...


Well, from SeaMonkey-mail "Reply" would just go to the "Reply-To:"
address (James Harper in this case).
"Reply All" goes to the "Reply-To:" address (as above) AND the "To:"
address (luv-main@luv.asn.au in this case).
I am assuming James doesn't want two emails, so I am deleting his direct
address!
As far as I can recall SeaMonkey-mail has always behaved this way, and
nothing Russell has done has changed this!

Does this add anything to the discussion?


regards 
Rohan McLeod

___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main


Re: Mail Server Really Slow

2016-01-20 Thread Craig Sanders via luv-main
On Wed, Jan 20, 2016 at 07:28:38AM +, James Harper wrote:
> As long as I remember to replace the To: with luv-main each time I
> reply, I guess it's workable.

that happens even on just plain Replies, too - not just Reply-All?

that's weird because the list munges the From: address, so a reply
should go to the list.


> >   233 Remaining_Lifetime_Perc 0x   067   067   000    Old_age   Offline      -       67
>
> 233 is reported as Media Wearout Indicator on the drives I just
> checked on a BSD box, so I guess it's the same thing but with a
> different description for whatever reason.

i dunno if that name comes from the drive itself or from the smartctl
software. that could be the difference.

> > I assume that means I've used up about 1/3rd of its expected life.  Not
> > bad, considering i've been running it for 500 days total so far:
> > 
> > 9 Power_On_Hours  0x   100   100   000    Old_age   Offline      -       12005
> > 
> > 12005 hours is 500 days.  or 1.3 years.
> 
> I just checked the server that burned out the disks pretty quick last
> time (RAID1 zfs cache, so both went around the same time), and it

i suppose read performance is doubled, but there's not really any point
in RAIDing L2ARC. it's transient data that gets wiped on boot anyway.
better to have two l2arc cache partitions and two ZIL partitions.

and not raiding the l2arc should spread the write load over the 2 SSDs
and probably increase longevity.
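
just as a sketch (placeholder pool and device names): an explicit 'mirror'
keeps the zil redundant, while cache devices listed separately stay
independent:

$ zpool add tank log mirror sdX4 sdY4    # mirrored ZIL / slog
$ zpool add tank cache sdX5 sdY5         # two independent l2arc devices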


my pair of OCZ drives have mdadm RAID-1 (xfs) for the OS + /home and
another 1GB RAID1 (ext4) for /boot, and just partitions for L2ARC and
ZIL. zfs mirrors the ZIL (essential for safety, don't want to lose the
ZIL if one drive dies!) if you give it two or more block devices anyway,
and it uses two or more block devices as independent L2ARCs (so double
the capacity).


$ zpool status export -v
  pool: export
 state: ONLINE
  scan: scrub repaired 0 in 4h50m with 0 errors on Sat Jan 16 06:03:30 2016
config:

NAMESTATE READ WRITE CKSUM
export  ONLINE   0 0 0
  raidz1-0  ONLINE   0 0 0
sda ONLINE   0 0 0
sde ONLINE   0 0 0
sdf ONLINE   0 0 0
sdg ONLINE   0 0 0
logs
  sdh7  ONLINE   0 0 0
  sdj7  ONLINE   0 0 0
cache
  sdh6  ONLINE   0 0 0
  sdj6  ONLINE   0 0 0

errors: No known data errors

this pool is 4 x 1TB. i'll probably replace them later this year with
one or two mirrored pairs of 4TB drives.  I've gone off RAID-5 and
RAID-Z.  even with ZIL and L2ARC, performance isn't great, nowhere near
what RAID-10 (or two mirrored pairs in zfs-speak) is.  like my backup pool.

$ zpool status backup -v
  pool: backup
 state: ONLINE
  scan: scrub repaired 0 in 4h2m with 0 errors on Sat Jan 16 05:15:20 2016
config:

NAMESTATE READ WRITE CKSUM
backup  ONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
sdb ONLINE   0 0 0
sdi ONLINE   0 0 0
  mirror-1  ONLINE   0 0 0
sdd ONLINE   0 0 0
sdc ONLINE   0 0 0

errors: No known data errors

this pool has the 4 x 4TB Seagate SSHDs i mentioned recently.  it stores
backups for all machines on my home network.
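
for reference, a two-mirrored-pairs pool like that is created along these
lines (same device names as the status output above, purely as a sketch):

$ zpool create backup mirror sdb sdi mirror sdd sdc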


> > and that's for an OCZ Vertex, one of the last decent drives OCZ made
> > before they started producing crap and went bust (and subsequently
> > got

sorry, my mistake.  i meant OCZ Vector.

sdh OCZ-VECTOR_OCZ-0974C023I4P2G1B8
sdj OCZ-VECTOR_OCZ-8RL5XW08536INH7R


> I've seen too many OCZ's fail within months of purchase recently, but
> not enough data points to draw conclusions from. Maybe a bad batch or
> something? They were all purchased within a month or so of each other,
> late last year. The failure mode was that the system just can't see
> the disk, except very occasionally, and then not for long enough to
> actually boot from.

i've read that the Toshiba-produced OCZs are pretty good now, so
possibly a bad batch. or sounds like you abuse the poor things with too
many writes.

even so, my next SSD will probably be a Samsung.

> Yep. I just got a 500GB 850 EVO for my laptop and it doesn't have
> any of the wearout indicators that I can see, but I doubt I'll get
> anywhere near close to wearing it out before it becomes obsolete.

that's not good. i wish disk vendors would stop crippling their SMART
implementations and treat it seriously.


craig

-- 
craig sanders 
___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main


Re: Mail Server Really Slow

2016-01-20 Thread Chris Samuel via luv-main
On Wed, 20 Jan 2016 08:15:00 PM Craig Sanders via luv-main wrote:

> that's weird because the list munges the From: address, so a reply
> should go to the list.

On the other hand Reply-To: is set back to the original poster.

So I guess it depends on whether your MUA prefers List-Post: over Reply-To:
(which this version of Kmail seems to) or vice-versa.
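
For reference, a post delivered via the list carries headers roughly along
these lines (addresses illustrative, not copied from a real message), and it's
up to the MUA which of them a plain Reply honours:

From: Some Poster via luv-main <luv-main@luv.asn.au>
Reply-To: Some Poster <poster@example.org>
List-Post: <mailto:luv-main@luv.asn.au>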

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main


RE: Mail Server Really Slow

2016-01-20 Thread James Harper via luv-main
> 
> On Wed, Jan 20, 2016 at 07:28:38AM +, James Harper wrote:
> > As long as I remember to replace the To: with luv-main each time I
> > reply, I guess it's workable.
> 
> that happens even on just plain Replies, too - not just Reply-All?
> 
> that's weird because the list munges the From: address, so a reply should go
> to the list.
> 

Yep. Reply and Reply-All from Outlook 2016. Not sure who to blame for standards 
violation here...

> > >   233 Remaining_Lifetime_Perc 0x   067   067   000    Old_age   Offline   -   67
> >
> > 233 is reported as Media Wearout Indicator on the drives I just
> > checked on a BSD box, so I guess it's the same thing but with a
> > different description for whatever reason.
> 
> i dunno if that name comes from the drive itself or from the smartctl
> software. that could be the difference.

smartctl. It has a vendor database of what each of the values means, so
if the manufacturer of your drives says 233 = "Remaining Lifetime Percent", and 
the manufacturer of my drive says 233 = "Media Wearout Indicator", and the 
authors of smartmontools were aware of this, then that's what goes in the 
database and that's what gets reported.
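
For what it's worth, you can see which database entry smartctl has matched for
a given drive (and the attribute names it will use) with something like this,
where the device name is just an example:

$ smartctl -P show /dev/sda

(-P showall dumps every preset in the database.)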

> 
> > > I assume that means I've used up about 1/3rd of its expected life.
> > > Not bad, considering i've been running it for 500 days total so far:
> > >
> > > 9 Power_On_Hours  0x   100   100   000    Old_age   Offline  -  12005
> > >
> > > 12005 hours is 500 days.  or 1.3 years.
> >
> > I just checked the server that burned out the disks pretty quick last
> > time (RAID1 zfs cache, so both went around the same time), and it
> 
> i suppose read performance is doubled, but there's not really any point in
> RAIDing L2ARC. it's transient data that gets wiped on boot anyway.
> better to have two l2arc cache partitions and two ZIL partitions.
> 
> and not raiding the l2arc should spread the write load over the 2 SSDs and
> probably increase longevity.
> 



> 
> > I've seen too many OCZ's fail within months of purchase recently, but
> > not enough data points to draw conclusions from. Maybe a bad batch or
> > something? They were all purchased within a month or so of each other,
> > late last year. The failure mode was that the system just can't see
> > the disk, except very occasionally, and then not for long enough to
> > actually boot from.
> 
> i've read that the Toshiba-produced OCZs are pretty good now, so possibly a
> bad batch. or sounds like you abuse the poor things with too many writes.
> 

Nah, these particular ones were just in PCs, and were definitely not worn out 
(on the one occasion where I actually got one to read for a bit, the SMART 
values were all fine). Servers get SSDs with supercaps :)

> even so, my next SSD will probably be a Samsung.

Despite initial reservations (funny how you can easily find bad reports on any 
brand!) I have been impressed with the performance and longevity of the 
Samsungs, but I still don't have enough datapoints.

James
___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main


Re: Mail Server Really Slow

2016-01-20 Thread Chris Samuel via luv-main
On Tue, 19 Jan 2016 10:12:12 AM Piers Rowan via luv-main wrote:

> The server is a VM on a host server that also provides http / mysql 
> services. The host server runs cron jobs to poll the email server 
> (importing data from mail boxes into the CRM) so - to clutch at straws - 
> I am not sure if the host and guest are competing for the disk IO at the 
> same time with these calls. Contrary to that is that the host server 
> does not experience any slow downs.

Some ideas that I've not seen mentioned yet (quick examples after the list):

1) perf top - to see where the system is spending time as a whole (and if you 
need to drill down on a process you can do perf top -p $PID).

2) latencytop - as long as your kernel has CONFIG_LATENCYTOP

3) iotop - if your version is new enough then the -o option will hide idle 
processes, otherwise just press 'o' when you get the main display.
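
Quick examples, assuming you've already spotted a suspect process ($PID is a
placeholder for its process ID):

$ sudo perf top -p $PID     # hot functions for just that process
$ sudo latencytop           # what tasks are blocking on
$ sudo iotop -o             # only processes currently doing I/O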

Best of luck!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
luv-main mailing list
luv-main@luv.asn.au
http://lists.luv.asn.au/listinfo/luv-main