I have an old tower which I use to test multiple operating systems.
Each OS lives on a separate drive in a removable tray, so the drives can
be swapped as needed.  Once in a while the system would hang when the
BIOS was set to auto-detect the drives at every boot, or I would see an
occasional failure to mount the ATA boot device when Linux was started
in verbose mode--and Windows would simply freeze randomly.  The problem
was traced to the power connector on a drive tray:  I had to extract the
pins from the connector with a special tool, cut off the wires, soak the
pins in contact cleaner, and solder them back on, because the crimped
connection and the corrosion made it unreliable.

http://en.wikipedia.org/wiki/Molex_connector#Disk_drive_connector_.28AMP_MATE-N-LOK_1-480424-0_Power_Connector.29

http://www.molex.com/molex/products/family?key=disk_drive_power_connector&channel=PRODUCTS&chanName=family&pageTitle=Introduction

I never had a problem with these connectors before, except for the ones
in the Enermax trays (which seem to be made of the cheapest materials
they could find).  Before I repaired the power connector, I encountered
the read-only bug in Ubuntu.  When this occurred, ALL physical volumes
attached to the machine became read-only, including other hard drives
and all external USB storage devices.  Even new USB devices attached
later were not writable.  The only thing I could write to was a network
share.  If this happens on all affected platforms, it might give
developers some idea of what to look for in the source code.  I also
wonder if some power management feature could be involved:

GRUB_CMDLINE_LINUX="libata.dma=0 libata.noacpi=1"
http://ubuntuforums.org/showthread.php?t=1892483
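
For anyone trying those parameters: on a stock Ubuntu GRUB 2 setup the
line goes in /etc/default/grub and takes effect after "sudo update-grub"
and a reboot.  Here is a minimal sketch for confirming that a parameter
actually reached the running kernel.  The function name has_kernel_param
is my own invention for illustration, not an existing tool.

```shell
# Hypothetical helper: succeed if the exact parameter given as $1 appears
# in a kernel command line supplied on standard input.
has_kernel_param() {
    # Split the command line into one parameter per line, then look for
    # an exact, literal (non-regex) match of the whole parameter.
    tr ' ' '\n' | grep -qxF -- "$1"
}

# Usage on a live system:
#   has_kernel_param libata.noacpi=1 < /proc/cmdline && echo "active"
```

If the check fails after a reboot, the edit to /etc/default/grub probably
was not picked up, and any change in behavior should not be credited to
the parameter.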

I believe this bug can be triggered by other things too, such as a system
BIOS bug or AHCI setting, a drive firmware bug, defective electrolytic
capacitors on an old mainboard, bad solder joints just about anywhere, or
a defective (or overloaded) power supply.  But in the case of SSDs
it could also be a latency issue:

Why Solid-State Drives Slow Down As You Fill Them Up (Ubuntu should warn
about this)
"When filling up an empty drive, they found high write performance very
early in the process and a significant drop as the write operations
continued to fill up the drive...  If you have a solid-state drive, you
should try to avoid using more than 75% of its capacity."
http://www.howtogeek.com/165542/why-solid-state-drives-slow-down-as-you-fill-them-up/

(for general reference on dual-boot systems):
12 Things You Must Do When Running a Solid State Drive in Windows 7
http://www.maketecheasier.com/12-things-you-must-do-when-running-a-solid-state-drive-in-windows-7/

I suspect that people who experience read-only issues today were
experiencing silent write retries in previous kernel versions and simply
did not notice because the retry was successful.  It seems like the
common thread is that the drive was not ready to accept writes for some
reason, and the kernel did not detect this condition.  I tried to
simulate this by removing power to the drive momentarily.  During this
time, CPU usage was very high, but it returned to normal when power was
applied, and the read-only bug was not triggered.

On various other platforms I have seen S.M.A.R.T. drives which are NOT
defective logging an "Interface CRC error" when a 'READ DMA EXT' command
was issued, due to a cable or connector fault.  When the drive was moved
to another system, the errors stopped.  So the drive is not necessarily
failing just because you see the error count going up.

I think that a S.M.A.R.T. status monitor should be included with the
base installation: the S.M.A.R.T. feature is not only useful to diagnose
faults within the drive, it sometimes permits you to infer something
about the quality of the power & data connection over time.  If you can
consistently correlate some particular S.M.A.R.T. error code with the
behavior that causes the volume to turn read-only, then you may have
found a way to distinguish a cable fault from a kernel or firmware bug,
and the OS could use it to generate more helpful error messages.  So it
might be good to report which (if any) of the drive's S.M.A.R.T. counters
were incremented when you experience that read-only problem.
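
One way to capture those counters is to record a raw value before and
after the problem occurs.  The sketch below assumes smartmontools is
installed and that the drive reports interface CRC errors as attribute
199 (commonly named UDMA_CRC_Error_Count); the attribute ID and name vary
by vendor, so check your own "smartctl -A" output first.  The function
name smart_raw_value is made up for this example.

```shell
# Hypothetical helper: read "smartctl -A" output on standard input and
# print the raw value (10th column) of the attribute whose ID is $1.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Usage on a live system (needs root; /dev/sda is only an example):
#   sudo smartctl -A /dev/sda | smart_raw_value 199
```

If the value climbs while the drive sits in one machine and stops
climbing after the drive is moved to another, that points at the cable or
connector rather than the drive.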

I am not too familiar with the specifications, but developers might also
want to investigate the possibility of using the System Management bus
or Power Management bus to assist in characterizing these failures if
the platform collects any useful information.  For those who solved the
problem by disabling NCQ: there was an NCQ drive blacklist for the Linux
kernel until (I believe) 2.6.24.  This implies some incompatibility with
particular models.
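
For those experimenting with NCQ, the current queue depth of each disk is
exposed in sysfs, and a depth of 1 effectively means NCQ is not being
used.  This is a rough sketch only; the function name ncq_depths and its
optional directory argument are my own, added so it can be tried against
a copy of the tree.

```shell
# Hypothetical helper: print "<disk> <queue depth>" for every sd* disk
# found under the sysfs block directory ($1, default /sys/block).
ncq_depths() {
    ncq_root=${1:-/sys/block}
    for ncq_file in "$ncq_root"/sd*/device/queue_depth; do
        # Skip the unexpanded pattern when no disk matches.
        [ -r "$ncq_file" ] || continue
        ncq_dev=${ncq_file#"$ncq_root"/}   # strip the leading directory
        ncq_dev=${ncq_dev%%/*}             # keep only the disk name
        printf '%s %s\n' "$ncq_dev" "$(cat "$ncq_file")"
    done
}

# Usage on a live system:
#   ncq_depths
# To turn NCQ off for one disk until the next reboot (as root):
#   echo 1 > /sys/block/sda/device/queue_depth
```

Changing the depth this way does not survive a reboot, which makes it a
low-risk way to test whether NCQ is involved before making anything
permanent.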

"there are drives with firmware bugs that deliberately lie about when data has 
been physically written."
http://serverfault.com/questions/460864/safety-of-write-cache-on-sata-drives-with-barriers
_____

"One little-known feature of NCQ is that the host can specify whether it
wants to be notified of completion when the data hits the disk's
platters or when it hits the disk's buffer (on-board cache)." (Does the
kernel do this correctly?)

"NCQ can negatively interfere with the operating system's I/O scheduler,
actually decreasing performance; this has been observed in practice on
Linux with RAID-5.  There is no mechanism in NCQ for the host to specify
any sort of deadlines for an I/O, like how many times a request can be
ignored in favor of others.  In theory, a NCQ-ed request can be delayed
by the drive an arbitrary amount of time while it is serving other
(possibly new) requests under I/O pressure.  Since the algorithms used
inside drive firmware for NCQ dispatch ordering are generally not
publicly known, this introduces another level of uncertainty for
hardware/firmware performance.  Tests at Google around 2008 have shown
that NCQ can delay an I/O for up to 1-2 seconds."

http://en.wikipedia.org/wiki/Native_Command_Queuing
_____

Test if NCQ is enabled: dmesg | grep -i ncq
Write-protect & cache status: dmesg | grep sda
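
To see which filesystems the kernel currently has mounted read-only, the
options column of /proc/mounts can be filtered.  A minimal sketch; the
function name ro_mounts is hypothetical.

```shell
# Hypothetical helper: read /proc/mounts-style lines on standard input
# and print "<device> <mount point>" for each entry whose options
# (4th field) include "ro" as a whole option.
ro_mounts() {
    awk '$4 ~ /(^|,)ro(,|$)/ { print $1, $2 }'
}

# Usage on a live system:
#   ro_mounts < /proc/mounts
```

Some pseudo-filesystems and image mounts are read-only by design, so
only entries for real disks or USB devices are interesting here.  Note
that "errors=remount-ro" in the options is a policy setting and is not
matched; only a filesystem that has actually gone read-only is listed.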
_____

Operational theory / Educational resources:

Modern disk write caches and how they get dealt with
http://utcc.utoronto.ca/~cks/space/blog/tech/ModernDiskWriteCaches

How to force a disk write cache flush operation on Linux
http://utcc.utoronto.ca/~cks/space/blog/linux/ForceDiskFlushes

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1063354

Title:
  [Dell Studio XPS 1640] Sudden Read-Only Filesystems

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1063354/+subscriptions
