Bug#833054: e2fsprogs: fsck.ext4 does not repair filesystem

2016-09-07 Thread Theodore Ts'o
On Wed, Sep 07, 2016 at 10:21:34PM +0100, ael wrote:
> Of course, I can just reformat the card, and I am confident that it will
> then work, but this looks like an opportunity to uncover a "feature"
> lurking somewhere. I imagine that you will ask for a dumpe2fs run next?

Actually, what I'd like is a compressed e2image dump of the file
system, and reproduction instructions so I can try to replicate the
problem in an environment where I can observe it.

See "man e2image" and look for the section for RAW IMAGE FILES for
more details.  This will allow you to send me just the metadata of the
file system (so I won't see any of the data blocks), but that should be
enough to replicate the problem.  It would leak out the directory
names to me, which might or might not be an issue that you would be
concerned about.

There is a "scramble" option which will scramble the filenames, but
that will screw up the htree directory structures, so I'd prefer not
having the filenames scrambled unless you feel strongly about the
filenames being privacy sensitive --- in which case an e2image with
scrambled filenames is better than nothing.

Cheers,

- Ted
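
A minimal sketch of how the metadata dump requested above might be
produced, following the RAW IMAGE FILES section of "man e2image". The
function names, the device name and the output file name are
placeholders chosen for illustration, not values from this report; the
file system should be unmounted when the dump is taken.

```shell
#!/bin/sh
# Sketch only: device and file names passed in are placeholders.

# Write a raw, metadata-only image of the file system on "$1" to
# stdout ("-") and compress it into "$2".
dump_metadata() {
    dev=$1    # partition holding the ext4 file system (unmounted)
    out=$2    # compressed image to send, e.g. sdhc.e2i.bz2
    e2image -r "$dev" - | bzip2 > "$out"
}

# Same dump, but with directory entries scrambled (-s). This hides
# the filenames at the cost of the htree directory structures.
dump_metadata_scrambled() {
    dev=$1
    out=$2
    e2image -rs "$dev" - | bzip2 > "$out"
}

# Usage (hypothetical device name):
#   dump_metadata /dev/mmcblk0p1 sdhc.e2i.bz2
```

The unscrambled form is the one asked for here; the scrambled variant
is only the fallback for privacy-sensitive filenames.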



Bug#833054: e2fsprogs: fsck.ext4 does not repair filesystem

2016-09-07 Thread ael
On Sun, Jul 31, 2016 at 03:14:28PM -0400, Theodore Ts'o wrote:
> 
> My suggestion at this point is to do an image copy of the entire
> contents of the sdhc card using either dd or ddrescue, to a known-good
> hard drive, and then run fsck.ext4 on that image copy of your
> file system.  I suspect the root cause of this is simple hardware
> failure of the sdhc card.  So doing an image copy is a good idea
> before you do anything else, since it reduces the risk of (further)
> data loss.

First my apologies for the long silence: I was away from home.

I have done as you suggested: I took an image of the sdhc card (on
another machine) and wrote that to a file (as it happens on a
usbstick). I then ran fsck.ext4 which claimed to correct a problem.

Next I loop-mounted that image. I also ran fsck.ext4 on the sdhc card
(on the other machine) which similarly claimed to fix a problem and then
also mounted that.

I then tested using rsync as:

rsync -na /mnt/testing/ /sdhc/
               ^          ^
               |          |
  loop mounted image   faulty card

That is, I asked rsync to do a dry run copy from the loop mounted image
to the card so that it would read the image filesystem and not write to
the card.

On a subset of files, I saw the warnings: Structure needs cleaning (117)
and /var/log/messages included 
   EXT4-fs error: 32 callbacks suppressed
Just to confirm, I used a simple cat of one of the problem files:-

# cat /mnt/testing/lost+found/#528000: Structure needs cleaning

So it seems that fsck.ext4 is unable to correct the problems even on the
image, although it claims to have done so.
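
A rough sketch of another way to enumerate the affected files on the
loop-mounted image, by trying to read every regular file and printing
the ones that fail. The function name find_unreadable is hypothetical;
/mnt/testing is the mount point used above.

```shell
#!/bin/sh
# Sketch only: lists regular files under "$1" whose contents cannot
# be read (for example with "Structure needs cleaning").
find_unreadable() {
    # -xdev keeps find on the one file system being examined.
    find "$1" -xdev -type f -exec sh -c '
        cat "$1" >/dev/null 2>&1 || printf "%s\n" "$1"
    ' _ {} \;
}

# Usage:
#   find_unreadable /mnt/testing > unreadable.txt
```

Entries that find itself cannot stat are reported by find on stderr
rather than in the list.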

It is possible that the original problem with the card was perhaps a
hardware fault (I have mentioned that it occasionally gets disturbed in
its slot), but whatever happened the current state of the file system
is confusing fsck.ext4. Or so it seems to me.

> 
> At this point, unless you can replicate the problem after copying the
> file system image off the sdhc to something more reliable (and in
> particular, cheap sd cards in the checkout counter line of "The Micro
> Center", or deep discount cards purchased from a direct-from-a-no-name-
> Shenzhen manufacturer don't count as reliable), I very much doubt it
> is a software bug, but rather a hardware issue.

So yes, I can replicate the problem after copying: I am satisfied that
the usbstick is fine, but I can do similar tests using other disk
hardware if you remain suspicious.

Of course, I can just reformat the card, and I am confident that it will
then work, but this looks like an opportunity to uncover a "feature"
lurking somewhere. I imagine that you will ask for a dumpe2fs run next?


Thanks for the help so far. 



Bug#833054: e2fsprogs: fsck.ext4 does not repair filesystem

2016-08-01 Thread ael
On Sun, Jul 31, 2016 at 03:14:28PM -0400, Theodore Ts'o wrote:
> On Sun, Jul 31, 2016 at 10:41:29AM +0100, ael wrote:
> > Package: e2fsprogs
> > Version: 1.43.1-1
> > Severity: normal
> > 
> At this point, unless you can replicate the problem after copying the
> file system image off the sdhc to something more reliable (and in
> particular, cheap sd cards in the checkout counter line of "The Micro
> Center", or deep discount cards purchased from a direct-from-a-no-name-
> Shenzhen manufacturer don't count as reliable), I very much doubt it
> is a software bug, but rather a hardware issue.

Thanks for the reply and analysis. I will follow your suggestions.

The sdhc card was not cheap and I checked it with f3
(http://oss.digirati.com.br/f3/) to check for a fake before
installation. I believe it is a genuine Kingston 32GB card. I always
check ssd technology cards or drives before trusting them these days.
But you were right to suggest this given how many dubious devices there
are around. Thanks again.

What I think is more likely is that the card occasionally gets
disturbed in its slot when the netbook gets carried around, so a
connector problem is the more probable hardware cause. I reseat
the card when I remember, to try to avoid that happening. I don't know
whether the device driver can defend against that sort of problem.

Thank you so much for all your work: I have admired it from a distance
for many years.

I will report further when I have followed your suggestions.



Bug#833054: e2fsprogs: fsck.ext4 does not repair filesystem

2016-07-31 Thread Theodore Ts'o
On Sun, Jul 31, 2016 at 10:41:29AM +0100, ael wrote:
> Package: e2fsprogs
> Version: 1.43.1-1
> Severity: normal
> 
> I have an ext4 filesystem (on an sdhc card) which has been working
> for several years. Yesterday, I became aware that the file system
> was damaged. Systemd seems to try to hide such details from users these
> days, so I am not sure when and why the problem occurred.
> 
> I ran fsck.ext4 which made extensive repairs, but eventually declared
> the system clean. But accessing some parts of the file system
> gives a (d?)message:
>   " Cannot stat: Structure needs cleaning."
>  *and* (I think) a kernel oops:

This was a kernel warning, not an oops.  However, it was preceded by
an EXT4-fs error, and unfortunately there were enough before that
point that apparently the rate limiters had kicked in:

> Jul 30 21:21:52 elf kernel: [  938.971363] EXT4-fs error: 2 callbacks suppressed
> Jul 30 21:23:22 elf kernel: [ 1028.614023] [ cut here ]
> Jul 30 21:23:22 elf kernel: [ 1028.614052] WARNING: CPU: 0 PID: 23014 at /build/linux-sI189k/linux-4.6.4/fs/inode.c:279 drop_nlink+0x3f/0x50
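
Because the rate limiter suppresses repeats, the earliest EXT4-fs
lines in the log tend to be the informative ones. A small sketch for
pulling them out of a saved kernel log; the function name ext4_errors
and the file name kernel.log are placeholders.

```shell
#!/bin/sh
# Sketch only: print EXT4-fs error and warning lines from the saved
# kernel log given as "$1", in the order they were logged.
ext4_errors() {
    grep -E 'EXT4-fs (error|warning)' "$1"
}

# Usage:
#   dmesg > kernel.log
#   ext4_errors kernel.log | head
```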

My suggestion at this point is to do an image copy of the entire
contents of the sdhc card using either dd or ddrescue, to a known-good
hard drive, and then run fsck.ext4 on that image copy of your
file system.  I suspect the root cause of this is simple hardware
failure of the sdhc card.  So doing an image copy is a good idea
before you do anything else, since it reduces the risk of (further)
data loss.
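
A minimal sketch of that procedure, assuming the card is unmounted.
The function name image_and_check and the device and image paths are
placeholders, and the final read-only pass is an addition here to
verify the result, not something stated above.

```shell
#!/bin/sh
# Sketch only: device and image paths passed in are placeholders.
image_and_check() {
    dev=$1   # block device of the sdhc card, unmounted
    img=$2   # image file on a known-good hard drive

    # Plain image copy. If the card returns read errors,
    #   ddrescue "$dev" "$img" "$img.map"
    # is the alternative.
    dd if="$dev" of="$img" bs=4M || return 1

    # Repair the copy, not the card. -f forces a full check.
    e2fsck -f "$img"

    # Read-only second pass: exit status 0 means e2fsck found
    # nothing further to fix in the image.
    e2fsck -fn "$img"
}

# Usage (hypothetical names):
#   image_and_check /dev/mmcblk0p1 /srv/images/sdhc.img
```

If problems still show up when the repaired image is loop-mounted, the
card itself is no longer a factor in the failure.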

> Running fsck.ext4 after such problems claims to clear the filesystem
> again, but the problems remain.

At this point, unless you can replicate the problem after copying the
file system image off the sdhc to something more reliable (and in
particular, cheap sd cards in the checkout counter line of "The Micro
Center", or deep discount cards purchased from a direct-from-a-no-name-
Shenzhen manufacturer don't count as reliable), I very much doubt it
is a software bug, but rather a hardware issue.

Cheers,

- Ted



Bug#833054: e2fsprogs: fsck.ext4 does not repair filesystem

2016-07-31 Thread ael
Package: e2fsprogs
Version: 1.43.1-1
Severity: normal

I have an ext4 filesystem (on an sdhc card) which has been working
for several years. Yesterday, I became aware that the file system
was damaged. Systemd seems to try to hide such details from users these
days, so I am not sure when and why the problem occurred.

I ran fsck.ext4 which made extensive repairs, but eventually declared
the system clean. But accessing some parts of the file system
gives a (d?)message:
  " Cannot stat: Structure needs cleaning."
 *and* (I think) a kernel oops:

--
Jul 30 21:21:52 elf kernel: [  938.971363] EXT4-fs error: 2 callbacks suppressed
Jul 30 21:23:22 elf kernel: [ 1028.614023] [ cut here ]
Jul 30 21:23:22 elf kernel: [ 1028.614052] WARNING: CPU: 0 PID: 23014 at /build/linux-sI189k/linux-4.6.4/fs/inode.c:279 drop_nlink+0x3f/0x50
Jul 30 21:23:22 elf kernel: [ 1028.614060] Modules linked in: cpuid(E) cbc(E) 
dm_crypt(E) dm_mod(E) uas(E) usb_storage(E) nls_utf8(E) nls_cp437(E) vfat(E) 
fat(E) xt_recent(E) iptable_nat(E) nf_nat_ipv4(E) xt_comment(E) ipt_REJECT(E) 
nf_reject_ipv4(E) xt_addrtype(E) xt_mark(E) iptable_mangle(E) xt_tcpudp(E) 
xt_CT(E) iptable_raw(E) xt_multiport(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) 
xt_conntrack(E) xt_NFLOG(E) nfnetlink_log(E) xt_LOG(E) nf_log_ipv4(E) 
nf_log_common(E) nf_nat_tftp(E) nf_nat_snmp_basic(E) nf_conntrack_snmp(E) 
nf_nat_sip(E) nf_nat_pptp(E) nf_nat_proto_gre(E) nf_nat_irc(E) nf_nat_h323(E) 
nf_nat_ftp(E) nf_nat_amanda(E) ts_kmp(E) nf_conntrack_amanda(E) nf_nat(E) 
nf_conntrack_sane(E) nf_conntrack_tftp(E) nf_conntrack_sip(E) 
nf_conntrack_proto_udplite(E) nf_conntrack_proto_sctp(E) nf_conntrack_pptp(E) 
nf_conntrack_proto_gre(E) nf_conntrack_netlink(E) nfnetlink(E) 
nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_irc(E) 
nf_conntrack_h323(E) nf_conntrack_ftp(E) 
nf_conntrack(E) iptable_filter(E) ip_tables(E) x_tables(E) snd_hrtimer(E) 
snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) 
snd_seq_device(E) cpufreq_powersave(E) cpufreq_userspace(E) cpufreq_stats(E) 
cpufreq_conservative(E) iTCO_wdt(E) iTCO_vendor_support(E) sparse_keymap(E) 
arc4(E) acerhdf(E) uvcvideo(E) videobuf2_vmalloc(E) videobuf2_memops(E) 
videobuf2_v4l2(E) videobuf2_core(E) coretemp(E) videodev(E) media(E) joydev(E) 
evdev(E) serio_raw(E) pcspkr(E) sg(E) ath5k(E) ath(E) snd_hda_codec_realtek(E) 
snd_hda_codec_generic(E) i2c_i801(E) lpc_ich(E) mfd_core(E) rng_core(E) 
mac80211(E) i915(E) snd_hda_intel(E) snd_hda_codec(E) jmb38x_ms(E) 
snd_hda_core(E) snd_hwdep(E) snd_pcm_oss(E) memstick(E) cfg80211(E) 
snd_mixer_oss(E) rfkill(E) drm_kms_helper(E) snd_pcm(E) snd_timer(E) drm(E) 
snd(E) soundcore(E) i2c_algo_bit(E) shpchp(E) video(E) wmi(E) ac(E) battery(E) 
acpi_cpufreq(E) tpm_tis(E) tpm(E) button(E) processor(E) nfsd(E) auth_rpcgss(E) 
nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
loop(E) autofs4(E) ext4(E) ecb(E) xts(E) lrw(E) gf128mul(E) 
ablk_helper(E) cryptd(E) aes_i586(E) crc16(E) jbd2(E) crc32c_generic(E) 
mbcache(E) mmc_block(E) sd_mod(E) ata_generic(E) psmouse(E) ata_piix(E) 
libata(E) scsi_mod(E) ehci_pci(E) sdhci_pci(E) sdhci(E) mmc_core(E) r8169(E) 
mii(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) usb_common(E)
Jul 30 21:23:22 elf kernel: [ 1028.614449] CPU: 0 PID: 23014 Comm: offlineimap Tainted: GE   4.6.0-1-686-pae #1 Debian 4.6.4-1
Jul 30 21:23:22 elf kernel: [ 1028.614458] Hardware name: Acer AOA110/, BIOS v0.3309 10/06/2008
Jul 30 21:23:22 elf kernel: [ 1028.614466]  00200286 164f904c f4195de4 c12dbeac 
  c106cc4a c1660208
Jul 30 21:23:22 elf kernel: [ 1028.614486]   59e6 c167a04c 0117 
c11e513f 0117 0009 c11e513f
Jul 30 21:23:22 elf kernel: [ 1028.614507]  c3d2b8d8 0001  f4195df8 
c106cd5a 0009  
Jul 30 21:23:22 elf kernel: [ 1028.614527] Call Trace:
Jul 30 21:23:22 elf kernel: [ 1028.614549]  [] ? dump_stack+0x55/0x79
Jul 30 21:23:22 elf kernel: [ 1028.614564]  [] ? __warn+0xea/0x110
Jul 30 21:23:22 elf kernel: [ 1028.614578]  [] ? drop_nlink+0x3f/0x50
Jul 30 21:23:22 elf kernel: [ 1028.614591]  [] ? drop_nlink+0x3f/0x50
Jul 30 21:23:22 elf kernel: [ 1028.614604]  [] ? warn_slowpath_null+0x2a/0x30
Jul 30 21:23:22 elf kernel: [ 1028.614617]  [] ? drop_nlink+0x3f/0x50
Jul 30 21:23:22 elf kernel: [ 1028.614660]  [] ? ext4_rename2+0x65a/0xcc0 [ext4]
Jul 30 21:23:22 elf kernel: [ 1028.614705]  [] ? ext4_tmpfile+0x1c0/0x1c0 [ext4]
Jul 30 21:23:22 elf kernel: [ 1028.614721]  [] ? vfs_rename+0x5a0/0x930
Jul 30 21:23:22 elf kernel: [ 1028.614762]  [] ? ext4_tmpfile+0x1c0/0x1c0 [ext4]
Jul 30 21:23:22 elf kernel: [ 1028.614777]  [] ? SyS_rename+0x366/0x380
Jul 30 21:23:22 elf kernel: [ 1028.614794]  [] ? do_fast_syscall_32+0x8d/0x140
Jul 30 21:23:22 elf kernel: [ 1028.614806]  [] ? sysenter_past_esp+0x47/0x75
Jul 30 21:23:22 elf kernel: [ 1028.614818] ---[ end trace