On Tuesday, April 1, 2014 12:07:29 AM UTC-4, Gregg wrote:
>
> The long and the short of it is that most likely you have a failing
> disk or controller/connector more than anything. I used to run an
> 8-disk pool of 4 mirrored pairs on a small box without good airflow
> and slow SATA-150 controllers that were supported by Solaris 10. I
> ended up replacing the whole system with a new large box with 140mm
> fans as well as SATA-300 controllers to get better cooling. Over time,
> every disk has failed because of heat issues. Many of my SATA cables
> failed too. They were cheap junk.
>
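A quick aside for anyone else weighing the failing-disk/cable theory: SMART counters can separate the two. A minimal sketch, assuming smartmontools is installed, the disks are directly visible, and hypothetical FreeBSD-style ada0..ada3 device names (with my file-backed VM setup this would have to run on the host instead):

    # Attribute 199 (UDMA_CRC_Error_Count) climbs with bad cables/connectors,
    # while Reallocated/Pending sector counts implicate the disk itself.
    for d in ada0 ada1 ada2 ada3; do
        echo "== /dev/$d =="
        smartctl -A /dev/$d | grep -E 'CRC|Reallocated|Pending'
    done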
I have my HDDs at a steady 40 degrees C or below. I thought about replacing the SATA cables, but two of the drives are on new cables and the rest are on old ones, and given the checksum errors I'm seeing across all of them, it would mean every cable needs replacing, which I don't believe could be the case in this build. A failing disk controller on all four drives that were barely used? I have higher confidence in HDD production than that. I feel certain it's something else, but thank you for your input; I'll keep it as a consideration if all else fails. I'm running this all through a VM, which is where I believe the issue could be, but we need to figure out why, and how to work around it if that's the case.

> Equipment has to be selected carefully. I have not seen any failing bits
> for the 3+ years now that I have been running on the new hardware, with
> all of the disks replaced 2 years ago, so I have made no changes for the
> past 2 years. All is good for me with ZFS and non-ECC RAM.

That's very good to hear. I'm still trying to gather more data, but I'm getting closer to finding an answer. It seems to point somewhere in the memory realm.

> If I build another system, I will build it with ECC RAM and will get new
> controllers and new cables, just because.
>
> My current choice is to use ZFS on Linux, because I haven't had a disk
> array/container that I could hook up to the Macs in the house.
>
> My new ZFS array might end up being Mac Pro based, with some of the
> Thunderbolt-based disk carriers.
>
> I have about 8TB of stuff that I need to be able to keep safe.
>
> Amazon Glacier is on my radar. At some point I may just get a 4TB USB 3.0
> drive to copy stuff to and ship off to Glacier.
>
> Gregg
>
> On 3/31/2014 9:41 PM, Eric Jaw wrote:
>>
>> On Monday, March 31, 2014 5:55:21 PM UTC-4, Daniel Becker wrote:
>>>
>>> On Mar 31, 2014, at 2:23 PM, Eric Jaw <nais...@gmail.com> wrote:
>>>>
>>>> Doing a scrub is just obliterating my pool.
>>>
>>> Is it? I don't think so:
>>
>> Thanks for the response! Here's some more details on the setup:
>> https://forums.virtualbox.org/viewtopic.php?f=6&t=60975
>>
>> I started using ZFS a few weeks ago, so a lot of it is still new to me.
>> I'm actually not completely certain about the "proper procedure" for
>> repairing a pool. I'm not sure if I'm supposed to clear the errors before
>> or after the scrub (little things), or if it even matters. When I
>> restarted the VM, the checksum counts cleared on their own.
>>
>> I wasn't expecting to run into any issues, but I drew part of my
>> conclusion from the high number of checksum errors, which never appeared
>> until I started reading from the dataset; the count climbed into the tens
>> when I scrubbed the pool, almost doubling when I scrubbed a second time.
>>
>>>> scan: scrub in progress since Mon Mar 31 10:14:52 2014
>>>>     1.83T scanned out of 2.43T at 75.2M/s, 2h17m to go
>>>>     *0 repaired*, 75.55% done
>>>
>>> Note the "0 repaired."
>>
>> On the first scrub it repaired roughly 1.65MB. None on the second scrub.
>> Even after the scrub there were still 43 data errors; I was expecting
>> them to go away.
>>
>>     errors: 43 data errors, use '-v' for a list
>>
>>>> I'm also running ZFS on FreeBSD 10.0 (RELEASE) in VirtualBox on
>>>> Windows 7 Ultimate.
>>>
>>> Are the disks that the VM sees file-backed or passed-through raw disks?
>>
>> This is an excellent question. They're in 'Normal' mode. I remember
>> looking into this before and deciding that Normal mode should be fine. I
>> might be wrong, though, so thanks for bringing this up; I'll have to
>> check it out again.
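Stepping out of the quote for a second: if 'Normal' mode (a file-backed image) turns out to be the problem, the passed-through raw-disk setup Daniel is asking about would look something like this on the Windows host. The VM name, controller name, .vmdk path, and PhysicalDrive number below are all made up for illustration:

    VBoxManage internalcommands createrawvmdk -filename C:\VMs\zfs-disk1.vmdk -rawdisk \\.\PhysicalDrive1
    VBoxManage storageattach "FreeBSD10" --storagectl "SATA" --port 1 --device 0 --type hdd --medium C:\VMs\zfs-disk1.vmdk

With raw disks the guest reads and writes the physical drive directly, which would take the image-file layer out of the checksum-error equation.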
>>>> Things seem to be pointing to non-ECC RAM causing checksum errors. It
>>>> looks like I'll have to swap out my memory for ECC RAM if I want to
>>>> continue this project; otherwise the data is pretty much hosed right
>>>> now.
>>>
>>> Did you actually run a memory tester (e.g., memtest86), or is this just
>>> based on gut feeling? Lots of things can manifest as checksum errors.
>>> If you import the pool read-only, do successive scrubs find errors in
>>> different files (use "zpool status -v") every time, or are they always
>>> in the same files? The former would indeed point to some kind of memory
>>> corruption issue, while in the latter case it's much more likely that
>>> your on-disk data somehow got corrupted.
>>
>> memtest86 and memtest86+ ran for 18 hours and came out okay. I'm on my
>> third scrub, and the number of errors has remained at 43. Checksum errors
>> continue to pile up as the pool is getting scrubbed.
>>
>> I'm just as flustered about this. Thanks again for the input.
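One more thing, mostly as a note to self while I work out the "proper procedure" question above. The basic scrub/repair cycle, as I understand it so far (pool name "tank" is a stand-in for mine):

    # Start a scrub and watch it run; -v lists the files affected by data errors.
    zpool scrub tank
    zpool status -v tank

    # Daniel's read-only import, to rule out anything writing to the pool
    # while testing:
    zpool export tank
    zpool import -o readonly=on tank

    # Once the real cause is fixed (RAM, cables, controller), reset the error
    # counters. Clearing only zeroes the counts; it doesn't repair anything.
    zpool clear tank

If I've got any of that backwards, corrections welcome.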