Re: misc panics

2020-12-31 Thread Bastien Durel
On Monday 28 December 2020 at 12:34 +0200, Gregory Edigarov wrote:
> On 12/28/20 12:18 PM, rgc wrote:
> > On Mon, Dec 28, 2020 at 10:39:56AM +0100, Otto Moerbeek wrote:
> > > On Mon, Dec 28, 2020 at 10:25:08AM +0100, Bastien Durel wrote:
> > > 
> > > > On Monday 28 December 2020 at 09:17 +, Stuart Henderson wrote:
> > > > > > So hardware failure confirmed :/ Do you think I can change
> > > > > > the RAM
> > > > > > or
> > > > > > it's more likely a CPU/Chipset failure ?
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > If you have multiple sticks of RAM, try removing some.
> > > > I have only one
> > > trying to reseat it is worth a try.
> > > 
> > > -Otto
> > > 
> > or doing the eraser magick
> > 
> > you clean the contacts (remove oxidation) of the RAM module (the
> > side that
> > sticks in the motherboard) by rubbing a pencil eraser on the
> > contacts of the
> > RAM module.
> > 
> in my experience, all RAM modules nowadays come gold-plated, so there is
> no need to use an eraser on them.
> just a piece of paper, to make sure there is no grease on the contacts
> 
Hello,

For those interested, a new memory module got the box running
again. (Neither cleaning nor using an eraser worked.)

Thanks,

-- 
Bastien



Re: misc panics

2020-12-28 Thread Gregory Edigarov



On 12/28/20 12:18 PM, rgc wrote:
> On Mon, Dec 28, 2020 at 10:39:56AM +0100, Otto Moerbeek wrote:
>> On Mon, Dec 28, 2020 at 10:25:08AM +0100, Bastien Durel wrote:
>>
>>> On Monday 28 December 2020 at 09:17 +, Stuart Henderson wrote:
>>>>> So hardware failure confirmed :/ Do you think I can change the RAM
>>>>> or
>>>>> it's more likely a CPU/Chipset failure ?
>>>>>
>>>>> Thanks,
>>>>>
>>>> If you have multiple sticks of RAM, try removing some.
>>> I have only one
>> trying to reseat it is worth a try.
>>
>>  -Otto
>>
> or doing the eraser magick
>
> you clean the contacts (remove oxidation) of the RAM module (the side that
> sticks in the motherboard) by rubbing a pencil eraser on the contacts of the
> RAM module.
>
in my experience, all RAM modules nowadays come gold-plated, so there is no
need to use an eraser on them.
just a piece of paper, to make sure there is no grease on the contacts



Re: misc panics

2020-12-28 Thread rgc
On Mon, Dec 28, 2020 at 10:39:56AM +0100, Otto Moerbeek wrote:
> On Mon, Dec 28, 2020 at 10:25:08AM +0100, Bastien Durel wrote:
> 
> > On Monday 28 December 2020 at 09:17 +, Stuart Henderson wrote:
> > > > So hardware failure confirmed :/ Do you think I can change the RAM
> > > > or
> > > > it's more likely a CPU/Chipset failure ?
> > > > 
> > > > Thanks,
> > > > 
> > > 
> > > If you have multiple sticks of RAM, try removing some.
> > I have only one
> 
> trying to reseat it is worth a try.
> 
>   -Otto
> 

or doing the eraser magick

you clean the contacts (remove oxidation) of the RAM module (the side that
sticks in the motherboard) by rubbing a pencil eraser on the contacts of the
RAM module.

- rgc



Re: misc panics

2020-12-28 Thread Otto Moerbeek
On Mon, Dec 28, 2020 at 10:25:08AM +0100, Bastien Durel wrote:

> On Monday 28 December 2020 at 09:17 +, Stuart Henderson wrote:
> > > So hardware failure confirmed :/ Do you think I can change the RAM
> > > or
> > > it's more likely a CPU/Chipset failure ?
> > > 
> > > Thanks,
> > > 
> > 
> > If you have multiple sticks of RAM, try removing some.
> I have only one

trying to reseat it is worth a try.

-Otto



Re: misc panics

2020-12-28 Thread Bastien Durel
On Monday 28 December 2020 at 09:17 +, Stuart Henderson wrote:
> > So hardware failure confirmed :/ Do you think I can change the RAM
> > or
> > it's more likely a CPU/Chipset failure ?
> > 
> > Thanks,
> > 
> 
> If you have multiple sticks of RAM, try removing some.
I have only one

-- 
Bastien



Re: misc panics

2020-12-28 Thread Stuart Henderson
On 2020-12-28, Bastien Durel wrote:
> On Monday 28 December 2020 at 09:23 +1000, Stuart Longland wrote:
>> On 28/12/20 3:56 am, Bastien Durel wrote:
>> > After that I got a (maybe) endless loop of panics inducing panics
>> > (I did not get the output, it was cycling fast), and after that the
>> > /bsd file was left empty:
>> > 
>> > > > > OpenBSD/amd64 BOOT 3.52
>> > > boot> NOTE: random seed is being reused.
>> > > booting hd0a:/bsd: read header
>> > >  failed(0). will try /bsd
>> …
>> > How can I figure out the cause of all these problems ?
>> 
>> Seems awfully strange for `/bsd` to become zero-length out-of-the-blue.
>> Got a `memtest86` disk handy?
>> 
>> I'd be checking:
>> - RAM
>> - disks
>> - CPU
>> 
>> I think from the `dmesg` the storage device is a SSD?  Could it be it
>> has failed early?  Some do that, and they give practically no warning
>> when they do.
>
> SMART is OK on the disk
>
> I ran a memtest86 test, and got thousands of errors
>
>
> Test Start Time         2020-12-28 08:38:08
> Elapsed Time            0:01:11
> Memory Range Tested     0x0 - 16F00 (5872MB)
> CPU Selection Mode      Parallel (All CPUs)
> ECC Polling             Enabled
>
> Lowest Error Address    0x12AA18018 (4778MB)
> Highest Error Address   0x12BFE7FF8 (4799MB)
> Bits in Error Mask      FF00
> Bits in Error           8
> Max Contiguous Errors   1
>
>
> Test                                         # Tests Passed   Errors
> Test 0 [Address test, walking ones, 1 CPU]   1/1 (100%)       0
> Test 1 [Address test, own address, 1 CPU]    0/0 (0%)         10988
>
>
> Last 10 Errors
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7FF8,
> Expected: 00012BFE7FF8, Actual: 10012BFE7FF8
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7FE8,
> Expected: 00012BFE7FE8, Actual: 04012BFE7FE8
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7F58,
> Expected: 00012BFE7F58, Actual: 04012BFE7F58
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7F48,
> Expected: 00012BFE7F48, Actual: 08012BFE7F48
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7EF8,
> Expected: 00012BFE7EF8, Actual: 40012BFE7EF8
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7EE8,
> Expected: 00012BFE7EE8, Actual: C0012BFE7EE8
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7EC8,
> Expected: 00012BFE7EC8, Actual: 04012BFE7EC8
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7E58,
> Expected: 00012BFE7E58, Actual: 40012BFE7E58
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7D58,
> Expected: 00012BFE7D58, Actual: 08012BFE7D58
> 2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7D48,
> Expected: 00012BFE7D48, Actual: 08012BFE7D48
>
>
> So hardware failure confirmed :/ Do you think I can change the RAM or
> it's more likely a CPU/Chipset failure ?
>
> Thanks,
>

If you have multiple sticks of RAM, try removing some.




Re: misc panics

2020-12-28 Thread Bastien Durel
On Monday 28 December 2020 at 09:23 +1000, Stuart Longland wrote:
> On 28/12/20 3:56 am, Bastien Durel wrote:
> > After that I got a (maybe) endless loop of panics inducing panics
> > (I did not get the output, it was cycling fast), and after that the
> > /bsd file was left empty:
> > 
> > > > > OpenBSD/amd64 BOOT 3.52
> > > boot> NOTE: random seed is being reused.
> > > booting hd0a:/bsd: read header
> > >  failed(0). will try /bsd
> …
> > How can I figure out the cause of all these problems ?
> 
> Seems awfully strange for `/bsd` to become zero-length out-of-the-blue.
> Got a `memtest86` disk handy?
> 
> I'd be checking:
> - RAM
> - disks
> - CPU
> 
> I think from the `dmesg` the storage device is a SSD?  Could it be it
> has failed early?  Some do that, and they give practically no warning
> when they do.

SMART is OK on the disk

I ran a memtest86 test, and got thousands of errors


Test Start Time         2020-12-28 08:38:08
Elapsed Time            0:01:11
Memory Range Tested     0x0 - 16F00 (5872MB)
CPU Selection Mode      Parallel (All CPUs)
ECC Polling             Enabled

Lowest Error Address    0x12AA18018 (4778MB)
Highest Error Address   0x12BFE7FF8 (4799MB)
Bits in Error Mask      FF00
Bits in Error           8
Max Contiguous Errors   1
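
As a quick sanity check on the summary above, the two error addresses can be converted to megabytes by dividing by 2^20. This is only an illustrative sketch: the helper name `to_mib` is made up here, the addresses are copied from the report, and it assumes memtest86's "MB" means 2^20 bytes.

```python
# Sanity-check the address-to-megabyte figures in the memtest86 summary.
# `to_mib` is a hypothetical helper, not part of memtest86.

MIB = 1 << 20  # assumption: the report's "MB" is 2**20 bytes


def to_mib(address: int) -> int:
    """Return the number of whole MiB below a physical address."""
    return address // MIB


lowest = 0x12AA18018   # "Lowest Error Address" from the report
highest = 0x12BFE7FF8  # "Highest Error Address" from the report

print(to_mib(lowest))   # 4778
print(to_mib(highest))  # 4799

# Width of the window between the first and last failing address, in MiB.
print((highest - lowest) / MIB)
```

Both results match the figures in the report. The window between the lowest and highest failing address is roughly 22 MB wide, so the reported errors sit in one small region of the tested range.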



Test                                         # Tests Passed   Errors
Test 0 [Address test, walking ones, 1 CPU]   1/1 (100%)       0
Test 1 [Address test, own address, 1 CPU]    0/0 (0%)         10988


Last 10 Errors
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7FF8,
Expected: 00012BFE7FF8, Actual: 10012BFE7FF8
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7FE8,
Expected: 00012BFE7FE8, Actual: 04012BFE7FE8
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7F58,
Expected: 00012BFE7F58, Actual: 04012BFE7F58
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7F48,
Expected: 00012BFE7F48, Actual: 08012BFE7F48
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7EF8,
Expected: 00012BFE7EF8, Actual: 40012BFE7EF8
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7EE8,
Expected: 00012BFE7EE8, Actual: C0012BFE7EE8
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7EC8,
Expected: 00012BFE7EC8, Actual: 04012BFE7EC8
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7E58,
Expected: 00012BFE7E58, Actual: 40012BFE7E58
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7D58,
Expected: 00012BFE7D58, Actual: 08012BFE7D58
2020-12-28 08:39:19 - [Data Error] Test: 1, CPU: 0, Address: 12BFE7D48,
Expected: 00012BFE7D48, Actual: 08012BFE7D48
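
One way to see which bits are flipping is to XOR each Expected/Actual pair from the list above. This is a rough sketch, not memtest86's own logic: `flipped_bits` is a hypothetical helper, and the values are used exactly as they appear in this archive. They may have been shortened in archiving, so the absolute bit positions might not match the real memory words.

```python
# XOR each (expected, actual) pair from "Last 10 Errors" to find flipped bits.
# Values are copied as printed in this archive; `flipped_bits` is a
# hypothetical helper, not part of memtest86.

PAIRS = [
    (0x00012BFE7FF8, 0x10012BFE7FF8),
    (0x00012BFE7FE8, 0x04012BFE7FE8),
    (0x00012BFE7F58, 0x04012BFE7F58),
    (0x00012BFE7F48, 0x08012BFE7F48),
    (0x00012BFE7EF8, 0x40012BFE7EF8),
    (0x00012BFE7EE8, 0xC0012BFE7EE8),
    (0x00012BFE7EC8, 0x04012BFE7EC8),
    (0x00012BFE7E58, 0x40012BFE7E58),
    (0x00012BFE7D58, 0x08012BFE7D58),
    (0x00012BFE7D48, 0x08012BFE7D48),
]


def flipped_bits(expected: int, actual: int) -> int:
    """Return a mask with a 1 for every bit that differs."""
    return expected ^ actual


# OR the per-error masks together to get every bit seen in error.
combined = 0
for expected, actual in PAIRS:
    combined |= flipped_bits(expected, actual)

print(f"{combined:012X}")  # DC0000000000
```

In these ten samples every differing bit falls in the top byte of the printed value, and the low-order bits always match. That fits the report's error mask covering a single byte, although ten samples only show five of the eight bits.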


So hardware failure confirmed :/ Do you think I can just change the RAM, or
is a CPU/chipset failure more likely?

Thanks,

-- 
Bastien Durel





Re: misc panics

2020-12-27 Thread Stuart Henderson
On 2020-12-27, Stuart Longland wrote:
> Seems awfully strange for `/bsd` to become zero-length out-of-the-blue. 

Not if it crashed at a bad point in "reorder_kernel".

I would try GENERIC instead of GENERIC.MP to see if there's any change.




Re: misc panics

2020-12-27 Thread Stuart Longland

On 28/12/20 3:56 am, Bastien Durel wrote:
> After that I got a (maybe) endless loop of panics inducing panics (I did
> not get the output, it was cycling fast), and after that the /bsd file
> was left empty:
>
> > > > OpenBSD/amd64 BOOT 3.52
> > boot> NOTE: random seed is being reused.
> > booting hd0a:/bsd: read header
> >  failed(0). will try /bsd

…

> How can I figure out the cause of all these problems?


Seems awfully strange for `/bsd` to become zero-length out-of-the-blue. 
 Got a `memtest86` disk handy?


I'd be checking:
- RAM
- disks
- CPU

I think from the `dmesg` the storage device is a SSD?  Could it be it 
has failed early?  Some do that, and they give practically no warning 
when they do.

--
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.