[zfs-discuss] horrible slow pool

2012-10-11 Thread Carsten John
Hello everybody,

I just wanted to share my experience with a (partially) broken SSD that was in 
use in a ZIL mirror.

We experienced a dramatic performance problem with one of our zpools, serving 
home directories. Mainly NFS clients were affected. Our SunRay infrastructure 
came to a complete halt.

Finally we were able to identify one SSD as the root cause. The SSD was still 
working, but quite slow.

The issue didn't trigger ZFS to mark the disk as faulty, and FMA didn't detect 
it either.

We identified the broken disk by issuing 'iostat -en'. After replacing the SSD, 
everything went back to normal.
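
For those who haven't used it: the interesting part is the per-device error 
counters at the end of each line. The output looks roughly like this (device 
names and numbers here are made up for illustration):

   $ iostat -en
     ---- errors ---
     s/w h/w trn tot device
       0   0   0   0 c1t0d0
       0   3  17  20 c4t2d0    <- a steadily growing 'tot' column is the disk to look at

A single device with a climbing total error count is a good hint, even while 
'zpool status' still reports the pool as healthy.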

To prevent outages like this in the future I hacked together a quick and 
dirty bash script to detect disks with a given rate of total errors. The 
script might be used in conjunction with nagios.

Perhaps it's of use for others as well:

###
#!/bin/bash
# check disks in all pools for errors.
# partially failing (or slow) disks
# may result in horribly degraded
# performance of zpools despite the fact
# that the pool is still reported as healthy

# exit codes
# 0 OK
# 1 WARNING
# 2 CRITICAL
# 3 UNKNOWN

OUTPUT=""
WARNING=0
CRITICAL=0
SOFTLIMIT=5
HARDLIMIT=20

LIST=$(zpool status | grep 'c[1-9].*d0' | awk '{print $1}')
for DISK in $LIST 
do  
ERROR=$(iostat -enr "$DISK" | cut -d , -f 4 | grep '^[0-9]')
# check the hard limit first so each disk is only reported once
if [[ $ERROR -gt $HARDLIMIT ]]
then
OUTPUT="$OUTPUT, $DISK:$ERROR"
CRITICAL=1
elif [[ $ERROR -gt $SOFTLIMIT ]]
then
OUTPUT="$OUTPUT, $DISK:$ERROR"
WARNING=1
fi
done

if [[ $CRITICAL -gt 0 ]]
then
echo "CRITICAL: Disks with error count > $HARDLIMIT found:$OUTPUT"
exit 2
fi
if [[ $WARNING -gt 0 ]]
then
echo "WARNING: Disks with error count > $SOFTLIMIT found:$OUTPUT"
exit 1
fi

echo OK: No significant disk errors found
exit 0

###
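
If you run it via NRPE, the hook-up is the usual one (path and command name 
below are just examples, adjust to your setup):

   # nrpe.cfg on the file server
   command[check_zpool_disk_errors]=/opt/local/libexec/check_zpool_disk_errors.sh

   # quick test from the nagios host
   check_nrpe -H fileserver -c check_zpool_disk_errors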



cu

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Sol11 time-slider / snapshot not starting [again]

2012-09-11 Thread Carsten John
Hello everybody,

my time-slider service on a Sol11 machine died. I already uninstalled/reinstalled 
the time-slider package, restarted the manifest-import service etc., but no 
success.

/var/svc/log/application-time-slider:default.log:



--snip--


[ Sep 11 12:40:04 Enabled. ]
[ Sep 11 12:40:04 Executing start method (/lib/svc/method/time-slider start). ]
Traceback (most recent call last):
  File "/usr/lib/time-sliderd", line 10, in <module>
    main(abspath(__file__))
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 941, in main
    snapshot = SnapshotManager(systemBus)
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 83, in __init__
    self.refresh()
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 188, in refresh
    self._rebuild_schedules()
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 285, in _rebuild_schedules
    "Details:\n" + str(message)
RuntimeError: Error reading SMF schedule instances
Details:
['/usr/bin/svcs', '-H', '-o', 'state', 'svc:/system/filesystem/zfs/auto-snapshot:monthly'] failed with exit code 1
svcs: Pattern 'svc:/system/filesystem/zfs/auto-snapshot:monthly' doesn't match any instances

Time Slider failed to start: error 95
[ Sep 11 12:40:06 Method start exited with status 95. ]

--snip--





Any suggestions?


thx

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sol11 time-slider / snapshot not starting [SOLVED]

2012-09-11 Thread Carsten John
-Original message-
To: zfs-discuss@opensolaris.org; 
From:   Carsten John cj...@mpi-bremen.de
Sent:   Tue 11-09-2012 13:08
Subject:[zfs-discuss] Sol11 time-slider / snapshot not starting [again]
 Hello everybody,
 
 my time-slider service on a Sol11 machine died. I already 
 uninstalled/reinstalled 
 the time-slider package, restarted the manifest-import service etc., but no 
 success.
 
 /var/svc/log/application-time-slider:default.log:


Finally I was able to fix it:

- uninstall time-slider
- restart manifest-import service
- install time-slider
- restart manifest-import service
- enable time-slider service
- enable snapshot services


I have no clue why it has to be done in exactly this order, but finally I 
succeeded.
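
For reference, the sequence boiled down to roughly these commands (package and 
service names as they appear on my Solaris 11 box):

   pkg uninstall time-slider
   svcadm restart svc:/system/manifest-import:default
   pkg install time-slider
   svcadm restart svc:/system/manifest-import:default
   svcadm enable svc:/application/time-slider:default
   svcadm enable svc:/system/filesystem/zfs/auto-snapshot:hourly
   svcadm enable svc:/system/filesystem/zfs/auto-snapshot:daily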


cu

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sol11 missing snapshot facility [solved]

2012-07-06 Thread Carsten John
-Original message-
To: Carsten John cj...@mpi-bremen.de; 
CC: zfs-discuss@opensolaris.org; 
From:   Ian Collins i...@ianshome.com
Sent:   Thu 05-07-2012 21:40
Subject:Re: [zfs-discuss] Sol11 missing snapshot facility
 On 07/ 5/12 11:32 PM, Carsten John wrote:
  -Original message-
  To: Carsten Johncj...@mpi-bremen.de;
  CC: zfs-discuss@opensolaris.org;
  From:   Ian Collinsi...@ianshome.com
  Sent:   Thu 05-07-2012 11:35
  Subject:Re: [zfs-discuss] Sol11 missing snapshot facility
  On 07/ 5/12 09:25 PM, Carsten John wrote:
 
  Hi Ian,
 
  yes, I already checked that:
 
  svcs -a | grep zfs
  disabled   11:50:39 svc:/application/time-slider/plugin:zfs-send
 
  is the only service I get listed.
 
  Odd.
 
  How did you install?
 
  Is the manifest there
  (/lib/svc/manifest/system/filesystem/auto-snapshot.xml)?
 
  Hi Ian,
 
  I installed from CD/DVD, but it might have been in a rush, as I needed to 
 replace a broken machine as quick as possible.
 
  The manifest is there:
 
 
  ls /lib/svc/manifest/system/filesystem/
  .  .. auto-snapshot.xml  autofs.xml 
 local-fs.xml   minimal-fs.xml rmvolmgr.xml   root-fs.xml
 ufs-quota.xml  usr-fs.xml
 
 
 Running svcadm restart manifest-import should load it, or give you 
 some idea why it won't load.
 
 -- 
 Ian.
 
 

Hi Ian,

it did the trick, but I had to uninstall/reinstall the time-slider package.


thx for the help

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Sol11 missing snapshot facility

2012-07-05 Thread Carsten John
Hello everybody,


for some reason I cannot find the zfs auto-snapshot service facility any more. 
I already reinstalled time-slider, but it refuses to start:


RuntimeError: Error reading SMF schedule instances
Details:
['/usr/bin/svcs', '-H', '-o', 'state', 'svc:/system/filesystem/zfs/auto-snapshot:monthly'] failed with exit code 1
svcs: Pattern 'svc:/system/filesystem/zfs/auto-snapshot:monthly' doesn't match any instances



does anybody know a way to get the services back again?


thx


Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sol11 missing snapshot facility

2012-07-05 Thread Carsten John
-Original message-
To: Carsten John cj...@mpi-bremen.de; 
CC: zfs-discuss@opensolaris.org; 
From:   Ian Collins i...@ianshome.com
Sent:   Thu 05-07-2012 09:59
Subject:Re: [zfs-discuss] Sol11 missing snapshot facility
 On 07/ 5/12 06:52 PM, Carsten John wrote:
  Hello everybody,
 
 
  for some reason I can not find the zfs auto-snapshot service facility any 
 more. I already reinstalled time-slider, but it refuses to start:
 
 
  RuntimeError: Error reading SMF schedule instances
  Details:
  ['/usr/bin/svcs', '-H', '-o', 'state', 
 'svc:/system/filesystem/zfs/auto-snapshot:monthly'] failed with exit code 1
  svcs: Pattern 'svc:/system/filesystem/zfs/auto-snapshot:monthly' doesn't 
 match any instances
 
 Have you looked with svcs -a?
 
 # svcs -a | grep zfs
 disabled   Jul_02   svc:/system/filesystem/zfs/auto-snapshot:daily
 disabled   Jul_02   svc:/system/filesystem/zfs/auto-snapshot:frequent
 disabled   Jul_02   svc:/system/filesystem/zfs/auto-snapshot:hourly
 disabled   Jul_02   svc:/system/filesystem/zfs/auto-snapshot:monthly
 disabled   Jul_02   svc:/system/filesystem/zfs/auto-snapshot:weekly
 disabled   Jul_02   svc:/application/time-slider/plugin:zfs-send
 
 -- 
 Ian.
 
 


Hi Ian,

yes, I already checked that:

svcs -a | grep zfs
disabled   11:50:39 svc:/application/time-slider/plugin:zfs-send

is the only service I get listed.


thx

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sol11 missing snapshot facility

2012-07-05 Thread Carsten John
-Original message-
To: Carsten John cj...@mpi-bremen.de; 
CC: zfs-discuss@opensolaris.org; 
From:   Ian Collins i...@ianshome.com
Sent:   Thu 05-07-2012 11:35
Subject:Re: [zfs-discuss] Sol11 missing snapshot facility
 On 07/ 5/12 09:25 PM, Carsten John wrote:
 
  Hi Ian,
 
  yes, I already checked that:
 
  svcs -a | grep zfs
  disabled   11:50:39 svc:/application/time-slider/plugin:zfs-send
 
  is the only service I get listed.
 
 Odd.
 
 How did you install?
 
 Is the manifest there 
 (/lib/svc/manifest/system/filesystem/auto-snapshot.xml)?
 
 -- 
 Ian.
 
 

Hi Ian,

I installed from CD/DVD, but it might have been in a rush, as I needed to 
replace a broken machine as quickly as possible.

The manifest is there:


ls /lib/svc/manifest/system/filesystem/
.  .. auto-snapshot.xml  autofs.xml 
local-fs.xml   minimal-fs.xml rmvolmgr.xml   root-fs.xml
ufs-quota.xml  usr-fs.xml



thx


Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] snapshots slow on sol11?

2012-06-26 Thread Carsten John
Hello everybody,

I recently migrated a file server (NFS & Samba) from OpenSolaris (Build 111) to 
Sol11. Since the move we are facing random (or random-looking) outages of our 
Samba service. As we have moved several folders (like Desktop and ApplicationData) out 
of the usual profile into a folder inside the users' home shares, the setup is 
sensitive to timeouts. From time to time users are getting the infamous 
Windows "Delayed Write Failure".

After checking nearly every parameter that came to mind over the last few days, 
the zfs-auto-snapshot mechanism inside Solaris 11 came to my attention. We had 
hourly and daily snapshots enabled and discovered that the snapshots are not 
being rotated as expected.
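
A rough way to see this (snapshot naming as on our boxes) is to count the 
auto-snapshots per schedule; if rotation works, the number should stabilise 
instead of growing from run to run:

   zfs list -H -t snapshot -o name | grep zfs-auto-snap | grep hourly | wc -l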

As there were known issues (if I remember correctly) with timesliderd in 
OpenIndiana, and we had the old zfs-auto-snap mechanism (without timesliderd) 
running without any problems before the update, I'm wondering if there are any 
known (performance) issues with the implementation in Solaris 11.



thx


Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] kernel panic during zfs import [UPDATE]

2012-04-17 Thread Carsten John
Hello everybody,

just to let you know what happened in the meantime:

I was able to open a Service Request at Oracle.

The issue is a known bug (Bug 6742788 : assertion panic at: zfs:zap_deref_leaf)

The bug has been fixed (according to Oracle support) since build 164, but there 
is no fix for Solaris 11 available so far (will be fixed in S11U7?).

There is a workaround available that works (partly), but my system crashed 
again when trying to rebuild the offending zfs within the affected zpool.

At the moment I'm waiting for a so-called "interim diagnostic relief" patch.


cu

Carsten

-- 
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] kernel panic during zfs import [ORACLE should notice this]

2012-03-30 Thread Carsten John
-Original message-
To: zfs-discuss@opensolaris.org; 
From:   John D Groenveld jdg...@elvis.arl.psu.edu
Sent:   Fri 30-03-2012 21:47
Subject:Re: [zfs-discuss] kernel panic during zfs import [ORACLE should 
notice this]
 In message 4f735451.2020...@oracle.com, Deepak Honnalli writes:
  Thanks for your reply. I would love to take a look at the core
  file. If there is a way this can somehow be transferred to
  the internal cores server, I can work on the bug.
 
  I am not sure about the modalities of transferring the core
  file though. I will ask around and see if I can help you here.
 
 How to Upload Data to Oracle Such as Explorer and Core Files [ID 1020199.1]
 
 John
 groenv...@acm.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 

Hi John,

in the meantime I managed to open a service request at Oracle. There is a 
web portal at https://supportfiles.sun.com where you can upload the files...


cu

Carsten

-- 
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Puzzling problem with zfs receive exit status

2012-03-29 Thread Carsten John
-Original message-
To: zfs-discuss@opensolaris.org; 
From:   Borja Marcos bor...@sarenet.es
Sent:   Thu 29-03-2012 11:49
Subject:[zfs-discuss] Puzzling problem with zfs receive exit status
 
 Hello,
 
 I hope someone has an idea. 
 
 I have a replication program that copies a dataset from one server to another 
 one. The replication mechanism is the obvious one, of course:
 
 zfs send -Ri snapshot(n-1) snapshot(n) > file
 scp the file to the remote machine (I do it this way instead of using a pipeline so that a 
 network error won't interrupt a receive data stream)
 and on the remote machine,
 zfs receive -Fd pool
 
 It's been working perfectly for months, no issues. However, yesterday we 
 began 
 to see something weird: the zfs receive being executed on the remote machine 
 is 
 exiting with an exit status of 1, even though the replication is finished, 
 and 
 I see the copied snapshots on the remote machine. 
 
 Any ideas? It's really puzzling. It seems that the replication is working (a 
 zfs list -t snapshot shows the new snapshots correctly applied to the 
 dataset) 
 but I'm afraid there's some kind of corruption.
 
 The OS is Solaris, SunOS  5.10 Generic_141445-09 i86pc i386 i86pc.
 
 Any ideas?
 
 
 
 Thanks in advance,
 
 
 
 
 
 Borja.
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 


Hi Borja,


did you try to check the snapshot file with zstreamdump? It will validate the 
checksums.
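
Something along these lines (the file name is just a placeholder), assuming 
zstreamdump is available on your release:

   zstreamdump < file

or, to verify a stream while generating it:

   zfs send -Ri snapshot(n-1) snapshot(n) | zstreamdump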

Perhaps the information here

http://blog.richardelling.com/2009/10/check-integrity-of-zfs-send-streams.html

might be useful for you.



Carsten

-- 
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] kernel panic during zfs import

2012-03-28 Thread Carsten John
-Original message-
To: ZFS Discussions zfs-discuss@opensolaris.org; 
From:   Paul Kraus p...@kraus-haus.org
Sent:   Tue 27-03-2012 15:05
Subject:Re: [zfs-discuss] kernel panic during zfs import
 On Tue, Mar 27, 2012 at 3:14 AM, Carsten John cj...@mpi-bremen.de wrote:
  Hallo everybody,
 
  I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic 
 during the import of a zpool (some 30TB) containing ~500 zfs filesystems 
 after 
 reboot. This causes a reboot loop, until booted single user and removed 
 /etc/zfs/zpool.cache.
 
 
  From /var/adm/messages:
 
  savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf 
 Page fault) rp=ff002f9cec50 addr=20 occurred in module zfs due to a 
 NULL 
 pointer dereference
  savecore: [ID 882351 auth.error] Saving compressed system crash dump in 
 /var/crash/vmdump.2
 
 
 I ran into a very similar problem with Solaris 10U9 and the
 replica (zfs send | zfs recv destination) of a zpool of about 25 TB of
 data. The problem was an incomplete snapshot (the zfs send | zfs recv
 had been interrupted). On boot the system was trying to import the
 zpool and as part of that it was trying to destroy the offending
 (incomplete) snapshot. This was zpool version 22 and destruction of
 snapshots is handled as a single TXG. The problem was that the
 operation was running the system out of RAM (32 GB worth). There is a
 fix for this and it is in zpool 26 (or newer), but any snapshots
 created while the zpool is at a version prior to 26 will have the
 problem on-disk. We have support with Oracle and were able to get a
 loaner system with 128 GB RAM to clean up the zpool (it took about 75
 GB RAM to do so).
 
 If you are at zpool 26 or later this is not your problem. If you
 are at zpool < 26, then test for an incomplete snapshot by importing
 the pool read only, then `zdb -d zpool | grep '%'` as the incomplete
 snapshot will have a '%' instead of a '@' as the dataset / snapshot
 separator. You can also run the zdb against the _un_imported_ zpool
 using the -e option to zdb.
 
 See the following Oracle Bugs for more information.
 
 CR# 6876953
 CR# 6910767
 CR# 7082249
 
 CR#7082249 has been marked as a duplicate of CR# 6948890
 
 P.S. I have a suspect that the incomplete snapshot was also corrupt in
 some strange way, but could never make a solid determination of that.
 We think what caused the zfs send | zfs recv to be interrupted was
 hitting an e1000g Ethernet device driver bug.
 
 -- 
 {1-2-3-4-5-6-7-}
 Paul Kraus
 - Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
 - Sound Coordinator, Schenectady Light Opera Company (
 http://www.sloctheater.org/ )
 - Technical Advisor, Troy Civic Theatre Company
 - Technical Advisor, RPI Players
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 

Hi,


this scenario seems to fit. The machine that was sending the snapshot is on 
OpenSolaris Build 111b (which is running zpool version 14).

I rebooted the receiving machine due to a hanging zfs receive that couldn't 
be killed.

zdb -d -e pool does not give any useful information:

zdb -d -e san_pool   
Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects


When importing the pool read-only, I get errors about two datasets:

zpool import -o readonly=on san_pool
cannot set property for 'san_pool/home/someuser': dataset is read-only
cannot set property for 'san_pool/home/someotheruser': dataset is read-only

As this is a mirror machine, I still have the option to destroy the pool and 
copy over the stuff via send/receive from the primary. But nobody knows how 
long this will work until I'm hit again...

If an interrupted send/receive can screw up a 30 TB target pool, then 
send/receive isn't an option for replicating data at all; furthermore, it should 
be flagged as "don't use this if your target pool might contain any valuable data".

I will reproduce the crash once more and try to file a bug report for S11 as 
recommended by Deepak (not so easy these days...).



thanks



Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] kernel panic during zfs import [ORACLE should notice this]

2012-03-28 Thread Carsten John
-Original message-
To: zfs-discuss@opensolaris.org; 
From:   Deepak Honnalli deepak.honna...@oracle.com
Sent:   Wed 28-03-2012 09:12
Subject:Re: [zfs-discuss] kernel panic during zfs import
 Hi Carsten,
 
  This was supposed to be fixed in build 164 of Nevada (6742788). If 
 you are still seeing this
  issue in S11, I think you should raise a bug with relevant details. 
 As Paul has suggested,
  this could also be due to incomplete snapshot.
 
  I have seen interrupted zfs recv's causing weird bugs.
 
 Thanks,
 Deepak.


Hi Deepak,

I just spent about an hour (or two) trying to file a bug report regarding the 
issue without success.

Seems to me that I'm too stupid to use this MyOracleSupport portal.

So, as I'm getting paid for keeping systems running and not for clicking through 
Flash-overloaded support portals searching for CSIs, I'm giving the relevant 
information to the list now.

Perhaps someone at Oracle reading the list is able to file a bug report, or 
contact me off-list.



Background:

Machine A
- Sun X4270 
- Opensolaris Build 111b
- zpool version 14
- primary file server
- sending snapshots via zfs send
- direct-attached Sun J4400 SAS JBODs with a total of 40 TB of storage

Machine B
- Sun X4270
- Solaris 11
- zpool version 33
- mirror server
- receiving snapshots via zfs receive
- FC attached Storagetek FLX280 storage 


Incident:

After a zfs send/receive run machine B had a hanging zfs receive process. To 
get rid of the process, I rebooted the machine. During reboot the kernel 
panicked, resulting in a reboot loop.

To bring up the system, I rebooted single user, removed /etc/zfs/zpool.cache 
and rebooted again.

The damaged pool can be imported read-only, giving a warning:

   $zpool import -o readonly=on san_pool
   cannot set property for 'san_pool/home/someuser': dataset is read-only
   cannot set property for 'san_pool/home/someotheruser': dataset is read-only

The ZFS debugger zdb does not give any additional information:

   $zdb -d -e san_pool
   Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects


The issue can be reproduced by trying to import the pool r/w, resulting in a 
kernel panic.


The fmdump utility gives the following information for the relevant UUID:

   $fmdump -Vp -u 91da1503-74c5-67c2-b7c1-d4e245e4d968
   TIME                            UUID                                  SUNW-MSG-ID
   Mar 28 2012 12:54:26.563203000  91da1503-74c5-67c2-b7c1-d4e245e4d968  SUNOS-8000-KL

   TIME                  CLASS                                           ENA
   Mar 28 12:54:24.2698  ireport.os.sunos.panic.dump_available           0x
   Mar 28 12:54:05.9826  ireport.os.sunos.panic.dump_pending_on_device   0x

   nvlist version: 0
version = 0x0
class = list.suspect
uuid = 91da1503-74c5-67c2-b7c1-d4e245e4d968
code = SUNOS-8000-KL
diag-time = 1332932066 541092
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
__case_state = 0x1
topo-uuid = 3b4117e0-0ac7-cde5-b434-b9735176d591
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = 
sw:///:path=/var/crash/.91da1503-74c5-67c2-b7c1-d4e245e4d968
resource = 
sw:///:path=/var/crash/.91da1503-74c5-67c2-b7c1-d4e245e4d968
savecore-succcess = 1
dump-dir = /var/crash
dump-files = vmdump.0
os-instance-uuid = 91da1503-74c5-67c2-b7c1-d4e245e4d968
panicstr = BAD TRAP: type=e (#pf Page fault) 
rp=ff002f6dcc50 addr=20 occurred in module zfs due to a NULL pointer 
dereference
panicstack = unix:die+d8 () | unix:trap+152b () | 
unix:cmntrap+e6 () | zfs:zap_leaf_lookup_closest+45 () | 
zfs:fzap_cursor_retrieve+cd () | zfs:zap_cursor_retrieve+195 () | 
zfs:zfs_purgedir+4d () |   zfs:zfs_rmnode+57 () | zfs:zfs_zinactive+b4 () | 
zfs:zfs_inactive+1a3 () | genunix:fop_inactive+b1 () | genunix:vn_rele+58 () | 
zfs:zfs_unlinked_drain+a7 () | zfs:zfsvfs_setup+f1 () | zfs:zfs_domount+152 () 
| zfs:zfs_mount+4e3 () | genunix:fsop_mount+22 () | genunix:domount+d2f () | 
genunix:mount+c0 () | genunix:syscall_ap+92 () | unix:brand_sys_sysenter+1cf () 
| 
crashtime = 1332931339
panic-time = March 28, 2012 12:42:19 PM CEST CEST
(end fault-list[0])

fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x4f72ede2 0x2191cbb8


The 'first view' debugger output looks like:

   mdb unix.0 vmcore.0 
   Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp 
scsi_vhci zfs mpt sd ip hook neti arp usba uhci sockfs qlc fctl s1394 kssl lofs 
random idm sppp crypto sata fcip cpc fcp ufs logindmux ptm ]
$c
   zap_leaf_lookup_closest+0x45(ff0728eac588, 

[zfs-discuss] zfs import from i86 to sparc

2012-03-06 Thread Carsten John
Hi everybody,

should we expect any problems if we try to export/import a zfs pool from 
OpenSolaris (Intel, zpool version 14) to Solaris 10 (SPARC, zpool version 19)?


thanks


Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs send/receive script

2012-03-06 Thread Carsten John
Hello everybody,

I set up a script to replicate all zfs filesystems (some 300 user home 
directories in this case) within a given pool to a mirror machine. The basic 
idea is to send snapshots incrementally if the corresponding snapshot exists 
on the remote side, or to send a complete snapshot if no corresponding previous 
snapshot is available.
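
The per-filesystem decision is roughly the following (simplified sketch; pool, 
host and snapshot names are placeholders, the real script is linked below):

   FS=$1                                   # e.g. home/someuser
   NEW=mirror-$(date +%Y%m%d%H%M)
   zfs snapshot "san_pool/$FS@$NEW"
   # does the remote side already have a snapshot of this filesystem?
   LAST=$(ssh mirrorhost "zfs list -H -t snapshot -o name" | grep "^san_pool/$FS@" | tail -1)
   if [ -n "$LAST" ]
   then
       # remote snapshot exists (assuming the previous run's snapshot is still present locally)
       zfs send -i "${LAST#*@}" "san_pool/$FS@$NEW" | ssh mirrorhost "zfs receive -F san_pool/$FS"
   else
       # nothing on the remote side yet -> full send
       zfs send "san_pool/$FS@$NEW" | ssh mirrorhost "zfs receive san_pool/$FS"
   fi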

The setup basically works, but from time to time (within a run over all 
filesystems) I get error messages like:

"cannot receive new filesystem stream: dataset is busy" or

"cannot receive incremental filesystem stream: dataset is busy"

The complete script is available under:

http://pastebin.com/AWevkGAd


does anybody have a suggestion as to what might cause the dataset to be busy?



thx


Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] snapshots in solaris11 express

2011-07-27 Thread Carsten John
Hello everybody,

is there any known way to configure the point-in-time *when* the time-slider 
will snapshot/rotate?

With hundreds of zfs filesystems, the daily snapshot rotation slows down a big 
file server significantly, so it would be better to have the snapshots rotated 
outside the usual work hours.

As far as I have found out, the first snapshot is taken when the service is 
restarted and the next one occurs 24 hours later (as designed). Do I need to 
restart the service at 2:00 AM to get the desired result? (Not a big deal 
with /usr/bin/at, but not as straightforward as I would expect.)
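
If it comes down to that, a one-off job would be something like (service name 
as it appears on my system):

   echo 'svcadm restart svc:/application/time-slider:default' | at 2:00am tomorrow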

Any suggestions?


thx


Carsten

-- 
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace zil drive

2011-06-28 Thread Carsten John


bin65MpxTCk5V.bin
Description: PGP/MIME version identification


encrypted.asc
Description: OpenPGP encrypted message
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace zil drive

2011-06-28 Thread Carsten John

On 06/28/11 02:55, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Carsten John

 Now I'm wondering about the best option to replace the HDD with the SSD:

 What version of zpool are you running?  If it's >= 19, then you could
 actually survive a complete ZIL device failure.  So you should simply
 offline or detach or whatever the HDD and then either attach or add the new
 SSD.  Attach would be mirror, add would be two separate non-mirrored
 devices.  Maybe better performance, maybe not.

 If it's zpool < 19, then you absolutely do not want to degrade to
 non-mirrored status.  First attach the new SSD, then when it's done, detach
 the HDD.



Sorry, I sent that encrypted before.

I'm currently running:

zpool upgrade -v
This system is currently running ZFS pool version 31

So, detaching the HDD seems to be a safe option.


thx


Carsten


-- 
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace zil drive [SOLVED}

2011-06-28 Thread Carsten John

On 06/28/11 02:55, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Carsten John

 Now I'm wondering about the best option to replace the HDD with the SSD:
 
 What version of zpool are you running?  If it's >= 19, then you could
 actually survive a complete ZIL device failure.  So you should simply
 offline or detach or whatever the HDD and then either attach or add the new
 SSD.  Attach would be mirror, add would be two separate non-mirrored
 devices.  Maybe better performance, maybe not.

 If it's zpool < 19, then you absolutely do not want to degrade to
 non-mirrored status.  First attach the new SSD, then when it's done, detach
 the HDD.
 


Worked like a charm.

Detached the HDD, physically replaced it with the new SSD and added
the new SSD to the pool's log.
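
For the record, that boiled down to something like this (pool and device names 
are placeholders; attach re-creates the log mirror, whereas add would create a 
second standalone log device instead):

   zpool detach san_pool c9t3d0            # remove the temporary HDD from the log mirror
   # ...physically swap the HDD for the new SSD in that slot...
   zpool attach san_pool c9t2d0 c9t3d0     # re-mirror the surviving log SSD with the new one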


thx for the suggestions


Carsten

-- 
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] replace zil drive

2011-06-27 Thread Carsten John

Hello everybody,

some time ago an SSD within a ZIL mirror died. As I had no SSD available
to replace it, I dropped in a normal SAS hard disk to rebuild the mirror.

In the meantime I got the warranty replacement SSD.

Now I'm wondering about the best option to replace the HDD with the SSD:

1. Remove the log mirror, put the new disk in place, add log mirror

2. Pull the HDD, forcing the mirror to fail, replace the HDD with the SSD

Unfortunately I have no free slot in the JBOD available (I want to keep
the ZIL in the same JBOD as the rest of the pool):

3. Put additional temporary SAS HDD in free slot of different JBOD,
replace the HDD in the ZIL mirror with temporary HDD, pull now unused
HDD, use free slot for SSD, replace temporary HDD with SSD.



Any suggestions?


thx



Carsten





-- 
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pool died during scrub

2010-09-01 Thread Carsten John

Jeff Bacon wrote:
 I have a bunch of sol10U8 boxes with ZFS pools, most all raidz2 8-disk
 stripe. They're all supermicro-based with retail LSI cards.
 
 I've noticed a tendency for things to go a little bonkers during the
 weekly scrub (they all scrub over the weekend), and that's when I'll
 lose a disk here and there. OK, fine, that's sort of the point, and
 they're SATA drives so things happen. 
 
 I've never lost a pool though, until now. This is Not Fun. 
 
 ::status
 debugging crash dump vmcore.0 (64-bit) from ny-fs4
 operating system: 5.10 Generic_142901-10 (i86pc)
 panic message:
 BAD TRAP: type=e (#pf Page fault) rp=fe80007cb850 addr=28 occurred
 in module zfs due to a NULL pointer dereference
 dump content: kernel pages only
 $C
 fe80007cb960 vdev_is_dead+2()
 fe80007cb9a0 vdev_mirror_child_select+0x65()
 fe80007cba00 vdev_mirror_io_start+0x44()
 fe80007cba30 zio_vdev_io_start+0x159()
 fe80007cba60 zio_execute+0x6f()
 fe80007cba90 zio_wait+0x2d()
 fe80007cbb40 arc_read_nolock+0x668()
 fe80007cbbd0 dmu_objset_open_impl+0xcf()
 fe80007cbc20 dsl_pool_open+0x4e()
 fe80007cbcc0 spa_load+0x307()
 fe80007cbd00 spa_open_common+0xf7()
 fe80007cbd10 spa_open+0xb()
 fe80007cbd30 pool_status_check+0x19()
 fe80007cbd80 zfsdev_ioctl+0x1b1()
 fe80007cbd90 cdev_ioctl+0x1d()
 fe80007cbdb0 spec_ioctl+0x50()
 fe80007cbde0 fop_ioctl+0x25()
 fe80007cbec0 ioctl+0xac()
 fe80007cbf10 _sys_sysenter_post_swapgs+0x14b()
 
   pool: srv
 id: 9515618289022845993
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
 devices and try again.
see: http://www.sun.com/msg/ZFS-8000-6X
 config:
 
 srvUNAVAIL  missing device
   raidz2   ONLINE
 c2t5000C5001F2CCE1Fd0  ONLINE
 c2t5000C5001F34F5FAd0  ONLINE
 c2t5000C5001F48D399d0  ONLINE
 c2t5000C5001F485EC3d0  ONLINE
 c2t5000C5001F492E42d0  ONLINE
 c2t5000C5001F48549Bd0  ONLINE
 c2t5000C5001F370919d0  ONLINE
 c2t5000C5001F484245d0  ONLINE
   raidz2   ONLINE
 c2t5F000B5C8187d0  ONLINE
 c2t5F000B5C8157d0  ONLINE
 c2t5F000B5C9101d0  ONLINE
 c2t5F000B5C8167d0  ONLINE
 c2t5F000B5C9120d0  ONLINE
 c2t5F000B5C9151d0  ONLINE
 c2t5F000B5C9170d0  ONLINE
 c2t5F000B5C9180d0  ONLINE
   raidz2   ONLINE
 c2t5000C50010A88E76d0  ONLINE
 c2t5000C5000DCD308Cd0  ONLINE
 c2t5000C5001F1F456Dd0  ONLINE
 c2t5000C50010920E06d0  ONLINE
 c2t5000C5001F20C81Fd0  ONLINE
 c2t5000C5001F3C7735d0  ONLINE
 c2t5000C500113BC008d0  ONLINE
 c2t5000C50014CD416Ad0  ONLINE
 
 Additional devices are known to be part of this pool, though
 their
 exact configuration cannot be determined.
 
 
 All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE
 PART OF THE POOL. How can it be missing a device that didn't exist? 
 
 A zpool import -fF results in the above kernel panic. This also
 creates /etc/zfs/zpool.cache.tmp, which then results in the pool being
 imported, which leads to a continuous reboot/panic cycle. 
 
 I can't obviously use b134 to import the pool without logs, since that
 would imply upgrading the pool first, which is hard to do if it's not
 imported. 
 
 My zdb skills are lacking - zdb -l gets you about so far and that's it.
 (where the heck are the other options to zdb even written down, besides
 in the code?)
 
 OK, so this isn't the end of the world, but it's 15TB of data I'd really
 rather not have to re-copy across a 100Mbit line. It really more
 concerns me that ZFS would do this in the first place - it's not
 supposed to corrupt itself!!
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Hi Jeff,



looks similar to a crash I had here at our site a few months ago. Same
symptoms, no actual solution. We had to recover from an rsync backup server.


We had the log on mirrored SSDs and an additional SSD as cache.

The machine (Sun X4270 with Sun J4400 JBODs and Sun SAS disks) crashed in
the same manner (core dumping while trying to import the pool). After
booting into single-user mode we found the pool's log mirror corrupted
(one disk unavailable). Even after replacing the disk and resilvering
the log mirror we were not able to import the pool.

I suspect it may have been related to memory (perhaps a lack of memory).


all the best


Carsten





--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1

[zfs-discuss] crashed zpool

2010-03-01 Thread Carsten John

Hello everybody,

last week we experienced a severe outage due to a crashed zpool. I'm now
in the process of investigating the reason for the crash, to prevent it
in the future. Maybe some of the people with more experience are able
to help me.

The setup:

- Sun Fire X4270 with 16GB RAM running OpenSolaris 2009.06, acting as
Samba PDC and NIS/NFS server for some 400 users.

- sas_pool built of 24x 300GB SAS disks (4x raidz) in JBODs, 2x 32GB SSDs
(mirror) for the zfs log, 1x 160GB SSD for the zfs cache

- bulk_pool containing 42x 1TB SATA/SAS disks in 2 JBODs


The machine worked for several months without a problem. A week ago we added
the last set of 6 disks to the sas_pool.


What happened:

The server became unavailable; obviously it had crashed and written a
kernel core dump.

After rebooting the machine, the server crashed again (core dumping)
while trying to mount the zfs filesystems (home directories) from the
sas_pool.

We booted single user and checked the zpool status. The sas_pool was
degraded with a failed SSD in the log mirror. We replaced the
failed disk and waited until the resilvering process had finished (it took
some 4 hours). zpool status for the pool was fine after that. Rebooting
the machine in multi-user mode resulted in the same core dump as before.

Fortunately we had an rsync mirror of our home directories (a second 4270
with a bunch of SATA JBODs). We finally mounted the spare machine via
NFS instead of the crashed pool to keep services running.


What might be the reason?

- the failed SSD (shouldn't harm as it is mirrored)
- not enough RAM causing the crash, damaging the zpool


Is there any chance to reanimate the crashed pool? Otherwise we need to
rebuild the pool from scratch and rsync from the fallback (this will take
several days).




Thanks in advance for any suggestions



Carsten





--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key:http://www.mpi-bremen.de/Carsten_John.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss