Re: [OmniOS-discuss] Resilver zero progress
On 10 May 2017 at 16:22, Richard Elling wrote:
> mdb’s "::zfs_dbgmsg" macro shows scan progress

You can also watch the messages in real-time using DTrace; e.g.,

    dtrace -qn '
        BEGIN {
            last = walltimestamp;
        }
        zfs-dbgmsg /walltimestamp - last > 100/ {
            printf("\n");
        }
        zfs-dbgmsg {
            printf("%Y %s\n", walltimestamp, stringof(arg0));
            last = walltimestamp;
        }
    '

Cheers.

--
Joshua M. Clulow
UNIX Admin/Developer
http://blog.sysmgr.org
___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
Re: [OmniOS-discuss] LX: real ksh93 broken
Okay, I found what causes ksh to misbehave.

It's in sh_init(), when shgd->lim.child_max is initialized with the result of getconf("CHILD_MAX"), see:
https://github.com/att/ast/blob/master/src/cmd/ksh93/sh/init.c#L1289

I've commented out that line, hardcoded shgd->lim.child_max to 128, rebuilt and voila: ksh works as it should.

Now I have to dig into that getconf() function to figure out what the returned value is and where it's coming from. Sounds trivial, but my C is *very* rusty, the asm gcc generates doesn't look at all like what the JVM's JIT generates (which gives me the wrong reflexes, as I'm used to the latter) and I'm not very familiar with mdb.

Oh well, that turned into a nice debugging re-training session which I very much needed. That reminds me of the good old days at my first job, when I was porting Linux apps to Solaris.

Thank you for maintaining such a well-designed and pleasant to use OS!

On Wed, May 10, 2017 at 3:59 PM, Dan McDonald wrote:

> Wow, thank you for the further deep-diving.
>
> > On May 10, 2017, at 5:21 AM, Ludovic Orban wrote:
> >
> > Looking at ksh's sources, my understanding is that job_post is stuck in
> > that else clause:
> >
> >     else
> >     {
> >         /* create a new job */
> >         while((pw->p_job = job_alloc()) < 0)
> >             job_wait((pid_t)1);
> >         pw->p_nxtjob = job.pwlist;
> >         pw->p_nxtproc = 0;
> >     }
> >
> > Digging into the sources and stepping through the instructions of
> > job_alloc and job_byjid, it looks like ksh cannot allocate a job id as it
> > believes they're all reserved. But so far, all this code is purely working
> > on internal structures of ksh, so an LX bug would have no impact.
> >
> > I'll continue looking into this as time permits and I'll post an update
> > if I find anything worth mentioning.
>
> Be careful of narrowing your focus too far. I see some things worth
> considering:
>
> 1.) Is the "if" you're not showing me dependent on something in global
> state that may have been mis-initialized by an LX emulation bug?
>
> 2.) Same question as #1, but applied to job_alloc() and job_wait().
>
> I'm guessing LX in OmniOS is failing because I mismerged or plain forgot
> something, given that Nahum says he can run ksh93 on SmartOS just fine.
>
> Please make sure you're looking at the bigger picture, but THANK YOU for
> the further investigation.
>
> Dan
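For anyone who wants to see what value the zone actually reports, here is a minimal sketch. It assumes that getconf("CHILD_MAX") in ksh93 ultimately reflects what POSIX sysconf(_SC_CHILD_MAX) returns, which has not been verified against the AST sources in this thread. query_child_max() and clamp_child_max() are hypothetical helper names, not ksh93 functions; the 128 fallback only mirrors the hardcoded value mentioned above, and the 1024 ceiling used in the usage note is an arbitrary illustration.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: ask the OS for CHILD_MAX via POSIX sysconf().
 * sysconf() returns -1 when the limit is indeterminate or on error. */
long query_child_max(void)
{
    return sysconf(_SC_CHILD_MAX);
}

/* Hypothetical helper: keep a reported limit inside a sane range.
 * This is an illustration only, not the ksh93 code or a proposed fix. */
long clamp_child_max(long raw, long fallback, long ceiling)
{
    assert(fallback > 0 && fallback <= ceiling);
    if (raw <= 0)
        return fallback;   /* indeterminate or nonsensical value */
    if (raw > ceiling)
        return ceiling;    /* suspiciously large value */
    return raw;
}
```

Printing query_child_max() inside the LX zone and on a native Linux machine would show whether the two disagree. A defensive caller might then use something like clamp_child_max(query_child_max(), 128, 1024).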
[OmniOS-discuss] Resilver zero progress
I have a pool that has had a resilver running for about an hour but the progress status is a bit alarming. I'm concerned for some reason it will not resilver. Resilvers are tuned to be faster in /etc/system. This is on OmniOS r151014, currently fully updated. Any suggestions?

-Chip

from /etc/system:

    set zfs:zfs_resilver_delay = 0
    set zfs:zfs_scrub_delay = 0
    set zfs:zfs_top_maxinflight = 64
    set zfs:zfs_resilver_min_time_ms = 5000

# zpool status hcp03
  pool: hcp03
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed May 10 09:22:15 2017
        1 scanned out of 545T at 1/s, (scan is slow, no estimated time)
        0 resilvered, 0.00% done
config:

        NAME                         STATE     READ WRITE CKSUM
        hcp03                        DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c0t5000C500846F161Fd0    ONLINE       0     0     0
            spare-1                  UNAVAIL      0     0     0
              5676922542927845170    UNAVAIL      0     0     0  was /dev/dsk/c0t5000C5008473DBF3d0s0
              c0t5000C500846F1823d0  ONLINE       0     0     0
            c0t5000C500846F134Fd0    ONLINE       0     0     0
            c0t5000C500846F139Fd0    ONLINE       0     0     0
            c0t5000C5008473B89Fd0    ONLINE       0     0     0
            c0t5000C500846F145Bd0    ONLINE       0     0     0
            c0t5000C5008473B6BBd0    ONLINE       0     0     0
            c0t5000C500846F131Fd0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c0t5000C5008473BB63d0    ONLINE       0     0     0
            c0t5000C5008473C9C7d0    ONLINE       0     0     0
            c0t5000C500846F1A17d0    ONLINE       0     0     0
            c0t5000C5008473A0A3d0    ONLINE       0     0     0
            c0t5000C5008473D047d0    ONLINE       0     0     0
            c0t5000C5008473BF63d0    ONLINE       0     0     0
            c0t5000C5008473BC83d0    ONLINE       0     0     0
            c0t5000C5008473E35Bd0    ONLINE       0     0     0
          raidz2-2                   ONLINE       0     0     0
            c0t5000C5008473ABAFd0    ONLINE       0     0     0
            c0t5000C5008473ADF3d0    ONLINE       0     0     0
            c0t5000C5008473AE77d0    ONLINE       0     0     0
            c0t5000C5008473A23Bd0    ONLINE       0     0     0
            c0t5000C5008473C907d0    ONLINE       0     0     0
            c0t5000C5008473CCABd0    ONLINE       0     0     0
            c0t5000C5008473C77Fd0    ONLINE       0     0     0
            c0t5000C5008473B6D3d0    ONLINE       0     0     0
          raidz2-3                   ONLINE       0     0     0
            c0t5000C5008473E4FFd0    ONLINE       0     0     0
            c0t5000C5008473ECFFd0    ONLINE       0     0     0
            c0t5000C5008473F4C3d0    ONLINE       0     0     0
            c0t5000C5008473F8CFd0    ONLINE       0     0     0
            c0t5000C500846F1897d0    ONLINE       0     0     0
            c0t5000C500846F14B7d0    ONLINE       0     0     0
            c0t5000C500846F1353d0    ONLINE       0     0     0
            c0t5000C5008473EEDFd0    ONLINE       0     0     0
          raidz2-4                   ONLINE       0     0     0
            c0t5000C500846F144Bd0    ONLINE       0     0     0
            c0t5000C5008473F10Fd0    ONLINE       0     0     0
            c0t5000C500846F15CBd0    ONLINE       0     0     0
            c0t5000C500846F1493d0    ONLINE       0     0     0
            c0t5000C5008473E26Fd0    ONLINE       0     0     0
            c0t5000C500846F1A0Bd0    ONLINE       0     0     0
            c0t5000C5008473EE07d0    ONLINE       0     0     0
            c0t5000C500846F1453d0    ONLINE       0     0     0
          raidz2-5                   ONLINE       0     0     0
            c0t5000C500846F153Bd0    ONLINE       0     0     0
            c0t5000C5008473F9EBd0    ONLINE       0     0     0
            c0t5000C500846F14EFd0    ONLINE       0     0     0
            c0t5000C5008473AB0Bd0    ONLINE       0     0     0
            c0t5000C500846F140Bd0    ONLINE       0     0     0
            c0t5000C5008473FC0Fd0    ONLINE       0     0     0
            c0t5000C5008473DFA3d0    ONLINE       0     0     0
            c0t5000C5008473F89Bd0    ONLINE       0     0     0
          raidz2-6                   ONLINE       0     0     0
            c0t5000C500846F19BFd0    ONLINE       0     0     0
            c0t5000C5008473D1ABd0    ONLINE       0     0     0
            c0t5000C50084739FD3d0    ONLINE       0     0     0
            c0t5000C5008473FFB7d0    ONLINE       0     0     0
            c0t5000C5008473E72Fd0    ONLINE       0     0     0
Re: [OmniOS-discuss] LX: real ksh93 broken
In x86 asm, cmpl serves for both signed and unsigned comparisons; it's the conditional jump that follows which interprets the result as signed or unsigned. In this case it's jl, "jump if less", so it's signed (vs. jb, "jump if below", which is unsigned). But I digress.

I've recompiled ksh93 with debug, no stripped symbols and no optimizations (the binary is here: https://www.dropbox.com/s/brys628g40akruv/ksh93.gz?dl=0) and managed to figure out where that infinite loop is happening:

    > ::stack
    job_byjid+5()
    job_alloc+0x62()
    job_post+0x1a2()
    _sh_fork+0x265()
    sh_ntfork+0xa99()
    sh_exec+0x2be8()
    sh_subshell+0x982()
    comsubst+0xbf0()
    varsub+0x3f4()
    copyto+0xa2a()
    sh_mactrim+0x196()
    nv_setlist+0x220()
    sh_exec+0xdb7()
    sh_eval+0x2b9()
    sh_trap+0x29b()
    ed_setup+0x7ac()
    ed_viread+0xf6()
    slowread+0x181()
    sfrd+0x4da()
    _sffilbuf+0x433()
    sfreserve+0x566()
    exfile+0x808()
    sh_main+0xb38()
    main+0x25()
    >

Looking at ksh's sources, my understanding is that job_post is stuck in that else clause:

    else
    {
        /* create a new job */
        while((pw->p_job = job_alloc()) < 0)
            job_wait((pid_t)1);
        pw->p_nxtjob = job.pwlist;
        pw->p_nxtproc = 0;
    }

Digging into the sources and stepping through the instructions of job_alloc and job_byjid, it looks like ksh cannot allocate a job id as it believes they're all reserved. But so far, all this code is purely working on internal structures of ksh, so an LX bug would have no impact.

I'll continue looking into this as time permits and I'll post an update if I find anything worth mentioning.

--
Ludovic

On Tue, May 9, 2017 at 5:15 PM, Dan McDonald wrote:

>
> > On May 9, 2017, at 11:05 AM, Dan McDonald wrote:
> >
> > And I've no good way to know what it's doing, as the illumos-native
> > tools aren't giving me enough data.
>
> ksh93 appears to be looping in something:
>
> mdb: target stopped at:
> 0x42adf0:       movq   +0x350129(%rip),%rax     <0x77af20>
> > ::step
> mdb: target stopped at:
> 0x42adf7:       testq  %rax,%rax
> > ::step
> mdb: target stopped at:
> 0x42adfa:       jne    +0xc     <0x42ae08>
> > ::step
> mdb: target stopped at:
> 0x42adfc:       jmp    +0x28    <0x42ae26>
> > ::step
> mdb: target stopped at:
> 0x42ae26:       addl   $0x1,%r14d
> > ::step
> mdb: target stopped at:
> 0x42ae2a:       cmpl   0x10(%rsi),%r14d
> > ::step
> mdb: target stopped at:
> 0x42ae2e:       jl     -0x40    <0x42adf0>
> > ::step
> mdb: target stopped at:
> 0x42adf0:       movq   +0x350129(%rip),%rax     <0x77af20>
> > Usage: step [ over | out ] [SIG]
> > 0x7f0470f0
> > 0x7f0470f0/D
> 0x7f0470f0:     2147483647
> > 0x7f0470f0/X
> 0x7f0470f0:     7fff
> > mdb: failed to read data from target: no mapping for address
> 0x761133e5:
> >
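To illustrate the cmpl/jl versus jb point from the top of this message, here is a small C sketch. It is not taken from ksh93; less_signed(), less_unsigned() and compare_demo() are made-up names. The comments about which jump a compiler emits describe typical x86 code generation and are an assumption about your compiler, not a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Signed "less than": on x86 a compiler typically emits cmp
 * followed by a signed condition such as jl (or setl). */
int less_signed(int32_t a, int32_t b)
{
    return a < b;
}

/* Unsigned "less than": the same cmp, but typically followed
 * by an unsigned condition such as jb (or setb). */
int less_unsigned(uint32_t a, uint32_t b)
{
    return a < b;
}

/* The bit pattern 0xffffffff gives opposite answers depending
 * on whether it is read as signed (-1) or unsigned (4294967295). */
void compare_demo(void)
{
    assert(less_signed(-1, 1) == 1);
    assert(less_unsigned(0xffffffffu, 1u) == 0);
}
```

In the trace quoted above, the jl after cmpl means the loop counter in %r14d is compared as a signed 32-bit value.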