Re: [OmniOS-discuss] Resilver zero progress

2017-05-10 Thread Joshua M. Clulow
On 10 May 2017 at 16:22, Richard Elling wrote:
> mdb’s "::zfs_dbgmsg" macro shows scan progress

You can also watch the messages in real-time using DTrace; e.g.,

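dtrace -qn '
BEGIN
{
        /* walltimestamp is nanoseconds since the epoch */
        last = walltimestamp;
}

/* print a blank separator line if there was a gap since the last message */
zfs-dbgmsg
/walltimestamp - last > 100/
{
        printf("\n");
}

/* arg0 is the kernel address of the message string */
zfs-dbgmsg
{
        printf("%Y  %s\n", walltimestamp, stringof(arg0));
        last = walltimestamp;
}
'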


Cheers.

-- 
Joshua M. Clulow
UNIX Admin/Developer
http://blog.sysmgr.org
___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss


Re: [OmniOS-discuss] LX: real ksh93 broken

2017-05-10 Thread Ludovic Orban
Okay, I found what causes ksh to misbehave. It's in sh_init(), when
shgd->lim.child_max is initialized with the results of
getconf("CHILD_MAX"), see:
https://github.com/att/ast/blob/master/src/cmd/ksh93/sh/init.c#L1289

I've commented out that line, hardcoded shgd->lim.child_max to 128, rebuilt
and voila: ksh works as it should.

Now I have to dig into that getconf() call to figure out what the
returned value is and where it's coming from. Sounds trivial, but my C is
*very* rusty, the asm gcc generates doesn't look at all like what the JVM's
JIT generates (which gives me the wrong reflexes, as I'm used to the latter),
and I'm not very familiar with mdb.

Oh well, that turned into a nice debugging re-training session, which I very
much needed. It reminds me of the good old days at my first job, when I was
porting Linux apps to Solaris.

Thank you for maintaining such a well-designed and pleasant to use OS!


On Wed, May 10, 2017 at 3:59 PM, Dan McDonald  wrote:

> Wow, thank you for the further deep-diving.
>
> > On May 10, 2017, at 5:21 AM, Ludovic Orban wrote:
> >
> > Looking at ksh's sources, my understanding is that job_post is stuck in
> > that else clause:
> >    else
> >    {
> >        /* create a new job */
> >        while((pw->p_job = job_alloc()) < 0)
> >            job_wait((pid_t)1);
> >        pw->p_nxtjob = job.pwlist;
> >        pw->p_nxtproc = 0;
> >    }
> >
> > Digging into the sources and stepping through the instructions of
> > job_alloc and job_byjid it looks like ksh cannot allocate a job id as it
> > believes they're all reserved. But so far, all this code is purely working
> > on internal structures of ksh so an LX bug would have no impact.
> >
> > I'll continue looking into this as time permits and I'll post an update
> > if I find anything worth mentioning.
> >
>
> Be careful of narrowing your focus too far.  I see some things worth
> considering:
>
> 1.) Is the "if" you're not showing me dependent on something in global
> state that may have been mis-initialized by an LX emulation bug?
>
> 2.) Same question as #1, but applied to job_alloc() and job_wait().
>
> I'm guessing LX in OmniOS is failing because I mismerged or plain forgot
> something, given that Nahum says he can run ksh93 on SmartOS just fine.
>
>
> Please make sure you're looking at the bigger picture, but THANK YOU for
> the further investigation.
>
> Dan
>
>


[OmniOS-discuss] Resilver zero progress

2017-05-10 Thread Schweiss, Chip
I have a pool that has had a resilver running for about an hour, but the
progress status is a bit alarming. I'm concerned that for some reason it will
not resilver. Resilvers are tuned to be faster in /etc/system. This is
on OmniOS r151014, currently fully updated. Any suggestions?

-Chip

from /etc/system:

set zfs:zfs_resilver_delay = 0
set zfs:zfs_scrub_delay = 0
set zfs:zfs_top_maxinflight = 64
set zfs:zfs_resilver_min_time_ms = 5000


# zpool status hcp03
  pool: hcp03
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed May 10 09:22:15 2017
        1 scanned out of 545T at 1/s, (scan is slow, no estimated time)
        0 resilvered, 0.00% done
config:

        NAME                         STATE     READ WRITE CKSUM
        hcp03                        DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c0t5000C500846F161Fd0    ONLINE       0     0     0
            spare-1                  UNAVAIL      0     0     0
              5676922542927845170    UNAVAIL      0     0     0  was /dev/dsk/c0t5000C5008473DBF3d0s0
              c0t5000C500846F1823d0  ONLINE       0     0     0
            c0t5000C500846F134Fd0    ONLINE       0     0     0
            c0t5000C500846F139Fd0    ONLINE       0     0     0
            c0t5000C5008473B89Fd0    ONLINE       0     0     0
            c0t5000C500846F145Bd0    ONLINE       0     0     0
            c0t5000C5008473B6BBd0    ONLINE       0     0     0
            c0t5000C500846F131Fd0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c0t5000C5008473BB63d0    ONLINE       0     0     0
            c0t5000C5008473C9C7d0    ONLINE       0     0     0
            c0t5000C500846F1A17d0    ONLINE       0     0     0
            c0t5000C5008473A0A3d0    ONLINE       0     0     0
            c0t5000C5008473D047d0    ONLINE       0     0     0
            c0t5000C5008473BF63d0    ONLINE       0     0     0
            c0t5000C5008473BC83d0    ONLINE       0     0     0
            c0t5000C5008473E35Bd0    ONLINE       0     0     0
          raidz2-2                   ONLINE       0     0     0
            c0t5000C5008473ABAFd0    ONLINE       0     0     0
            c0t5000C5008473ADF3d0    ONLINE       0     0     0
            c0t5000C5008473AE77d0    ONLINE       0     0     0
            c0t5000C5008473A23Bd0    ONLINE       0     0     0
            c0t5000C5008473C907d0    ONLINE       0     0     0
            c0t5000C5008473CCABd0    ONLINE       0     0     0
            c0t5000C5008473C77Fd0    ONLINE       0     0     0
            c0t5000C5008473B6D3d0    ONLINE       0     0     0
          raidz2-3                   ONLINE       0     0     0
            c0t5000C5008473E4FFd0    ONLINE       0     0     0
            c0t5000C5008473ECFFd0    ONLINE       0     0     0
            c0t5000C5008473F4C3d0    ONLINE       0     0     0
            c0t5000C5008473F8CFd0    ONLINE       0     0     0
            c0t5000C500846F1897d0    ONLINE       0     0     0
            c0t5000C500846F14B7d0    ONLINE       0     0     0
            c0t5000C500846F1353d0    ONLINE       0     0     0
            c0t5000C5008473EEDFd0    ONLINE       0     0     0
          raidz2-4                   ONLINE       0     0     0
            c0t5000C500846F144Bd0    ONLINE       0     0     0
            c0t5000C5008473F10Fd0    ONLINE       0     0     0
            c0t5000C500846F15CBd0    ONLINE       0     0     0
            c0t5000C500846F1493d0    ONLINE       0     0     0
            c0t5000C5008473E26Fd0    ONLINE       0     0     0
            c0t5000C500846F1A0Bd0    ONLINE       0     0     0
            c0t5000C5008473EE07d0    ONLINE       0     0     0
            c0t5000C500846F1453d0    ONLINE       0     0     0
          raidz2-5                   ONLINE       0     0     0
            c0t5000C500846F153Bd0    ONLINE       0     0     0
            c0t5000C5008473F9EBd0    ONLINE       0     0     0
            c0t5000C500846F14EFd0    ONLINE       0     0     0
            c0t5000C5008473AB0Bd0    ONLINE       0     0     0
            c0t5000C500846F140Bd0    ONLINE       0     0     0
            c0t5000C5008473FC0Fd0    ONLINE       0     0     0
            c0t5000C5008473DFA3d0    ONLINE       0     0     0
            c0t5000C5008473F89Bd0    ONLINE       0     0     0
          raidz2-6                   ONLINE       0     0     0
            c0t5000C500846F19BFd0    ONLINE       0     0     0
            c0t5000C5008473D1ABd0    ONLINE       0     0     0
            c0t5000C50084739FD3d0    ONLINE       0     0     0
            c0t5000C5008473FFB7d0    ONLINE       0     0     0
            c0t5000C5008473E72Fd0    ONLINE       0     0     0

Re: [OmniOS-discuss] LX: real ksh93 broken

2017-05-10 Thread Ludovic Orban
In x86 asm, cmpl serves for both signed and unsigned comparisons; it's the
jump that follows which decides whether the result is read as signed or not.
In this case it's jl ("jump if less"), so it's signed (vs jb, "jump if below",
which is unsigned). But I digress.

I've recompiled ksh93 with debug, no stripped symbols and no optimizations
(the binary is here: https://www.dropbox.com/s/brys628g40akruv/ksh93.gz?dl=0)
and managed to figure out where that infinite loop is happening:

> ::stack
job_byjid+5()
job_alloc+0x62()
job_post+0x1a2()
_sh_fork+0x265()
sh_ntfork+0xa99()
sh_exec+0x2be8()
sh_subshell+0x982()
comsubst+0xbf0()
varsub+0x3f4()
copyto+0xa2a()
sh_mactrim+0x196()
nv_setlist+0x220()
sh_exec+0xdb7()
sh_eval+0x2b9()
sh_trap+0x29b()
ed_setup+0x7ac()
ed_viread+0xf6()
slowread+0x181()
sfrd+0x4da()
_sffilbuf+0x433()
sfreserve+0x566()
exfile+0x808()
sh_main+0xb38()
main+0x25()
>

Looking at ksh's sources, my understanding is that job_post is stuck in that
else clause:
   else
   {
       /* create a new job */
       while((pw->p_job = job_alloc()) < 0)
           job_wait((pid_t)1);
       pw->p_nxtjob = job.pwlist;
       pw->p_nxtproc = 0;
   }

Digging into the sources and stepping through the instructions of job_alloc
and job_byjid, it looks like ksh cannot allocate a job id as it believes
they're all reserved. But so far, all this code is purely working on
internal structures of ksh, so an LX bug would have no impact.

I'll continue looking into this as time permits and I'll post an update if
I find anything worth mentioning.

--
Ludovic



On Tue, May 9, 2017 at 5:15 PM, Dan McDonald  wrote:

>
> > On May 9, 2017, at 11:05 AM, Dan McDonald  wrote:
> >
> > And I've no good way to know what it's doing, as the illumos-native
> tools aren't giving me enough data.
>
> ksh93 appears to be looping in something:
>
> mdb: target stopped at:
> 0x42adf0:   movq   +0x350129(%rip),%rax <0x77af20>
> > ::step
> mdb: target stopped at:
> 0x42adf7:   testq  %rax,%rax
> > ::step
> mdb: target stopped at:
> 0x42adfa:   jne    +0xc     <0x42ae08>
> > ::step
> mdb: target stopped at:
> 0x42adfc:   jmp    +0x28    <0x42ae26>
> > ::step
> mdb: target stopped at:
> 0x42ae26:   addl   $0x1,%r14d
> > ::step
> mdb: target stopped at:
> 0x42ae2a:   cmpl   0x10(%rsi),%r14d
> > ::step
> mdb: target stopped at:
> 0x42ae2e:   jl     -0x40    <0x42adf0>
> > ::step
> mdb: target stopped at:
> 0x42adf0:   movq   +0x350129(%rip),%rax <0x77af20>
> >  Usage: step [ over | out ] [SIG]
> >  0x7f0470f0
> > 0x7f0470f0/D
> 0x7f0470f0: 2147483647
> > 0x7f0470f0/X
> 0x7f0470f0: 7fffffff
> >  mdb: failed to read data from target: no mapping for address
> 0x761133e5:
> >