On Thu 06 Dec 2007 at 02:05PM, Steve Peng wrote:
> 
> Please code review the following changes which address the service
> import performance issue:
> 
> 6351623 Initial manifest-import is slow
> 
> It is under http://cr.opensolaris.org/~stevep/6351623
> 
> The fix is to copy the repository to the fast tmpfs, import the
> manifests and once the import is completed copy the repository
> back to /etc/svc.  During the verification cycle on both sparc
> and amd64 systems, I see ~85 percent performance improvement.
> 
> Any comment/suggestion is greatly appreciated.
> 
> Thanks

Hurray!  This is going to make a huge difference for users of zones,
who hit the slow repo import problem fairly often.  As such, I took
a look at the code.  Mostly, I have nits.   See below.

        -dp


usr/src/cmd/svc/configd/backend.c

---
     1131 +        /*
     1132 +         * Coping
     1133 +         */

I think you mean "Copying"?  It might be nice to say a little more.
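
For example, something along these lines might work (just a guess at the
intent, based on your description above):

        /*
         * Copying the repository to a temporary file on tmpfs so
         * that the manifest import runs against the faster filesystem.
         */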

---

     1121 +                configd_critical("Backend copy faild: mkstemp %s\n", tmppath);

Do you mean "failed"?  "faild" appears three or four times.

---

     1155 +        /*
     1156 +         * Rename files.  Note the following rename ops are done
     1157 +         * "atomically":

"atomically" does not need to be quoted.

---
     1155 +        /*
     1156 +         * Rename files.  Note the following rename ops are done
     1157 +         * "atomically":
     1158 +         *      - unlink target
     1159 +         *      - link source to target
     1160 +         *      - unlink source
     1161 +         *
     1162 +         * Extra works needed if we switch back from tmpfs
     1163 +         */

"works" --> work

Also, the wording here is a bit confusing... is the comment really in
sync with the code?  Do you mean that the whole operation proceeds
atomically?  Or that each of the individual operations is atomic?  Or
something else?

It seems like this code could leak files if parts of it fail-- for example,
if the code at 1180 succeeds (tmppath is renamed to "trans_db"), but the code
at 1188 fails, won't "trans_db" be leaked?

Or, if 1120 (mkstemp) succeeds, but 1125 (open) fails, won't we leak
the temp file?
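
For the mkstemp/open case, something like this would keep the cleanup next
to the failure (just a sketch; "src", "srcfd" and the error return are
made up):

        fd = mkstemp(tmppath);
        if (fd < 0) {
                configd_critical("Backend copy failed: mkstemp %s\n", tmppath);
                return (-1);            /* or whatever the error path returns */
        }

        srcfd = open(src, O_RDONLY);
        if (srcfd < 0) {
                (void) close(fd);
                (void) unlink(tmppath); /* don't leak the temp file */
                return (-1);
        }
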
---

backend_switch()

     1250 +         * If copy fails for any reason and we are switching
     1251 +         * back then we should go ahead fall back to the
     1252 +         * original repository since we don't want to stay
     1253 +         * with tmpfs for too long
     1254 +         */

Run-on sentence (also, please end with a period).  And what do you mean
by "stay with tmpfs for too long"?

---
backend_switch()

Are there some missing calls to sqlite_close in this routine?  I'm not
really familiar with the sqlite API.
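
i.e., on the failure paths I'd have expected something roughly like this
(sketch only; "be_db" and the condition are placeholders, and I may well
have the sqlite details wrong):

        if (/* the copy into the new database failed */) {
                sqlite_close(be_db);    /* don't leak the open handle */
                return (-1);
        }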

---
backend_switch_recovery()

I don't totally understand what is going on here.  The comments should perhaps
explain the "unlink" that happens at the end of the routine.  It seems to me
that if things go wrong (like you can't open the trans db) the code doesn't say
anything about that, and winds up deleting any evidence we might use for
debugging...

Also, this routine seems a bit more complex than it needs to be...
Couldn't it be recoded as follows?

        db = NULL;
        ...

        if (stat(fast_db, &s_buf) == 0)
                db = fast_db;

        /* if both exist, prefer trans_db; it may hold the most recent data */
        if (stat(trans_db, &s_buf) == 0)
                db = trans_db;

        if (db == NULL)
                return;

        be_db = sqlite_open(db, 0600, &errp);
        ...

---

     2158 +         * If system crashes while the beakend switch is performed

nit: backend

     2160 +         * which may contain most recent data so attemp recovery

nit: attempt

nit: please put a period at the end of sentences.

Consider rewriting:

If the system crashed during a backend switch, there might
be a leftover transient database containing useful information
which can be used for recovery.
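
i.e., as a block comment:

        /*
         * If the system crashed during a backend switch, there might
         * be a leftover transient database containing useful information
         * which can be used for recovery.
         */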


Also:

     2162 +        (void) backend_switch_recovery();

The function is declared void, so the cast is not needed.
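
i.e. just:

        backend_switch_recovery();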


---

I'm curious how you have stress-tested this.  A zone in a reboot
loop could be useful here, I'd think.

        -dp

-- 
Daniel Price - Solaris Kernel Engineering - dp at eng.sun.com - blogs.sun.com/dp
