On Thu 06 Dec 2007 at 02:05PM, Steve Peng wrote: > > Please code review the following changes which address the service > import performance issue: > > 6351623 Initial manifest-import is slow > > It is under http://cr.opensolaris.org/~stevep/6351623 > > The fix is to copy the repository to the fast tmpfs, import the > manifests and once the import is completed copy the repository > back to /etc/svc. During the verification cycle on both sparc > and amd64 systems, I see ~85 percent performance improvement. > > Any comment/suggestion is greatly appreciated. > > Thanks
Hurray! This is going to make a huge difference for users of zones, who hit the slow repo import problem fairly often. As such, I took a look at the code. Mostly, I have nits. See below. -dp usr/src/cmd/svc/configd/backend.c --- 1131 + /* 1132 + * Coping 1133 + */ I think you mean "Copying"? It might be nice to say a little more. --- 1121 + configd_critical("Backend copy faild: mkstemp %s\n", tmppath); Do you mean "failed"? "faild" appears three or four times. --- 1155 + /* 1156 + * Rename files. Note the following rename ops are done 1157 + * "atomically": "atomically" does not need to be quoted. --- 1155 + /* 1156 + * Rename files. Note the following rename ops are done 1157 + * "atomically": 1158 + * - unlink target 1159 + * - link source to target 1160 + * - unlink source 1161 + * 1162 + * Extra works needed if we switch back from tmpfs 1163 + */ "works" --> work Also, this is a bit confusing in the wording... is the comment really in sync with the code? Do you mean to say that the whole operation proceeds atomically? Or that each of the individual operations is atomic? Or something else? It seems like this code could leak files if parts of it fail-- for example, if the code at 1180 succeeds (tmppath renames to "trans_db"), but the code at 1188 fails, won't "trans_db" be leaked? Or, if 1120 (mkstemp) succeeds, but 1125 (open) fails, won't we leak the temp file? --- backend_switch() 1250 + * If copy fails for any reason and we are switching 1251 + * back then we should go ahead fall back to the 1252 + * original repository since we don't want to stay 1253 + * with tmpfs for too long 1254 + */ Run-on sentence (also, please end with a period). I don't really know what you mean by "stay with tmpfs for too long"? --- backend_switch() Are there some missing calls to sqllite_close in this routine? I'm not really familiar with the sqllite API. --- backend_switch_recovery() I don't totally understand what is going on here. The comments should perhaps explain the "unlink" that happens at the end of the routine. It seems to me that if things go wrong (like you can't open the trans db) the code doesn't say anything about that, and winds up deleting any evidence we might use for debugging... Also this routine also seems a bit more complex than is needed... Couldn't this be recoded as follows? db = NULL; ... if (stat(fast_db, &s_buf) == 0) db = fast_db; if (stat(trans_db, &s_buf) == 0) db = trans_db; if (db == NULL) return; be_db = sqlite_open(db, 0600, &errp); ... --- 2158 + * If system crashes while the beakend switch is performed nit: backend 2160 + * which may contain most recent data so attemp recovery nit: attempt nit: please put a period at the end of sentences. Consider rewriting: If the system crashed during a backend switch, there might be a leftover transient database containing useful information which can be used for recovery. Also: 2162 + (void) backend_switch_recovery(); The function is declared void, so the cast is not needed. --- I'm curious about how you have stress tested this? A zone in a reboot loop could be useful here, I'd think. -dp -- Daniel Price - Solaris Kernel Engineering - dp at eng.sun.com - blogs.sun.com/dp