Let me know if it seems OK and I'll update the affected laptop. I've let it
get rather behind.

On Tue Feb 24 2015 at 1:08:49 PM Pedro Martelletto <[email protected]>
wrote:

> Hi Oga,
>
> Thanks for the feedback. I've pushed a branch, pedro_softraid, which I
> plan to test later today. Except for the missing DIOCCACHESYNC ioctl, I
> expect it to work.
>
> -p.
>
> >Amusingly, this is actually the theory I had about this. I thought I
> >had shared it with you, but I may well misremember. Had no time to hack it
> >up though.
> >
> >I concur though. Sounds very plausible.
> >
> >On Tue Feb 24 2015 at 7:20:55 AM Pedro Martelletto <[email protected]>
> >wrote:
> >
> >> Here are some thoughts on the interaction between softraid(4) and WAPBL,
> >> and how to improve the current situation. All of the code references in
> >> this mail concern the master branch with HEAD = 3ca4b444.
> >>
> >> WAPBL introduced the DIOCCACHESYNC ioctl, whose purpose, as the name
> >> implies, is to flush an I/O controller's cache.
> >>
> >> WAPBL periodically issues the DIOCCACHESYNC ioctl to ensure that journal
> >> metadata is safely on disk. This use of DIOCCACHESYNC happens at the
> >> file system level, which means that a disk device handling multiple
> >> partitions, each a fully independent file system, may be requested to
> >> perform multiple and possibly concurrent DIOCCACHESYNC operations.
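> >>
> >> For illustration, this is roughly what such a request looks like when
> >> issued by hand from userland against the raw volume (a sketch only;
> >> the device path is just an example):
> >>
> >> #include <sys/types.h>
> >> #include <sys/ioctl.h>
> >> #include <sys/dkio.h>
> >> #include <err.h>
> >> #include <fcntl.h>
> >> #include <unistd.h>
> >>
> >> int
> >> main(void)
> >> {
> >>         int force = 1;
> >>         int fd = open("/dev/rsd2c", O_RDWR);   /* example path only */
> >>
> >>         if (fd == -1)
> >>                 err(1, "open");
> >>         /* Ask the disk driver (here, softraid) to flush its cache. */
> >>         if (ioctl(fd, DIOCCACHESYNC, &force) == -1)
> >>                 err(1, "DIOCCACHESYNC");
> >>         close(fd);
> >>         return (0);
> >> }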
> >>
> >> For softraid(4), DIOCCACHESYNC gets translated to sr_raid_sync(). What
> >> sr_raid_sync() implements is not quite a sync operation per se, but
> >> rather a drain: it loops until all pending I/O operations have finished.
> >> This approach assumes that no further I/O requests can be posted to
> >> softraid(4) after a sync has started, which holds in the scenarios
> >> for which sr_raid_sync() was originally envisaged (look at the code
> >> paths that lead to sd_flush() other than DIOCCACHESYNC: activate,
> >> deactivate, close).
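> >>
> >> Distilled to its essence (the real loop adds a retry counter and
> >> diagnostics), the drain amounts to:
> >>
> >>         /* Wait until every outstanding work unit on the discipline is done. */
> >>         mtx_enter(&sd->sd_wu_mtx);
> >>         while (sd->sd_wu_pending > ios)
> >>                 msleep(sd, &sd->sd_wu_mtx, PWAIT, "sr_sync", 1 * hz);
> >>         mtx_leave(&sd->sd_wu_mtx);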
> >>
> >> However, that assumption does not hold true for a softraid(4) disk
> >> backing multiple WAPBL file systems. In this situation, new I/O requests
> >> may and very likely will be posted to softraid(4) when a sync is in
> >> progress. This may have the effect of slowing down the sync operation,
> >> as sr_raid_sync() loops waiting for the number of outstanding I/O
> >> operations to reach zero, ultimately leading to a timeout after 15
> >> seconds. This worst-case scenario will certainly happen should one of
> >> these parallel operations be another sync, as one sync will cause the
> >> other to time out.
> >>
> >> Oga, this is probably the slowdown you are seeing. I'm sorry that it
> >> took so long for the problem to be diagnosed.
> >>
> >> Below is nothing but a sketch of a fix. I plan to work on it as I find
> >> the time.
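> >>
> >> The idea, roughly (illustrative pseudocode, not the exact hunks
> >> below): mark the work units already pending when the sync starts with
> >> a new SR_WUF_SEEN flag, then wait only for those, ignoring I/O posted
> >> afterwards:
> >>
> >>         TAILQ_FOREACH(wup, &sd->sd_wu_pendq, swu_link)
> >>                 wup->swu_flags |= SR_WUF_SEEN;
> >>
> >>         while (retries-- > 0) {
> >>                 int seen = 0;
> >>
> >>                 /* Are any of the work units marked above still pending? */
> >>                 TAILQ_FOREACH(wup, &sd->sd_wu_pendq, swu_link)
> >>                         if (wup->swu_flags & SR_WUF_SEEN)
> >>                                 seen = 1;
> >>                 if (seen == 0)
> >>                         break;
> >>                 msleep(sd, &sd->sd_wu_mtx, PWAIT, "sr_sync", 1 * hz);
> >>         }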
> >>
> >> -p.
> >>
> >> diff --git sys/dev/softraid.c sys/dev/softraid.c
> >> index a9e9ff5..4a2657c 100644
> >> --- sys/dev/softraid.c
> >> +++ sys/dev/softraid.c
> >> @@ -2121,6 +2121,8 @@ sr_wu_free(struct sr_discipline *sd)
> >>
> >>         DNPRINTF(SR_D_WU, "%s: sr_wu_free %p\n", DEVNAME(sd->sd_sc), sd);
> >>
> >> +       /* XXX mutex? */
> >> +
> >>         while ((wu = TAILQ_FIRST(&sd->sd_wu_freeq)) != NULL)
> >>                 TAILQ_REMOVE(&sd->sd_wu_freeq, wu, swu_link);
> >>         while ((wu = TAILQ_FIRST(&sd->sd_wu_pendq)) != NULL)
> >> @@ -2144,6 +2146,7 @@ sr_wu_get(void *xsd)
> >>         wu = TAILQ_FIRST(&sd->sd_wu_freeq);
> >>         if (wu) {
> >>                 TAILQ_REMOVE(&sd->sd_wu_freeq, wu, swu_link);
> >> +               TAILQ_INSERT_TAIL(&sd->sd_wu_pendq, wu, swu_link);
> >>                 sd->sd_wu_pending++;
> >>         }
> >>         mtx_leave(&sd->sd_wu_mtx);
> >> @@ -2266,6 +2269,8 @@ sr_wu_done_callback(void *xwu)
> >>                         goto done;
> >>         }
> >>
> >> +       /* XXX mutex? */
> >> +
> >>         /* Remove work unit from pending queue. */
> >>         TAILQ_FOREACH(wup, &sd->sd_wu_pendq, swu_link)
> >>                 if (wup == wu)
> >> @@ -4078,6 +4083,7 @@ sr_raid_start_stop(struct sr_workunit *wu)
> >>  int
> >>  sr_raid_sync(struct sr_workunit *wu)
> >>  {
> >> +       struct sr_workunit      *wup;
> >>         struct sr_discipline    *sd = wu->swu_dis;
> >>         int                     ios = 0, rv = 0, retries = 15;
> >>
> >> @@ -4099,35 +4105,36 @@ sr_raid_sync(struct sr_workunit *wu)
> >>
> >>         /* diagnostic */
> >>         if (sd->sd_sync != 0) {
> >> -               printf("sd->sd_sync != 0, called reentrant\n");
> >> +               /* wait for ongoing syncs to finish */
> >>         }
> >>
> >>         sd->sd_sync = 1;
> >> -       for (;;) {
> >> -               if (sd->sd_wu_pending <= ios) {
> >> -                       break;
> >> -               }
> >>
> >> -               /* wait a bit for io to drain */
> >> -               if (msleep(sd, &sd->sd_wu_mtx, PWAIT, "sr_sync", 1 * hz) ==
> >> -                   EWOULDBLOCK) {
> >> -                       if (retries-- > 0) {
> >> -                               continue;
> >> -                       }
> >> +       TAILQ_FOREACH(wup, &sd->sd_wu_pendq, swu_link) {
> >> +               KASSERT((wup->swu_flags & SR_WUF_SEEN) == 0);
> >> +               wup->swu_flags |= SR_WUF_SEEN;
> >> +       }
> >>
> >> -                       /* print timeout for now */
> >> -                       printf("%s: sr_raid_sync timeout\n",
> >> -                           DEVNAME(sd->sd_sc));
> >> +       while (sd->sd_wu_pending > ios && retries-- > 0) {
> >> +               msleep(sd, &sd->sd_wu_mtx, PWAIT, "sr_sync", 1 * hz);
> >> +               TAILQ_FOREACH(wup, &sd->sd_wu_pendq, swu_link)
> >> +                       if (wup->swu_flags & SR_WUF_SEEN)
> >> +                               continue;
> >> +               rv = 0;
> >> +               break;
> >> +       }
> >>
> >> -                       DNPRINTF(SR_D_DIS, "%s: sr_raid_sync timeout\n",
> >> -                           DEVNAME(sd->sd_sc));
> >> -                       rv = 1;
> >> -                       break;
> >> -               }
> >> +       if (rv != 0)
> >> +               TAILQ_FOREACH(wup, &sd->sd_wu_pendq, swu_link) {
> >> +                       wup->swu_flags &= ~SR_WUF_SEEN;
> >>         }
> >> -       sd->sd_sync = 0;
> >> +
> >>         mtx_leave(&sd->sd_wu_mtx);
> >>
> >> +       sd->sd_sync = 0;
> >> +
> >> +       /* XXX should issue DIOCCACHESYNC on the underlying disk(s)! */
> >> +
> >>         wakeup(&sd->sd_sync);
> >>
> >>         return (rv);
> >> @@ -4183,6 +4190,8 @@ sr_schedule_wu(struct sr_workunit *wu)
> >>                 panic("sr_schedule_wu: work unit not in progress (state %i)\n",
> >>                     wu->swu_state);
> >>
> >> +       /* XXX mutex? */
> >> +
> >>         /* Walk queue backwards and fill in collider if we have one. */
> >>         TAILQ_FOREACH_REVERSE(wup, &sd->sd_wu_pendq, sr_wu_list,
> swu_link)
> >> {
> >>                 if (wu->swu_blk_end < wup->swu_blk_start ||
> >> @@ -4223,9 +4232,6 @@ sr_raid_startwu(struct sr_workunit *wu)
> >>                 wu->swu_state = SR_WU_INPROGRESS;
> >>         }
> >>
> >> -       if (wu->swu_state != SR_WU_RESTART)
> >> -               TAILQ_INSERT_TAIL(&sd->sd_wu_pendq, wu, swu_link);
> >> -
> >>         /* Start all of the individual I/Os. */
> >>         if (wu->swu_cb_active == 1)
> >>                 panic("%s: sr_startwu_callback", DEVNAME(sd->sd_sc));
> >> diff --git sys/dev/softraidvar.h sys/dev/softraidvar.h
> >> index 462a4f2..d386fa0 100644
> >> --- sys/dev/softraidvar.h
> >> +++ sys/dev/softraidvar.h
> >> @@ -387,6 +387,7 @@ struct sr_workunit {
> >>  #define SR_WUF_WAKEUP          (1<<4)          /* Wakeup on I/O
> >> completion. */
> >>  #define SR_WUF_DISCIPLINE      (1<<5)          /* Discipline specific
> >> I/O. */
> >>  #define SR_WUF_FAKE            (1<<6)          /* Faked workunit. */
> >> +#define SR_WUF_SEEN            (1<<7)          /* Visited during sync. */
> >>
> >>         /* workunit io range */
> >>         daddr_t                 swu_blk_start;
> >>
> >>
>
>
