Re: Memory barrier needed with wake_up_process()?
Hi, Felipe Balbiwrites: > Alan Stern writes: >> On Mon, 16 Jan 2017, Felipe Balbi wrote: >> >>> Sorry for the long delay, I finally have more information on this. All >>> this time I was doing something that I never considered to matter: I've >>> been running host and peripheral on the same machine. Now that I have >>> tracepoints on xHCI as well, I could see that these 30 seconds of >>> "nothing" is actuall full of xHCI activity and I can see that for the >>> duration of these 30 seconds preempt depth on the CPU that (eventually) >>> queues a request on dwc3, is always > 1 (sometimes 2, most of the time >>> 1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU >>> and g_mass_storage is spinning for over 30 seconds at which point >>> storage.ko (host side class driver) dequeues the request. >>> >>> I'll see if I can capture a fresh trace with both xHCI and dwc3 with >>> this happening, but probably not today (testing stuff for -rc). >> >> Does anything change if the host and peripheral are separate machines? > > couldn't reproduce the problem yet ;-) *yet* is a keyword here. I wouldn't call this problem "done" until I successfully run my test case for at least a week. -- balbi -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
Hi, Alan Sternwrites: > On Mon, 16 Jan 2017, Felipe Balbi wrote: > >> Sorry for the long delay, I finally have more information on this. All >> this time I was doing something that I never considered to matter: I've >> been running host and peripheral on the same machine. Now that I have >> tracepoints on xHCI as well, I could see that these 30 seconds of >> "nothing" is actuall full of xHCI activity and I can see that for the >> duration of these 30 seconds preempt depth on the CPU that (eventually) >> queues a request on dwc3, is always > 1 (sometimes 2, most of the time >> 1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU >> and g_mass_storage is spinning for over 30 seconds at which point >> storage.ko (host side class driver) dequeues the request. >> >> I'll see if I can capture a fresh trace with both xHCI and dwc3 with >> this happening, but probably not today (testing stuff for -rc). > > Does anything change if the host and peripheral are separate machines? couldn't reproduce the problem yet ;-) -- balbi -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Mon, 16 Jan 2017, Felipe Balbi wrote: > Sorry for the long delay, I finally have more information on this. All > this time I was doing something that I never considered to matter: I've > been running host and peripheral on the same machine. Now that I have > tracepoints on xHCI as well, I could see that these 30 seconds of > "nothing" is actuall full of xHCI activity and I can see that for the > duration of these 30 seconds preempt depth on the CPU that (eventually) > queues a request on dwc3, is always > 1 (sometimes 2, most of the time > 1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU > and g_mass_storage is spinning for over 30 seconds at which point > storage.ko (host side class driver) dequeues the request. > > I'll see if I can capture a fresh trace with both xHCI and dwc3 with > this happening, but probably not today (testing stuff for -rc). Does anything change if the host and peripheral are separate machines? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
Hi, Alan Sternwrites: > On Tue, 20 Sep 2016, Felipe Balbi wrote: > >> And here's trace output (complete, scroll to bottom). It seems to me >> like the thread didn't wake up at all. It didn't even try to >> execute. I'll add some more traces and try to get better information >> about what's going on. >> >> >> >> # tracer: nop >> # >> # entries-in-buffer/entries-written: 2865/2865 #P:4 >> # >> # _-=> irqs-off >> # / _=> need-resched >> #| / _---=> hardirq/softirq >> #|| / _--=> preempt-depth >> #||| / delay >> # TASK-PID CPU# TIMESTAMP FUNCTION >> # | | | | | > > Skipping to the end... > >> irq/17-dwc3-2522 [002] d...43.504199: bulk_out_complete: 0, 31/31 >> file-storage-2521 [001] 43.504202: fsg_main_thread: >> get_next_command -> 0 >> file-storage-2521 [001] 43.504288: fsg_main_thread: >> do_scsi_command -> 0 >> file-storage-2521 [001] 43.504295: fsg_main_thread: >> finish_reply -> 0 >> file-storage-2521 [001] 43.504298: fsg_main_thread: send_status >> -> 0 >> irq/17-dwc3-2522 [002] d...43.504347: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504351: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504434: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504438: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504535: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504539: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504618: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504703: bulk_in_complete: 0, >> 16384/16384 >> irq/17-dwc3-2522 [002] d...43.504794: bulk_in_complete: 0, 13/13 >> irq/17-dwc3-2522 [002] d...43.504797: bulk_out_complete: 0, 31/31 > > Like you say, it appears that the thread didn't get woken up at all. > But this is inconsistent with your earlier results. On Sep. 9, you > posted a message that ended with these lines: > >> irq/17-dwc3-3579 [003] d..1 21167.729666: bulk_out_complete: compl: bh >> 880111e6aac0 state 1 >> file-storage-3578 [002] 21167.729670: fsg_main_thread: next: bh >> 880111e6aac0 state 1 > > This indicates that in the earlier test, the thread did start running > and get_next_command should have returned. > > The trace you posted after this one doesn't seem to show anything new, > as far as I can tell. > > So I still can't tell what's happening. Maybe the patch below will > help. It concentrates on the critical area. Sorry for the long delay, I finally have more information on this. All this time I was doing something that I never considered to matter: I've been running host and peripheral on the same machine. Now that I have tracepoints on xHCI as well, I could see that these 30 seconds of "nothing" is actuall full of xHCI activity and I can see that for the duration of these 30 seconds preempt depth on the CPU that (eventually) queues a request on dwc3, is always > 1 (sometimes 2, most of the time 1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU and g_mass_storage is spinning for over 30 seconds at which point storage.ko (host side class driver) dequeues the request. I'll see if I can capture a fresh trace with both xHCI and dwc3 with this happening, but probably not today (testing stuff for -rc). -- balbi signature.asc Description: PGP signature
Re: Memory barrier needed with wake_up_process()?
On Tue, 20 Sep 2016, Felipe Balbi wrote: > And here's trace output (complete, scroll to bottom). It seems to me > like the thread didn't wake up at all. It didn't even try to > execute. I'll add some more traces and try to get better information > about what's going on. > > > > # tracer: nop > # > # entries-in-buffer/entries-written: 2865/2865 #P:4 > # > # _-=> irqs-off > # / _=> need-resched > #| / _---=> hardirq/softirq > #|| / _--=> preempt-depth > #||| / delay > # TASK-PID CPU# TIMESTAMP FUNCTION > # | | | | | Skipping to the end... > irq/17-dwc3-2522 [002] d...43.504199: bulk_out_complete: 0, 31/31 > file-storage-2521 [001] 43.504202: fsg_main_thread: > get_next_command -> 0 > file-storage-2521 [001] 43.504288: fsg_main_thread: > do_scsi_command -> 0 > file-storage-2521 [001] 43.504295: fsg_main_thread: finish_reply > -> 0 > file-storage-2521 [001] 43.504298: fsg_main_thread: send_status > -> 0 > irq/17-dwc3-2522 [002] d...43.504347: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504351: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504434: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504438: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504535: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504539: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504618: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504703: bulk_in_complete: 0, > 16384/16384 > irq/17-dwc3-2522 [002] d...43.504794: bulk_in_complete: 0, 13/13 > irq/17-dwc3-2522 [002] d...43.504797: bulk_out_complete: 0, 31/31 Like you say, it appears that the thread didn't get woken up at all. But this is inconsistent with your earlier results. On Sep. 9, you posted a message that ended with these lines: > irq/17-dwc3-3579 [003] d..1 21167.729666: bulk_out_complete: compl: bh > 880111e6aac0 state 1 > file-storage-3578 [002] 21167.729670: fsg_main_thread: next: bh > 880111e6aac0 state 1 This indicates that in the earlier test, the thread did start running and get_next_command should have returned. The trace you posted after this one doesn't seem to show anything new, as far as I can tell. So I still can't tell what's happening. Maybe the patch below will help. It concentrates on the critical area. Alan Stern Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c === --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c @@ -402,8 +402,12 @@ static void wakeup_thread(struct fsg_com smp_wmb(); /* ensure the write of bh->state is complete */ /* Tell the main thread that something has happened */ common->thread_wakeup_needed = 1; - if (common->thread_task) - wake_up_process(common->thread_task); + if (common->thread_task) { + if (wake_up_process(common->thread_task)) + trace_printk("wakeup_thread 1\n"); + else + trace_printk("wakeup_thread 0\n"); + } } static void raise_exception(struct fsg_common *common, enum fsg_state new_state) @@ -479,6 +483,8 @@ static void bulk_out_complete(struct usb req->status, req->actual, bh->bulk_out_intended_length); if (req->status == -ECONNRESET) /* Request was cancelled */ usb_ep_fifo_flush(ep); + trace_printk("bulk_out: %d, %u/%u\n", req->status, req->actual, + bh->bulk_out_intended_length); /* Hold the lock while we update the request and buffer states */ smp_wmb(); @@ -2204,13 +2210,17 @@ static int get_next_command(struct fsg_c /* Wait for the CBW to arrive */ while (bh->state != BUF_STATE_FULL) { + trace_printk("get_next: sleep\n"); rc = sleep_thread(common, true); + trace_printk("get_next: sleep -> %d loop %d\n", + rc, bh->state != BUF_STATE_FULL); if (rc) return rc; } smp_rmb(); rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; bh->state = BUF_STATE_EMPTY; + trace_printk("get_next: return %d\n", rc); return rc; } -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at
Re: Memory barrier needed with wake_up_process()?
Hi, Felipe Balbiwrites: > [ Unknown signature status ] > > Hi, > > Alan Stern writes: > > [...] > >>> $ grep -RnH "raise_exception\|fsg_main_thread" /tmp/trace.txt >>> /tmp/trace.txt:111743: irq/17-dwc3-3527 [003] d..164.833078: >>> raise_exception: raise_exception 4 >>> /tmp/trace.txt:111745:file-storage-3526 [002] 64.833139: >>> fsg_main_thread: get_next_command -> -4 >>> /tmp/trace.txt:111746:file-storage-3526 [002] 64.833140: >>> fsg_main_thread: handling exception >>> /tmp/trace.txt:112950: irq/17-dwc3-3527 [003] d..164.951349: >>> raise_exception: raise_exception 4 >>> /tmp/trace.txt:112956:file-storage-3526 [002] 64.951401: >>> fsg_main_thread: handling exception >>> >>> Any ideas? >> >> I'm afraid not. The only thing I can think of to try next is complete >> tracing of fsg_main_thread, as in the patch below. It will generate >> continuous output as long as the gadget is doing something, but there >> doesn't seem to be any way to avoid this. At least when everything >> stops, it should be able to tell us exactly where and why. > > tried with your changes plus a trace_printk() added to both > bulk_out_complete and bulk_in_complete. Here's the diff: another version of diff and logs. Any ideas of what could be going on? (full trace compressed and attached) diff --git a/drivers/usb/gadget/function/f_mass_storage.c b/drivers/usb/gadget/function/f_mass_storage.c index 8f3659b65f53..de0a9ebe7f61 100644 --- a/drivers/usb/gadget/function/f_mass_storage.c +++ b/drivers/usb/gadget/function/f_mass_storage.c @@ -395,11 +395,24 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct usb_ep *ep) /* Caller must hold fsg->lock */ static void wakeup_thread(struct fsg_common *common) { + struct task_struct *p = common->thread_task; + smp_wmb(); /* ensure the write of bh->state is complete */ /* Tell the main thread that something has happened */ common->thread_wakeup_needed = 1; - if (common->thread_task) - wake_up_process(common->thread_task); + + trace_printk("wkup needed %d task %p [comm=%s pid=%d prio=%d target_cpu=%03d on_rq %d state %lx]\n", + common->thread_wakeup_needed, p, p->comm, p->pid, + p->prio, task_cpu(p), p->on_rq, p->state); + + if (common->thread_task) { + int rc; + rc = wake_up_process(common->thread_task); + if (rc) + trace_printk("thread_task woken up\n"); + else + trace_printk("NOT woken up\n"); + } } static void raise_exception(struct fsg_common *common, enum fsg_state new_state) @@ -411,6 +424,7 @@ static void raise_exception(struct fsg_common *common, enum fsg_state new_state) * If a lower-or-equal priority exception is in progress, preempt it * and notify the main thread by sending it a signal. */ + trace_printk("raise_exception %d\n", new_state); spin_lock_irqsave(>lock, flags); if (common->state <= new_state) { common->exception_req_tag = common->ep0_req_tag; @@ -449,6 +463,8 @@ static void bulk_in_complete(struct usb_ep *ep, struct usb_request *req) struct fsg_common *common = ep->driver_data; struct fsg_buffhd *bh = req->context; + trace_printk("%d, %u/%u\n", req->status, req->actual, req->length); + if (req->status || req->actual != req->length) DBG(common, "%s --> %d, %u/%u\n", __func__, req->status, req->actual, req->length); @@ -470,6 +486,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct usb_request *req) struct fsg_buffhd *bh = req->context; dump_msg(common, "bulk-out", req->buf, req->actual); + + trace_printk("%d, %u/%u\n", req->status, req->actual, bh->bulk_out_intended_length); if (req->status || req->actual != bh->bulk_out_intended_length) DBG(common, "%s --> %d, %u/%u\n", __func__, req->status, req->actual, bh->bulk_out_intended_length); @@ -573,6 +591,10 @@ static void start_transfer(struct fsg_dev *fsg, struct usb_ep *ep, spin_unlock_irq(>common->lock); rc = usb_ep_queue(ep, req, GFP_KERNEL); + + trace_printk("%d: %s: req %u bytes state %d ---> %d\n", __LINE__, ep->name, + req->length, *state, rc); + if (rc == 0) return; /* All good, we're done */ @@ -2496,6 +2518,7 @@ static void handle_exception(struct fsg_common *common) static int fsg_main_thread(void *common_) { struct fsg_common *common = common_; + int rc; /* * Allow the thread to be killed by a signal, but set the signal mask @@ -2519,6 +2542,7 @@ static int fsg_main_thread(void *common_)
Re: Memory barrier needed with wake_up_process()?
On Mon, 19 Sep 2016, Felipe Balbi wrote: > >> file-storage-3578 [002] 21167.727072: fsg_main_thread: next: bh > >> 880111e69a80 state 1 > >> file-storage-3578 [002] 21167.729458: fsg_main_thread: next: bh > >> 880111e6aac0 state 2 > >> irq/17-dwc3-3579 [003] d..1 21167.729666: bulk_out_complete: compl: > >> bh 880111e6aac0 state 1 > >> file-storage-3578 [002] 21167.729670: fsg_main_thread: next: bh > >> 880111e6aac0 state 1 > > > > And this is where everything stopped? > > yeah, that's everything. > > > This also looks normal. So the question is what happened when > > get_next_command() returned after this? > > > > Felipe, maybe the patch below (in place of your current patch) will > > help. Since the events that it logs are all supposed to be unusual, > > you can use printk if you want, but I wrote it with trace_printk. > > I've applied your patch and it wasn't giving me any output, which hinted > that g_mass_storage wasn't returning any failures. So I enabled dwc3's > traces to get more data out of it. Here's the final snippet (with > comments, again). Let me know if you want the entire thing (it's > ~14MiB). > > > irq/17-dwc3-3527 [003] d..134.215214: dwc3_event: event (00110301): > > Link Change [U1] > > irq/17-dwc3-3527 [003] d..134.215214: dwc3_event: event (4086): > > ep1in: Transfer In-Progress > > irq/17-dwc3-3527 [003] d..134.215214: dwc3_complete_trb: ep1in: 2/255 > > trb 880084813e70 buf 80808000 size 0 ctrl 0c18 > > (hlcS:SC:normal) > > irq/17-dwc3-3527 [003] d..134.215214: dwc3_gadget_giveback: ep1in: req > > 88016d046840 length 16384/16384 zsI ==> 0 > > irq/17-dwc3-3527 [003] d...34.215215: usb_gadget_giveback_request: > > ep1in: length 16384/16384 sgs 0/0 stream 0 zsI status 0 --> 0 > > irq/17-dwc3-3527 [003] d..134.215218: dwc3_gadget_ep_cmd: ep1in: cmd > > 'Update Transfer' [196615] params --> status: > > Successful > > irq/17-dwc3-3527 [003] d..134.215281: dwc3_event: event (00100301): > > Link Change [U0] > > irq/17-dwc3-3527 [003] d..134.215281: dwc3_event: event (00110301): > > Link Change [U1] > > irq/17-dwc3-3527 [003] d..134.215282: dwc3_event: event (4086): > > ep1in: Transfer In-Progress > > irq/17-dwc3-3527 [003] d..134.215282: dwc3_complete_trb: ep1in: 1/255 > > trb 880084813e80 buf 8080c000 size 0 ctrl 0c18 > > (hlcS:SC:normal) > > irq/17-dwc3-3527 [003] d..134.215282: dwc3_gadget_giveback: ep1in: req > > 88016d046f00 length 13/13 zsI ==> 0 > > irq/17-dwc3-3527 [003] d...34.215283: usb_gadget_giveback_request: > > ep1in: length 13/13 sgs 0/0 stream 0 zsI status 0 --> 0 > > irq/17-dwc3-3527 [003] d..134.215284: dwc3_event: event (00100301): > > Link Change [U0] > > irq/17-dwc3-3527 [003] d..134.215284: dwc3_event: event (00110301): > > Link Change [U1] > > irq/17-dwc3-3527 [003] d..134.215284: dwc3_event: event (6084): > > ep1out: Transfer In-Progress > > irq/17-dwc3-3527 [003] d..134.215284: dwc3_complete_trb: ep1out: 1/255 > > trb 880084804320 buf 8080c800 size 993 ctrl 0c18 > > (hlcS:SC:normal) > > irq/17-dwc3-3527 [003] d..134.215285: dwc3_gadget_giveback: ep1out: > > req 880154205300 length 31/1024 zsI ==> 0 > > irq/17-dwc3-3527 [003] d...34.215285: usb_gadget_giveback_request: > > ep1out: length 31/1024 sgs 0/0 stream 0 zsI status 0 --> 0 > > completed and gave back CBW. > > > irq/17-dwc3-3527 [003] d..134.215349: dwc3_event: event (00100301): > > Link Change [U0] > > irq/17-dwc3-3527 [003] d..134.215349: dwc3_event: event (00110301): > > Link Change [U1] > > irq/17-dwc3-3527 [003] d..134.225616: dwc3_event: event (00120301): > > Link Change [U2] > > irq/17-dwc3-3527 [003] d...34.225617: usb_gadget_vbus_draw: speed 5/5 > > state 7 0mA [sg:out_aligned:self-powered:activated:connected] --> -95 > > irq/17-dwc3-3527 [003] d..164.832221: dwc3_event: event (00100301): > > Link Change [U0] > > 30 seconds of nothing. > > > irq/17-dwc3-3527 [003] d..164.832240: dwc3_event: event (c040): > > ep0out: Transfer Complete > > irq/17-dwc3-3527 [003] d..164.832243: dwc3_ep0: ep0out: Transfer > > Complete: state 'Setup Phase' > > irq/17-dwc3-3527 [003] d..164.832243: dwc3_ep0: Setup Phase > > irq/17-dwc3-3527 [003] d..164.832244: dwc3_ctrl_req: bRequestType 00 > > bRequest 01 wValue 0030 wIndex wLength 0 > > irq/17-dwc3-3527 [003] d..164.832244: dwc3_ep0: USB_REQ_CLEAR_FEATURE > > irq/17-dwc3-3527 [003] d..164.832253: dwc3_event: event (20c2): > > ep0in: Transfer Not Ready (Not Active) > > irq/17-dwc3-3527 [003] d..164.832254: dwc3_ep0: ep0in: Transfer Not > > Ready (Not Active): state 'Setup Phase' > > irq/17-dwc3-3527 [003] d..164.832254: dwc3_ep0: Control Status > > irq/17-dwc3-3527 [003] d..164.832255: dwc3_prepare_trb:
Re: Memory barrier needed with wake_up_process()?
Hi Alan, Alan Sternwrites: > On Fri, 9 Sep 2016, Felipe Balbi wrote: > >> Finally :-) Here's the diff I used: >> >> diff --git a/drivers/usb/gadget/function/f_mass_storage.c >> b/drivers/usb/gadget/function/f_mass_storage.c >> index 8f3659b65f53..0716024f6b65 100644 >> --- a/drivers/usb/gadget/function/f_mass_storage.c >> +++ b/drivers/usb/gadget/function/f_mass_storage.c >> @@ -481,6 +481,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct >> usb_request *req) >> spin_lock(>lock); >> bh->outreq_busy = 0; >> bh->state = BUF_STATE_FULL; >> + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) >> + trace_printk("compl: bh %p state %d\n", bh, bh->state); >> wakeup_thread(common); >> spin_unlock(>lock); >> } >> @@ -2208,6 +2210,7 @@ static int get_next_command(struct fsg_common *common) >> rc = sleep_thread(common, true); >> if (rc) >> return rc; >> + trace_printk("next: bh %p state %d\n", bh, bh->state); >> } >> smp_rmb(); >> rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; >> >> >> And here's trace output: >> >> # tracer: nop >> # >> # entries-in-buffer/entries-written: 1002/1002 #P:4 >> # >> # _-=> irqs-off >> # / _=> need-resched >> #| / _---=> hardirq/softirq >> #|| / _--=> preempt-depth >> #||| / delay >> # TASK-PID CPU# TIMESTAMP FUNCTION >> # | | | | | >> file-storage-3578 [000] 21166.789127: fsg_main_thread: next: bh >> 880111e69a00 state 2 >> file-storage-3578 [000] 21166.789312: fsg_main_thread: next: bh >> 880111e69a00 state 2 >> irq/17-dwc3-3579 [003] d..1 21166.789395: bulk_out_complete: compl: bh >> 880111e69a00 state 1 >> file-storage-3578 [000] 21166.789445: fsg_main_thread: next: bh >> 880111e69a00 state 1 > > Okay, that's normal. 2 = BUF_STATE_BUSY, 1 = BUF_STATE_FULL. So we get woken > up a couple of times while the transfer is in progress (probably because some > earlier buffers have finished transferring), then the CBW transfer completes > and the buffer is read. > > ... > >> file-storage-3578 [002] 21167.726827: fsg_main_thread: next: bh >> 880111e69a80 state 2 >> irq/17-dwc3-3579 [003] d..1 21167.727066: bulk_out_complete: compl: bh >> 880111e69a80 state 1 >> file-storage-3578 [002] 21167.727072: fsg_main_thread: next: bh >> 880111e69a80 state 1 >> file-storage-3578 [002] 21167.729458: fsg_main_thread: next: bh >> 880111e6aac0 state 2 >> irq/17-dwc3-3579 [003] d..1 21167.729666: bulk_out_complete: compl: bh >> 880111e6aac0 state 1 >> file-storage-3578 [002] 21167.729670: fsg_main_thread: next: bh >> 880111e6aac0 state 1 > > And this is where everything stopped? yeah, that's everything. > This also looks normal. So the question is what happened when > get_next_command() returned after this? > > Felipe, maybe the patch below (in place of your current patch) will > help. Since the events that it logs are all supposed to be unusual, > you can use printk if you want, but I wrote it with trace_printk. I've applied your patch and it wasn't giving me any output, which hinted that g_mass_storage wasn't returning any failures. So I enabled dwc3's traces to get more data out of it. Here's the final snippet (with comments, again). Let me know if you want the entire thing (it's ~14MiB). > irq/17-dwc3-3527 [003] d..134.215214: dwc3_event: event (00110301): Link > Change [U1] > irq/17-dwc3-3527 [003] d..134.215214: dwc3_event: event (4086): > ep1in: Transfer In-Progress > irq/17-dwc3-3527 [003] d..134.215214: dwc3_complete_trb: ep1in: 2/255 > trb 880084813e70 buf 80808000 size 0 ctrl 0c18 > (hlcS:SC:normal) > irq/17-dwc3-3527 [003] d..134.215214: dwc3_gadget_giveback: ep1in: req > 88016d046840 length 16384/16384 zsI ==> 0 > irq/17-dwc3-3527 [003] d...34.215215: usb_gadget_giveback_request: > ep1in: length 16384/16384 sgs 0/0 stream 0 zsI status 0 --> 0 > irq/17-dwc3-3527 [003] d..134.215218: dwc3_gadget_ep_cmd: ep1in: cmd > 'Update Transfer' [196615] params --> status: > Successful > irq/17-dwc3-3527 [003] d..134.215281: dwc3_event: event (00100301): Link > Change [U0] > irq/17-dwc3-3527 [003] d..134.215281: dwc3_event: event (00110301): Link > Change [U1] > irq/17-dwc3-3527 [003] d..134.215282: dwc3_event: event (4086): > ep1in: Transfer In-Progress > irq/17-dwc3-3527 [003] d..134.215282: dwc3_complete_trb: ep1in: 1/255 > trb 880084813e80 buf 8080c000 size 0 ctrl 0c18 > (hlcS:SC:normal) >
Re: Memory barrier needed with wake_up_process()?
On Fri, 9 Sep 2016, Felipe Balbi wrote: > Finally :-) Here's the diff I used: > > diff --git a/drivers/usb/gadget/function/f_mass_storage.c > b/drivers/usb/gadget/function/f_mass_storage.c > index 8f3659b65f53..0716024f6b65 100644 > --- a/drivers/usb/gadget/function/f_mass_storage.c > +++ b/drivers/usb/gadget/function/f_mass_storage.c > @@ -481,6 +481,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct > usb_request *req) > spin_lock(>lock); > bh->outreq_busy = 0; > bh->state = BUF_STATE_FULL; > + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) > + trace_printk("compl: bh %p state %d\n", bh, bh->state); > wakeup_thread(common); > spin_unlock(>lock); > } > @@ -2208,6 +2210,7 @@ static int get_next_command(struct fsg_common *common) > rc = sleep_thread(common, true); > if (rc) > return rc; > + trace_printk("next: bh %p state %d\n", bh, bh->state); > } > smp_rmb(); > rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; > > > And here's trace output: > > # tracer: nop > # > # entries-in-buffer/entries-written: 1002/1002 #P:4 > # > # _-=> irqs-off > # / _=> need-resched > #| / _---=> hardirq/softirq > #|| / _--=> preempt-depth > #||| / delay > # TASK-PID CPU# TIMESTAMP FUNCTION > # | | | | | > file-storage-3578 [000] 21166.789127: fsg_main_thread: next: bh > 880111e69a00 state 2 > file-storage-3578 [000] 21166.789312: fsg_main_thread: next: bh > 880111e69a00 state 2 > irq/17-dwc3-3579 [003] d..1 21166.789395: bulk_out_complete: compl: bh > 880111e69a00 state 1 > file-storage-3578 [000] 21166.789445: fsg_main_thread: next: bh > 880111e69a00 state 1 Okay, that's normal. 2 = BUF_STATE_BUSY, 1 = BUF_STATE_FULL. So we get woken up a couple of times while the transfer is in progress (probably because some earlier buffers have finished transferring), then the CBW transfer completes and the buffer is read. ... > file-storage-3578 [002] 21167.726827: fsg_main_thread: next: bh > 880111e69a80 state 2 > irq/17-dwc3-3579 [003] d..1 21167.727066: bulk_out_complete: compl: bh > 880111e69a80 state 1 > file-storage-3578 [002] 21167.727072: fsg_main_thread: next: bh > 880111e69a80 state 1 > file-storage-3578 [002] 21167.729458: fsg_main_thread: next: bh > 880111e6aac0 state 2 > irq/17-dwc3-3579 [003] d..1 21167.729666: bulk_out_complete: compl: bh > 880111e6aac0 state 1 > file-storage-3578 [002] 21167.729670: fsg_main_thread: next: bh > 880111e6aac0 state 1 And this is where everything stopped? This also looks normal. So the question is what happened when get_next_command() returned after this? Felipe, maybe the patch below (in place of your current patch) will help. Since the events that it logs are all supposed to be unusual, you can use printk if you want, but I wrote it with trace_printk. Alan Stern Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c === --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c @@ -415,6 +415,7 @@ static void raise_exception(struct fsg_c * If a lower-or-equal priority exception is in progress, preempt it * and notify the main thread by sending it a signal. */ + trace_printk("raise_exception %d\n", new_state); spin_lock_irqsave(>lock, flags); if (common->state <= new_state) { common->exception_req_tag = common->ep0_req_tag; @@ -2495,6 +2496,7 @@ static void handle_exception(struct fsg_ static int fsg_main_thread(void *common_) { struct fsg_common *common = common_; + int rc; /* * Allow the thread to be killed by a signal, but set the signal mask @@ -2518,6 +2520,7 @@ static int fsg_main_thread(void *common_ /* The main loop */ while (common->state != FSG_STATE_TERMINATED) { if (exception_in_progress(common) || signal_pending(current)) { + trace_printk("handling exception\n"); handle_exception(common); continue; } @@ -2527,24 +2530,38 @@ static int fsg_main_thread(void *common_ continue; } - if (get_next_command(common)) + rc = get_next_command(common); + if (rc) { + trace_printk("get_next_command -> %d\n", rc); continue; + }
Re: Memory barrier needed with wake_up_process()?
Hi, Felipe Balbiwrites: > Alan Stern writes: >> On Tue, 6 Sep 2016, Peter Zijlstra wrote: >> >>> On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote: >>> > On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote: >>> >>> > > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just >>> > > adds extra overhead which makes the problem much, much less likely to >>> > > happen. Does that sound plausible to you? >>> > >>> > I did consider that, but I've not sufficiently grokked the code to rule >>> > out actual fail. So let me stare at this a bit more. >>> >>> OK, so I'm really not seeing it, we've got: >>> >>> while (bh->state != FULL) { >>> for (;;) { >>> set_current_state(INTERRUPTIBLE); /* MB after */ >>> if (signal_pending(current)) >>> return -EINTR; >>> if (common->thread_wakeup_needed) >>> break; >>> schedule(); /* MB */ >>> } >>> __set_current_state(RUNNING); >>> common->thread_wakeup_needed = 0; >>> smp_rmb(); /* NOP */ >>> } >>> >>> >>> VS. >>> >>> >>> spin_lock(>lock); /* MB */ >>> bh->state = FULL; >>> smp_wmb(); /* NOP */ >>> common->thread_wakeup_needed = 1; >>> wake_up_process(common->thread_task); /* MB before */ >>> spin_unlock(>lock); >>> >>> >>> >>> (the MB annotations specific to x86, not true in general) >>> >>> >>> If we observe thread_wakeup_needed, we must also observe bh->state. >>> >>> And the sleep/wakeup ordering is also correct, we either see >>> thread_wakeup_needed and continue, or we see task->state == RUNNING >>> (from the wakeup) and NO-OP schedule(). The MB from set_current_statE() >>> then matches with the MB from wake_up_process() to ensure we must see >>> thead_wakeup_needed. >>> >>> Or, we go sleep, and get woken up, at which point the same happens. >>> Since the waking CPU gets the task back on its RQ the happens-before >>> chain includes the waking CPUs state along with the state of the task >>> itself before it went to sleep. >>> >>> At which point we're back where we started, once we see >>> thread_wakeup_needed we must then also see bh->state (and all state >>> prior to that on the waking CPU). >>> >>> >>> >>> There's enough cruft in the while-sleep loop to force reload bh->state. >>> >>> Load/store tearing cannot be a problem because all values are single >>> bytes (the variables are multi bytes, but all values used only affect >>> the LSB). >>> >>> Colour me puzzled. >> >> Felipe, can you please try this patch on an unmodified tree? If the >> problem still occurs, what shows up in the kernel log? >> >> Alan Stern >> >> >> >> Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c >> === >> --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c >> +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c >> @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb >> spin_lock(>lock); >> bh->outreq_busy = 0; >> bh->state = BUF_STATE_FULL; >> +if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) >> +INFO(common, "compl: bh %p state %d\n", bh, bh->state); >> wakeup_thread(common); >> spin_unlock(>lock); >> } >> @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c >> rc = sleep_thread(common, true); >> if (rc) >> return rc; >> +INFO(common, "next: bh %p state %d\n", bh, bh->state); >> } >> smp_rmb(); >> rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; > > I've replace INFO() with trace_printk() (which is what I have been using > anyway): > > diff --git a/drivers/usb/gadget/function/f_mass_storage.c > b/drivers/usb/gadget/function/f_mass_storage.c > index 2505117e88e8..dbc6a380b38b 100644 > --- a/drivers/usb/gadget/function/f_mass_storage.c > +++ b/drivers/usb/gadget/function/f_mass_storage.c > @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct > usb_request *req) > spin_lock(>lock); > bh->outreq_busy = 0; > bh->state = BUF_STATE_FULL; > + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) > + trace_printk("compl: bh %p state %d\n", bh, bh->state); > wakeup_thread(common); > spin_unlock(>lock); > } > @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_common *common) > rc = sleep_thread(common, true); > if (rc) > return rc; > + trace_printk("next: bh %p state %d\n", bh, bh->state); > } > smp_rmb(); > rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; > > But I can't reproduce as reliably as before. I'll keep the thing running > an infinite loop which will stop only when interrupts in UDC (dwc3 in > this case) stop
Re: Memory barrier needed with wake_up_process()?
Hi, Alan Sternwrites: > On Tue, 6 Sep 2016, Peter Zijlstra wrote: > >> On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote: >> > On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote: >> >> > > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just >> > > adds extra overhead which makes the problem much, much less likely to >> > > happen. Does that sound plausible to you? >> > >> > I did consider that, but I've not sufficiently grokked the code to rule >> > out actual fail. So let me stare at this a bit more. >> >> OK, so I'm really not seeing it, we've got: >> >> while (bh->state != FULL) { >> for (;;) { >> set_current_state(INTERRUPTIBLE); /* MB after */ >> if (signal_pending(current)) >> return -EINTR; >> if (common->thread_wakeup_needed) >> break; >> schedule(); /* MB */ >> } >> __set_current_state(RUNNING); >> common->thread_wakeup_needed = 0; >> smp_rmb(); /* NOP */ >> } >> >> >> VS. >> >> >> spin_lock(>lock); /* MB */ >> bh->state = FULL; >> smp_wmb(); /* NOP */ >> common->thread_wakeup_needed = 1; >> wake_up_process(common->thread_task); /* MB before */ >> spin_unlock(>lock); >> >> >> >> (the MB annotations specific to x86, not true in general) >> >> >> If we observe thread_wakeup_needed, we must also observe bh->state. >> >> And the sleep/wakeup ordering is also correct, we either see >> thread_wakeup_needed and continue, or we see task->state == RUNNING >> (from the wakeup) and NO-OP schedule(). The MB from set_current_statE() >> then matches with the MB from wake_up_process() to ensure we must see >> thead_wakeup_needed. >> >> Or, we go sleep, and get woken up, at which point the same happens. >> Since the waking CPU gets the task back on its RQ the happens-before >> chain includes the waking CPUs state along with the state of the task >> itself before it went to sleep. >> >> At which point we're back where we started, once we see >> thread_wakeup_needed we must then also see bh->state (and all state >> prior to that on the waking CPU). >> >> >> >> There's enough cruft in the while-sleep loop to force reload bh->state. >> >> Load/store tearing cannot be a problem because all values are single >> bytes (the variables are multi bytes, but all values used only affect >> the LSB). >> >> Colour me puzzled. > > Felipe, can you please try this patch on an unmodified tree? If the > problem still occurs, what shows up in the kernel log? > > Alan Stern > > > > Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c > === > --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c > +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c > @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb > spin_lock(>lock); > bh->outreq_busy = 0; > bh->state = BUF_STATE_FULL; > + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) > + INFO(common, "compl: bh %p state %d\n", bh, bh->state); > wakeup_thread(common); > spin_unlock(>lock); > } > @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c > rc = sleep_thread(common, true); > if (rc) > return rc; > + INFO(common, "next: bh %p state %d\n", bh, bh->state); > } > smp_rmb(); > rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; I've replace INFO() with trace_printk() (which is what I have been using anyway): diff --git a/drivers/usb/gadget/function/f_mass_storage.c b/drivers/usb/gadget/function/f_mass_storage.c index 2505117e88e8..dbc6a380b38b 100644 --- a/drivers/usb/gadget/function/f_mass_storage.c +++ b/drivers/usb/gadget/function/f_mass_storage.c @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct usb_request *req) spin_lock(>lock); bh->outreq_busy = 0; bh->state = BUF_STATE_FULL; + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) + trace_printk("compl: bh %p state %d\n", bh, bh->state); wakeup_thread(common); spin_unlock(>lock); } @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_common *common) rc = sleep_thread(common, true); if (rc) return rc; + trace_printk("next: bh %p state %d\n", bh, bh->state); } smp_rmb(); rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; But I can't reproduce as reliably as before. I'll keep the thing running an infinite loop which will stop only when interrupts in UDC (dwc3 in this case) stop increasing. -- balbi signature.asc Description: PGP signature
Re: Memory barrier needed with wake_up_process()?
On Tue, Sep 06, 2016 at 10:46:55AM -0400, Alan Stern wrote: Not knowing where INFO() goes, you should use trace_printk() not printk(), as the former is strictly per cpu, while the latter is globally serialized and can hide all these problems. > Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c > === > --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c > +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c > @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb > spin_lock(>lock); > bh->outreq_busy = 0; > bh->state = BUF_STATE_FULL; > + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) > + INFO(common, "compl: bh %p state %d\n", bh, bh->state); > wakeup_thread(common); > spin_unlock(>lock); > } > @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c > rc = sleep_thread(common, true); > if (rc) > return rc; > + INFO(common, "next: bh %p state %d\n", bh, bh->state); > } > smp_rmb(); > rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Tue, 6 Sep 2016, Peter Zijlstra wrote: > On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote: > > On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote: > > > > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just > > > adds extra overhead which makes the problem much, much less likely to > > > happen. Does that sound plausible to you? > > > > I did consider that, but I've not sufficiently grokked the code to rule > > out actual fail. So let me stare at this a bit more. > > OK, so I'm really not seeing it, we've got: > > while (bh->state != FULL) { > for (;;) { > set_current_state(INTERRUPTIBLE); /* MB after */ > if (signal_pending(current)) > return -EINTR; > if (common->thread_wakeup_needed) > break; > schedule(); /* MB */ > } > __set_current_state(RUNNING); > common->thread_wakeup_needed = 0; > smp_rmb(); /* NOP */ > } > > > VS. > > > spin_lock(>lock); /* MB */ > bh->state = FULL; > smp_wmb(); /* NOP */ > common->thread_wakeup_needed = 1; > wake_up_process(common->thread_task); /* MB before */ > spin_unlock(>lock); > > > > (the MB annotations specific to x86, not true in general) > > > If we observe thread_wakeup_needed, we must also observe bh->state. > > And the sleep/wakeup ordering is also correct, we either see > thread_wakeup_needed and continue, or we see task->state == RUNNING > (from the wakeup) and NO-OP schedule(). The MB from set_current_statE() > then matches with the MB from wake_up_process() to ensure we must see > thead_wakeup_needed. > > Or, we go sleep, and get woken up, at which point the same happens. > Since the waking CPU gets the task back on its RQ the happens-before > chain includes the waking CPUs state along with the state of the task > itself before it went to sleep. > > At which point we're back where we started, once we see > thread_wakeup_needed we must then also see bh->state (and all state > prior to that on the waking CPU). > > > > There's enough cruft in the while-sleep loop to force reload bh->state. > > Load/store tearing cannot be a problem because all values are single > bytes (the variables are multi bytes, but all values used only affect > the LSB). > > Colour me puzzled. Felipe, can you please try this patch on an unmodified tree? If the problem still occurs, what shows up in the kernel log? Alan Stern Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c === --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb spin_lock(>lock); bh->outreq_busy = 0; bh->state = BUF_STATE_FULL; + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN) + INFO(common, "compl: bh %p state %d\n", bh, bh->state); wakeup_thread(common); spin_unlock(>lock); } @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c rc = sleep_thread(common, true); if (rc) return rc; + INFO(common, "next: bh %p state %d\n", bh, bh->state); } smp_rmb(); rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO; -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote: > On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote: > > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just > > adds extra overhead which makes the problem much, much less likely to > > happen. Does that sound plausible to you? > > I did consider that, but I've not sufficiently grokked the code to rule > out actual fail. So let me stare at this a bit more. OK, so I'm really not seeing it, we've got: while (bh->state != FULL) { for (;;) { set_current_state(INTERRUPTIBLE); /* MB after */ if (signal_pending(current)) return -EINTR; if (common->thread_wakeup_needed) break; schedule(); /* MB */ } __set_current_state(RUNNING); common->thread_wakeup_needed = 0; smp_rmb(); /* NOP */ } VS. spin_lock(>lock); /* MB */ bh->state = FULL; smp_wmb(); /* NOP */ common->thread_wakeup_needed = 1; wake_up_process(common->thread_task); /* MB before */ spin_unlock(>lock); (the MB annotations specific to x86, not true in general) If we observe thread_wakeup_needed, we must also observe bh->state. And the sleep/wakeup ordering is also correct, we either see thread_wakeup_needed and continue, or we see task->state == RUNNING (from the wakeup) and NO-OP schedule(). The MB from set_current_statE() then matches with the MB from wake_up_process() to ensure we must see thead_wakeup_needed. Or, we go sleep, and get woken up, at which point the same happens. Since the waking CPU gets the task back on its RQ the happens-before chain includes the waking CPUs state along with the state of the task itself before it went to sleep. At which point we're back where we started, once we see thread_wakeup_needed we must then also see bh->state (and all state prior to that on the waking CPU). There's enough cruft in the while-sleep loop to force reload bh->state. Load/store tearing cannot be a problem because all values are single bytes (the variables are multi bytes, but all values used only affect the LSB). Colour me puzzled. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote: > > Could you confirm that bulk_{in,out}_complete() work on different > > usb_request structures, and they can not, at any time, get called on the > > _same_ request? > > usb_requests are allocated for a specific endpoint and USB Device > Controller (UDC) drivers refuse to queue requests allocated for epX to > epY, so this really can never happen. Good, thanks! > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just > adds extra overhead which makes the problem much, much less likely to > happen. Does that sound plausible to you? I did consider that, but I've not sufficiently grokked the code to rule out actual fail. So let me stare at this a bit more. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
Hi, Peter Zijlstrawrites: > On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote: >> On Mon, 5 Sep 2016, Peter Zijlstra wrote: >> >> > > Actually, seeing it written out like this, one realizes that it really >> > > ought to be: >> > > >> > > CPU 0 CPU 1 >> > > -- >> > > Start DMAHandle DMA-complete irq >> > > Sleep until bh->statesmp_mb() >> > > set bh->state >> > > Wake up CPU 0 >> > > smp_mb() >> > > Compute rc based on contents of the DMA buffer >> > > >> > > (Bear in mind also that on some platforms, the I/O operation is carried >> > > out by PIO rather than DMA.) > >> > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC. >> > Its only defined to do CPU/CPU interactions. >> >> Suppose the DMA master finishes filling up an input buffer and issues a >> completion irq to CPU1. Presumably the data would then be visible to >> CPU1 if the interrupt handler looked at it. So if CPU1 executes smp_mb >> before setting bh->state and waking up CPU0, and if CPU0 executes >> smp_mb after testing bh->state and before reading the data buffer, >> wouldn't CPU0 then see the correct data in the buffer? Even if CPU0 >> never did go to sleep? > > Couple of notes here; I would expect the DMA master to make its stores > _globally_ visible on 'completion'. Because I'm not sure our smp_mb() > would help make it globally visible, since its only defined on CPU/CPU > interactions, not external. > > Note that for example ARM has the distinction where smp_mb() uses > DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST} and mb() uses DSB-SY. > > The ISH domain is the Inner-SHarable (IIRC) and only includes CPUs. The > OSH is Outer-SHarable and adds external agents like DMA (also includes > CPUs). The DSB-SY thing is even heavier and syncs world or something; I > always forget these details. > >> Or would something more be needed? > > The thing is, this is x86 (TSO). Most everything is globally > ordered/visible and full barriers. > > The only reorder allowed by TSO is a later read can happen before a > prior store. That is, only: > > X = 1; > smp_mb(); > r = Y; > > actually needs a full barrier. All the other variants are no-ops. > > Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()). > Which seems to imply DMA is coherent and globally visible. > >> > I would very much expect the device IO stuff to order things for us in >> > this case. "Start DMA" should very much include sufficient fences to >> > ensure the data under DMA is visible to the DMA engine, this would very >> > much include things like flushing store buffers and maybe even writeback >> > caches, depending on platform needs. >> > >> > At the same time, I would expect "Handle DMA-complete irq", even if it >> > were done with a PIO polling loop, to guarantee ordering against later >> > operations such that 'complete' really means that. >> >> That's what I would expect too. >> >> Back in the original email thread where the problem was first reported, >> Felipe says that the problem appears to be something else. Here's what >> it looks like now, in schematic form: >> >> CPU0 >> >> get_next_command(): >>while (bh->state != BUF_STATE_EMPTY) >> sleep_thread(); >>start an input request for bh >>while (bh->state != BUF_STATE_FULL) >> sleep_thread(); >> >> As mentioned above, the input involves DMA and is terminated by an irq. >> The request's completion handler is bulk_out_complete(): >> >> CPU1 >> >> bulk_out_complete(): >>bh->state = BUF_STATE_FULL; >>wakeup_thread(); >> >> According to Felipe, when CPU0 wakes up and checks bh->state, it sees a >> value different from BUF_STATE_FULL. So it goes back to sleep again >> and doesn't make any forward progress. >> >> It's possible that something else is changing bh->state when it >> shouldn't. But if this were the explanation, why would Felipe see that >> the problem goes away when he changes the memory barriers in >> sleep_thread() and wakeup_thread() to smp_mb()? > > Good question, let me stare at the code more. > > Could you confirm that bulk_{in,out}_complete() work on different > usb_request structures, and they can not, at any time, get called on the > _same_ request? usb_requests are allocated for a specific endpoint and USB Device Controller (UDC) drivers refuse to queue requests allocated for epX to epY, so this really can never happen. My fear now, however, is that changing smp_[rw]mb() to smp_mb() just adds extra overhead which makes the problem much, much less likely to happen. Does that sound plausible to you? -- balbi signature.asc Description: PGP signature
Re: Memory barrier needed with wake_up_process()?
On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote: > On Mon, 5 Sep 2016, Peter Zijlstra wrote: > > > > Actually, seeing it written out like this, one realizes that it really > > > ought to be: > > > > > > CPU 0 CPU 1 > > > - - > > > Start DMA Handle DMA-complete irq > > > Sleep until bh->state smp_mb() > > > set bh->state > > > Wake up CPU 0 > > > smp_mb() > > > Compute rc based on contents of the DMA buffer > > > > > > (Bear in mind also that on some platforms, the I/O operation is carried > > > out by PIO rather than DMA.) > > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC. > > Its only defined to do CPU/CPU interactions. > > Suppose the DMA master finishes filling up an input buffer and issues a > completion irq to CPU1. Presumably the data would then be visible to > CPU1 if the interrupt handler looked at it. So if CPU1 executes smp_mb > before setting bh->state and waking up CPU0, and if CPU0 executes > smp_mb after testing bh->state and before reading the data buffer, > wouldn't CPU0 then see the correct data in the buffer? Even if CPU0 > never did go to sleep? Couple of notes here; I would expect the DMA master to make its stores _globally_ visible on 'completion'. Because I'm not sure our smp_mb() would help make it globally visible, since its only defined on CPU/CPU interactions, not external. Note that for example ARM has the distinction where smp_mb() uses DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST} and mb() uses DSB-SY. The ISH domain is the Inner-SHarable (IIRC) and only includes CPUs. The OSH is Outer-SHarable and adds external agents like DMA (also includes CPUs). The DSB-SY thing is even heavier and syncs world or something; I always forget these details. > Or would something more be needed? The thing is, this is x86 (TSO). Most everything is globally ordered/visible and full barriers. The only reorder allowed by TSO is a later read can happen before a prior store. That is, only: X = 1; smp_mb(); r = Y; actually needs a full barrier. All the other variants are no-ops. Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()). Which seems to imply DMA is coherent and globally visible. > > I would very much expect the device IO stuff to order things for us in > > this case. "Start DMA" should very much include sufficient fences to > > ensure the data under DMA is visible to the DMA engine, this would very > > much include things like flushing store buffers and maybe even writeback > > caches, depending on platform needs. > > > > At the same time, I would expect "Handle DMA-complete irq", even if it > > were done with a PIO polling loop, to guarantee ordering against later > > operations such that 'complete' really means that. > > That's what I would expect too. > > Back in the original email thread where the problem was first reported, > Felipe says that the problem appears to be something else. Here's what > it looks like now, in schematic form: > > CPU0 > > get_next_command(): > while (bh->state != BUF_STATE_EMPTY) > sleep_thread(); > start an input request for bh > while (bh->state != BUF_STATE_FULL) > sleep_thread(); > > As mentioned above, the input involves DMA and is terminated by an irq. > The request's completion handler is bulk_out_complete(): > > CPU1 > > bulk_out_complete(): > bh->state = BUF_STATE_FULL; > wakeup_thread(); > > According to Felipe, when CPU0 wakes up and checks bh->state, it sees a > value different from BUF_STATE_FULL. So it goes back to sleep again > and doesn't make any forward progress. > > It's possible that something else is changing bh->state when it > shouldn't. But if this were the explanation, why would Felipe see that > the problem goes away when he changes the memory barriers in > sleep_thread() and wakeup_thread() to smp_mb()? Good question, let me stare at the code more. Could you confirm that bulk_{in,out}_complete() work on different usb_request structures, and they can not, at any time, get called on the _same_ request? -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Mon, Sep 05, 2016 at 10:43:11AM +0100, Will Deacon wrote: > On Sat, Sep 03, 2016 at 12:16:29AM +0200, Peter Zijlstra wrote: > > Forgot to Cc Will. Will, does ARM64 need to make smp_mb__before_spinlock > > smp_mb() too? > > Yes, probably. Just to confirm, the test is something like: > > > CPU0 > > > Wx=1 > smp_mb__before_spinlock() > LOCK(y) > Rz=0 > > CPU1 > > > Wz=1 > smp_mb() > Rx=0 > > > and that should be forbidden? Indeed. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Mon, 5 Sep 2016, Peter Zijlstra wrote: > > Actually, seeing it written out like this, one realizes that it really > > ought to be: > > > > CPU 0 CPU 1 > > - - > > Start DMA Handle DMA-complete irq > > Sleep until bh->state smp_mb() > > set bh->state > > Wake up CPU 0 > > smp_mb() > > Compute rc based on contents of the DMA buffer > > > > (Bear in mind also that on some platforms, the I/O operation is carried > > out by PIO rather than DMA.) > > I'm sorry, but I still don't follow. This could be because I seldom > interact with DMA agents and therefore am not familiar with that stuff. I haven't seen the details of memory ordering requirements in the presence of DMA spelled out anywhere. The documents I have read merely state that you have to careful to flush caches before doing DMA OUT and to avoid filling caches before DMA IN is complete. Neither of those is an issue here, apparently. > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC. > Its only defined to do CPU/CPU interactions. Suppose the DMA master finishes filling up an input buffer and issues a completion irq to CPU1. Presumably the data would then be visible to CPU1 if the interrupt handler looked at it. So if CPU1 executes smp_mb before setting bh->state and waking up CPU0, and if CPU0 executes smp_mb after testing bh->state and before reading the data buffer, wouldn't CPU0 then see the correct data in the buffer? Even if CPU0 never did go to sleep? Or would something more be needed? > I would very much expect the device IO stuff to order things for us in > this case. "Start DMA" should very much include sufficient fences to > ensure the data under DMA is visible to the DMA engine, this would very > much include things like flushing store buffers and maybe even writeback > caches, depending on platform needs. > > At the same time, I would expect "Handle DMA-complete irq", even if it > were done with a PIO polling loop, to guarantee ordering against later > operations such that 'complete' really means that. That's what I would expect too. Back in the original email thread where the problem was first reported, Felipe says that the problem appears to be something else. Here's what it looks like now, in schematic form: CPU0 get_next_command(): while (bh->state != BUF_STATE_EMPTY) sleep_thread(); start an input request for bh while (bh->state != BUF_STATE_FULL) sleep_thread(); As mentioned above, the input involves DMA and is terminated by an irq. The request's completion handler is bulk_out_complete(): CPU1 bulk_out_complete(): bh->state = BUF_STATE_FULL; wakeup_thread(); According to Felipe, when CPU0 wakes up and checks bh->state, it sees a value different from BUF_STATE_FULL. So it goes back to sleep again and doesn't make any forward progress. It's possible that something else is changing bh->state when it shouldn't. But if this were the explanation, why would Felipe see that the problem goes away when he changes the memory barriers in sleep_thread() and wakeup_thread() to smp_mb()? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Mon, 5 Sep 2016, Peter Zijlstra wrote: > > You know, I never went through and verified that _all_ the invocations > > of sleep_thread() are like that. > > Well, thing is, they're all inside a loop which checks other conditions > for forward progress. Therefore the loop inside sleep_thread() is > pointless. Even if you were to return early, you'd simply loop in the > outer loop and go back to sleep again. > > > In fact, I wrote the sleep/wakeup > > routines _before_ the rest of the code, and I didn't know in advance > > exactly how they were going to be called. > > Still seems strange to me, why not use wait-queues for the first cut? > > Only if you find a performance issue with wait-queues, which cannot be > fixed in the wait-queue proper, then do you do custom thingies. > > Starting with a custom sleeper, just doesn't make sense to me. I really don't remember. Felipe says that the ancient history shows the initial implementation did use a wait-queue, and then it was changed. Perhaps I was imitating the structure of scsi_error_handler(). > > The problem may be that when the thread wakes up (or skips going to > > sleep), it needs to see more than just bh->state. Those other values > > it needs are not written by the same CPU that calls wakeup_thread(), > > and so to ensure that they are visible that smp_wmb() really ought to > > be smp_mb() (and correspondingly in the thread. That's what Felipe has > > been testing. > > So you're saying something like: > > > CPU0CPU1CPU2 > > X = 1 sleep_thread() > wakeup_thread() > r = X > > But how does CPU1 know to do the wakeup? That is, how are CPU0 and CPU1 > coupled. As mentioned later on, "CPU0" is actually a DMA master, not another CPU. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, Sep 03, 2016 at 12:16:29AM +0200, Peter Zijlstra wrote: > On Sat, Sep 03, 2016 at 12:14:13AM +0200, Peter Zijlstra wrote: > > On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote: > > > > > > Actually, that's not entirely true (although presumably it works okay > > > for most architectures). > > > > Yeah, all load-store archs (with exception of PowerPC and ARM64 and > > possibly MIPS) implement ACQUIRE with a general fence (after the ll/sc). > > > > ( and MIPS doesn't use their fancy barriers in Linux ) > > > > PowerPC does the full fence for smp_mb__before_spinlock, which leaves > > ARM64, I'm not sure its correct, but I'm way too tired to think about > > that now. > > > > The TSO archs imply full barriers with all atomic RmW ops and are > > therefore also good. > > > > Forgot to Cc Will. Will, does ARM64 need to make smp_mb__before_spinlock > smp_mb() too? Yes, probably. Just to confirm, the test is something like: CPU0 Wx=1 smp_mb__before_spinlock() LOCK(y) Rz=0 CPU1 Wz=1 smp_mb() Rx=0 and that should be forbidden? Will -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, Sep 03, 2016 at 10:49:39AM -0400, Alan Stern wrote: > On Sat, 3 Sep 2016, Alan Stern wrote: > > > In other words, we have: > > > > CPU 0 CPU 1 > > - - > > Start DMA Handle DMA-complete irq > > Sleep until bh->state Set bh->state > > smp_wmb() > > Wake up CPU 0 > > smp_rmb() > > Compute rc based on contents > > of the DMA buffer > > > > This was written many years ago, at a time when I did not fully > > understand all the details of memory ordering. Do you agree that both > > of those barriers should really be smp_mb()? That's what Felipe has > > been testing. > > Actually, seeing it written out like this, one realizes that it really > ought to be: > > CPU 0 CPU 1 > - - > Start DMA Handle DMA-complete irq > Sleep until bh->state smp_mb() > set bh->state > Wake up CPU 0 > smp_mb() > Compute rc based on contents of the DMA buffer > > (Bear in mind also that on some platforms, the I/O operation is carried > out by PIO rather than DMA.) I'm sorry, but I still don't follow. This could be because I seldom interact with DMA agents and therefore am not familiar with that stuff. Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC. Its only defined to do CPU/CPU interactions. I would very much expect the device IO stuff to order things for us in this case. "Start DMA" should very much include sufficient fences to ensure the data under DMA is visible to the DMA engine, this would very much include things like flushing store buffers and maybe even writeback caches, depending on platform needs. At the same time, I would expect "Handle DMA-complete irq", even if it were done with a PIO polling loop, to guarantee ordering against later operations such that 'complete' really means that. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, Sep 03, 2016 at 04:51:07PM +0300, Felipe Balbi wrote: > > That said, I cannot spot an obvious fail, > > okay, but a fail does exist. Any hints on what extra information I could > capture to help figuring this one out? More information on which sleep is not waking woudl help I suppose. That greatly limits the amount of code one has to stare at. Also, a better picture of all threads involved in the wakeup. Alan seems to suggest there's multiple CPUs involved in the wakeup. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, Sep 03, 2016 at 10:16:31AM -0400, Alan Stern wrote: > > Sorry, but that is horrible code. A barrier cannot ensure writes are > > 'complete', at best they can ensure order between writes (or reads > > etc..). > > The code is better than the comment. What I really meant was that the > write of bh->state needs to be visible to the thread after it wakes up > (or after it checks the wakeup condition and skips going to sleep). Yeah, I got that. > > Also, looking at that thing, that common->thread_wakeup_needed variable > > is 100% redundant. All sleep_thread() invocations are inside a loop of > > sorts and basically wait for other conditions to become true. > > > > For example: > > > > while (bh->state != BUF_STATE_EMPTY) { > > rc = sleep_thread(common, false); > > if (rc) > > return rc; > > } > > > > All you care about there is bh->state, _not_ > > common->thread_wakeup_needed. > > You know, I never went through and verified that _all_ the invocations > of sleep_thread() are like that. Well, thing is, they're all inside a loop which checks other conditions for forward progress. Therefore the loop inside sleep_thread() is pointless. Even if you were to return early, you'd simply loop in the outer loop and go back to sleep again. > In fact, I wrote the sleep/wakeup > routines _before_ the rest of the code, and I didn't know in advance > exactly how they were going to be called. Still seems strange to me, why not use wait-queues for the first cut? Only if you find a performance issue with wait-queues, which cannot be fixed in the wait-queue proper, then do you do custom thingies. Starting with a custom sleeper, just doesn't make sense to me. > > That said, I cannot spot an obvious fail, but the code can certainly use > > help. > > The problem may be that when the thread wakes up (or skips going to > sleep), it needs to see more than just bh->state. Those other values > it needs are not written by the same CPU that calls wakeup_thread(), > and so to ensure that they are visible that smp_wmb() really ought to > be smp_mb() (and correspondingly in the thread. That's what Felipe has > been testing. So you're saying something like: CPU0CPU1CPU2 X = 1 sleep_thread() wakeup_thread() r = X But how does CPU1 know to do the wakeup? That is, how are CPU0 and CPU1 coupled. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, 3 Sep 2016, Alan Stern wrote: > In other words, we have: > > CPU 0 CPU 1 > - - > Start DMA Handle DMA-complete irq > Sleep until bh->state Set bh->state > smp_wmb() > Wake up CPU 0 > smp_rmb() > Compute rc based on contents > of the DMA buffer > > This was written many years ago, at a time when I did not fully > understand all the details of memory ordering. Do you agree that both > of those barriers should really be smp_mb()? That's what Felipe has > been testing. Actually, seeing it written out like this, one realizes that it really ought to be: CPU 0 CPU 1 - - Start DMA Handle DMA-complete irq Sleep until bh->state smp_mb() set bh->state Wake up CPU 0 smp_mb() Compute rc based on contents of the DMA buffer (Bear in mind also that on some platforms, the I/O operation is carried out by PIO rather than DMA.) Also, the smp_wmb() in bulk_out_complete() looks unnecessary. I can't remember why I put it there originally. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, 3 Sep 2016, Peter Zijlstra wrote: > On Fri, Sep 02, 2016 at 04:29:19PM -0400, Alan Stern wrote: > > I'm afraid so. The code doesn't use wait_event(), in part because > > there's no wait_queue (since only one task is involved). > > You can use wait_queue fine with just one task, and it would clean up > the code tremendously. > > You can replace things like the earlier mentioned: > > while (bh->state != BUF_STATE_EMPTY) { > rc = sleep_thread(common, false); > if (rc) > return rc; > } > > with: > > rc = wait_event_interruptible(>wq, bh->state == > BUF_STATE_EMPTY); > if (rc) > return rc; If someone wants to devote time and effort to cleaning up the driver, that would be a good start. > > But maybe there's another barrier which needs to be fixed. Felipe, can > > you check to see if received_cbw() is getting called in > > get_next_command(), and if so, what value it returns? Or is the > > preceding sleep_thread() the one that never wakes up? > > > > It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb(). > > The reason being that get_next_command() runs outside the protection of > > the spinlock. > > Being somewhat confused by the code, I fail to follow that argument. > wakeup_thread() is always called under that spinlock(), but since the > critical section is 2 stores, I fail to see how a smp_mb() can make any > difference over the smp_wmb() already there. But sleep_thread() and the code that follows it are _not_ called under the spinlock. And the following code examines values that were written by DMA, not by the CPU calling wakeup_thread(). (Although that CPU _is_ the one that receives the DMA-completion notice.) In other words, we have: CPU 0 CPU 1 - - Start DMA Handle DMA-complete irq Sleep until bh->state Set bh->state smp_wmb() Wake up CPU 0 smp_rmb() Compute rc based on contents of the DMA buffer This was written many years ago, at a time when I did not fully understand all the details of memory ordering. Do you agree that both of those barriers should really be smp_mb()? That's what Felipe has been testing. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, 3 Sep 2016, Peter Zijlstra wrote: > On Sat, Sep 03, 2016 at 09:58:09AM +0300, Felipe Balbi wrote: > > > > What arch are you seeing this on? > > > > x86. Skylake to be exact. > > So it _cannot_ be the thing Alan mentioned. By the simple fact that > spin_lock() is a full barrier on x86 (every LOCK prefixed instruction > is). True, my guess was wrong. > > The following change survived through the night: > > > > diff --git a/drivers/usb/gadget/function/f_mass_storage.c > > b/drivers/usb/gadget/function/f_mass_storage.c > > index 8f3659b65f53..d31581dd5ce5 100644 > > --- a/drivers/usb/gadget/function/f_mass_storage.c > > +++ b/drivers/usb/gadget/function/f_mass_storage.c > > @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct > > usb_ep *ep) > > /* Caller must hold fsg->lock */ > > static void wakeup_thread(struct fsg_common *common) > > { > > - smp_wmb(); /* ensure the write of bh->state is complete */ > > + smp_mb(); /* ensure the write of bh->state is complete */ > > /* Tell the main thread that something has happened */ > > common->thread_wakeup_needed = 1; > > if (common->thread_task) > > @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool > > can_freeze) > > } > > __set_current_state(TASK_RUNNING); > > common->thread_wakeup_needed = 0; > > - smp_rmb(); /* ensure the latest bh->state is visible */ > > + smp_mb(); /* ensure the latest bh->state is visible */ > > return rc; > > } > > Sorry, but that is horrible code. A barrier cannot ensure writes are > 'complete', at best they can ensure order between writes (or reads > etc..). The code is better than the comment. What I really meant was that the write of bh->state needs to be visible to the thread after it wakes up (or after it checks the wakeup condition and skips going to sleep). > Also, looking at that thing, that common->thread_wakeup_needed variable > is 100% redundant. All sleep_thread() invocations are inside a loop of > sorts and basically wait for other conditions to become true. > > For example: > > while (bh->state != BUF_STATE_EMPTY) { > rc = sleep_thread(common, false); > if (rc) > return rc; > } > > All you care about there is bh->state, _not_ > common->thread_wakeup_needed. You know, I never went through and verified that _all_ the invocations of sleep_thread() are like that. In fact, I wrote the sleep/wakeup routines _before_ the rest of the code, and I didn't know in advance exactly how they were going to be called. > That said, I cannot spot an obvious fail, but the code can certainly use > help. The problem may be that when the thread wakes up (or skips going to sleep), it needs to see more than just bh->state. Those other values it needs are not written by the same CPU that calls wakeup_thread(), and so to ensure that they are visible that smp_wmb() really ought to be smp_mb() (and correspondingly in the thread. That's what Felipe has been testing. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
Hi, Peter Zijlstrawrites: > On Sat, Sep 03, 2016 at 09:58:09AM +0300, Felipe Balbi wrote: > >> > What arch are you seeing this on? >> >> x86. Skylake to be exact. > > So it _cannot_ be the thing Alan mentioned. By the simple fact that > spin_lock() is a full barrier on x86 (every LOCK prefixed instruction > is). I still have this working even after 15 hours of runtime on a test case that was failing consistently within few minutes. At a minimum smp_mb() has some side effect which is hiding the actual problem. >> The following change survived through the night: >> >> diff --git a/drivers/usb/gadget/function/f_mass_storage.c >> b/drivers/usb/gadget/function/f_mass_storage.c >> index 8f3659b65f53..d31581dd5ce5 100644 >> --- a/drivers/usb/gadget/function/f_mass_storage.c >> +++ b/drivers/usb/gadget/function/f_mass_storage.c >> @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct >> usb_ep *ep) >> /* Caller must hold fsg->lock */ >> static void wakeup_thread(struct fsg_common *common) >> { >> -smp_wmb(); /* ensure the write of bh->state is complete */ >> +smp_mb(); /* ensure the write of bh->state is complete */ >> /* Tell the main thread that something has happened */ >> common->thread_wakeup_needed = 1; >> if (common->thread_task) >> @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool >> can_freeze) >> } >> __set_current_state(TASK_RUNNING); >> common->thread_wakeup_needed = 0; >> -smp_rmb(); /* ensure the latest bh->state is visible */ >> +smp_mb(); /* ensure the latest bh->state is visible */ >> return rc; >> } > > Sorry, but that is horrible code. A barrier cannot ensure writes are > 'complete', at best they can ensure order between writes (or reads > etc..). not arguing ;-) > Also, looking at that thing, that common->thread_wakeup_needed variable > is 100% redundant. All sleep_thread() invocations are inside a loop of > sorts and basically wait for other conditions to become true. > > For example: > > while (bh->state != BUF_STATE_EMPTY) { > rc = sleep_thread(common, false); > if (rc) > return rc; > } right > All you care about there is bh->state, _not_ > common->thread_wakeup_needed. > > That said, I cannot spot an obvious fail, okay, but a fail does exist. Any hints on what extra information I could capture to help figuring this one out? > but the code can certainly use help. Sure, that can be done for v4.9 (if I have time) or v4.10 merge window. Meanwhile, we're trying to find a minimal fix for the -rc which can also be backported to stable, right? -- balbi -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, Sep 03, 2016 at 09:58:09AM +0300, Felipe Balbi wrote: > > What arch are you seeing this on? > > x86. Skylake to be exact. So it _cannot_ be the thing Alan mentioned. By the simple fact that spin_lock() is a full barrier on x86 (every LOCK prefixed instruction is). > The following change survived through the night: > > diff --git a/drivers/usb/gadget/function/f_mass_storage.c > b/drivers/usb/gadget/function/f_mass_storage.c > index 8f3659b65f53..d31581dd5ce5 100644 > --- a/drivers/usb/gadget/function/f_mass_storage.c > +++ b/drivers/usb/gadget/function/f_mass_storage.c > @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct > usb_ep *ep) > /* Caller must hold fsg->lock */ > static void wakeup_thread(struct fsg_common *common) > { > - smp_wmb(); /* ensure the write of bh->state is complete */ > + smp_mb(); /* ensure the write of bh->state is complete */ > /* Tell the main thread that something has happened */ > common->thread_wakeup_needed = 1; > if (common->thread_task) > @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool > can_freeze) > } > __set_current_state(TASK_RUNNING); > common->thread_wakeup_needed = 0; > - smp_rmb(); /* ensure the latest bh->state is visible */ > + smp_mb(); /* ensure the latest bh->state is visible */ > return rc; > } Sorry, but that is horrible code. A barrier cannot ensure writes are 'complete', at best they can ensure order between writes (or reads etc..). Also, looking at that thing, that common->thread_wakeup_needed variable is 100% redundant. All sleep_thread() invocations are inside a loop of sorts and basically wait for other conditions to become true. For example: while (bh->state != BUF_STATE_EMPTY) { rc = sleep_thread(common, false); if (rc) return rc; } All you care about there is bh->state, _not_ common->thread_wakeup_needed. That said, I cannot spot an obvious fail, but the code can certainly use help. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, Sep 02, 2016 at 04:29:19PM -0400, Alan Stern wrote: > I'm afraid so. The code doesn't use wait_event(), in part because > there's no wait_queue (since only one task is involved). You can use wait_queue fine with just one task, and it would clean up the code tremendously. You can replace things like the earlier mentioned: while (bh->state != BUF_STATE_EMPTY) { rc = sleep_thread(common, false); if (rc) return rc; } with: rc = wait_event_interruptible(>wq, bh->state == BUF_STATE_EMPTY); if (rc) return rc; > But maybe there's another barrier which needs to be fixed. Felipe, can > you check to see if received_cbw() is getting called in > get_next_command(), and if so, what value it returns? Or is the > preceding sleep_thread() the one that never wakes up? > > It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb(). > The reason being that get_next_command() runs outside the protection of > the spinlock. Being somewhat confused by the code, I fail to follow that argument. wakeup_thread() is always called under that spinlock(), but since the critical section is 2 stores, I fail to see how a smp_mb() can make any difference over the smp_wmb() already there. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, Sep 02, 2016 at 04:29:19PM -0400, Alan Stern wrote: > On Fri, 2 Sep 2016, Paul E. McKenney wrote: > > > On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote: > > > Paul, Peter, and Ingo: > > > > > > This must have come up before, but I don't know what was decided. > > > > > > Isn't it often true that a memory barrier is needed before a call to > > > wake_up_process()? A typical scenario might look like this: > > > > > > CPU 0 > > > - > > > for (;;) { > > > set_current_state(TASK_INTERRUPTIBLE); > > > if (signal_pending(current)) > > > break; > > > if (wakeup_flag) > > > break; > > > schedule(); > > > } > > > __set_current_state(TASK_RUNNING); > > > wakeup_flag = 0; > > > > > > > > > CPU 1 > > > - > > > wakeup_flag = 1; > > > wake_up_process(my_task); > > > > > > The underlying pattern is: > > > > > > CPU 0 CPU 1 > > > - - > > > write current->statewrite wakeup_flag > > > smp_mb(); > > > read wakeup_flagread my_task->state > > > > > > where set_current_state() does the write to current->state and > > > automatically adds the smp_mb(), and wake_up_process() reads > > > my_task->state to see whether the task needs to be woken up. > > > > > > The kerneldoc for wake_up_process() says that it has no implied memory > > > barrier if it doesn't actually wake anything up. And even when it > > > does, the implied barrier is only smp_wmb, not smp_mb. > > > > > > This is the so-called SB (Store Buffer) pattern, which is well known to > > > require a full smp_mb on both sides. Since wake_up_process() doesn't > > > include smp_mb(), isn't it correct that the caller must add it > > > explicitly? > > > > > > In other words, shouldn't the code for CPU 1 really be: > > > > > > wakeup_flag = 1; > > > smp_mb(); > > > wake_up_process(task); > > > > > > If my reasoning is correct, then why doesn't wake_up_process() include > > > this memory barrier automatically, the way set_current_state() does? > > > There could be an alternate version (__wake_up_process()) which omits > > > the barrier, just like __set_current_state(). > > > > A common case uses locking, in which case additional memory barriers > > inside of the wait/wakeup functions are not needed. Any accesses made > > while holding the lock before invoking the wakeup function (e.g., > > wake_up()) are guaranteed to be seen after acquiring that same > > lock following return from the wait function (e.g., wait_event()). > > In this case, adding barriers to the wait and wakeup functions would > > just add overhead. > > > > But yes, this decision does mean that people using the wait/wakeup > > functions without locking need to be more careful. Something like > > this: > > > > /* prior accesses. */ > > smp_mb(); > > wakeup_flag = 1; > > wake_up(...); > > > > And on the other task: > > > > wait_event(... wakeup_flag == 1 ...); > > smp_mb(); > > /* The waker's prior accesses will be visible here. */ > > > > Or am I missing your point? > > I'm afraid so. The code doesn't use wait_event(), in part because > there's no wait_queue (since only one task is involved). Ah, got it. The required pattern should be very similar, however. > But maybe there's another barrier which needs to be fixed. Felipe, can > you check to see if received_cbw() is getting called in > get_next_command(), and if so, what value it returns? Or is the > preceding sleep_thread() the one that never wakes up? > > It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb(). > The reason being that get_next_command() runs outside the protection of > the spinlock. This sounds very likely to me. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
hi, Peter Zijlstrawrites: > On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote: >> Felipe, your tests will show whether my guess was totally off-base. >> For the new people, Felipe is tracking down a problem that involves >> exactly the code sequence listed at the top of the email, where we know >> that the wakeup routine runs but nevertheless the task sleeps. At >> least, that's what it looks like at the moment. > > What arch are you seeing this on? x86. Skylake to be exact. The following change survived through the night: diff --git a/drivers/usb/gadget/function/f_mass_storage.c b/drivers/usb/gadget/function/f_mass_storage.c index 8f3659b65f53..d31581dd5ce5 100644 --- a/drivers/usb/gadget/function/f_mass_storage.c +++ b/drivers/usb/gadget/function/f_mass_storage.c @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct usb_ep *ep) /* Caller must hold fsg->lock */ static void wakeup_thread(struct fsg_common *common) { - smp_wmb(); /* ensure the write of bh->state is complete */ + smp_mb(); /* ensure the write of bh->state is complete */ /* Tell the main thread that something has happened */ common->thread_wakeup_needed = 1; if (common->thread_task) @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool can_freeze) } __set_current_state(TASK_RUNNING); common->thread_wakeup_needed = 0; - smp_rmb(); /* ensure the latest bh->state is visible */ + smp_mb(); /* ensure the latest bh->state is visible */ return rc; } I'll keep it running until Monday at least -- balbi -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote: > Felipe, your tests will show whether my guess was totally off-base. > For the new people, Felipe is tracking down a problem that involves > exactly the code sequence listed at the top of the email, where we know > that the wakeup routine runs but nevertheless the task sleeps. At > least, that's what it looks like at the moment. What arch are you seeing this on? -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Sat, Sep 03, 2016 at 12:14:13AM +0200, Peter Zijlstra wrote: > On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote: > > > > Actually, that's not entirely true (although presumably it works okay > > for most architectures). > > Yeah, all load-store archs (with exception of PowerPC and ARM64 and > possibly MIPS) implement ACQUIRE with a general fence (after the ll/sc). > > ( and MIPS doesn't use their fancy barriers in Linux ) > > PowerPC does the full fence for smp_mb__before_spinlock, which leaves > ARM64, I'm not sure its correct, but I'm way too tired to think about > that now. > > The TSO archs imply full barriers with all atomic RmW ops and are > therefore also good. > Forgot to Cc Will. Will, does ARM64 need to make smp_mb__before_spinlock smp_mb() too? -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote: > > Actually, that's not entirely true (although presumably it works okay > for most architectures). Yeah, all load-store archs (with exception of PowerPC and ARM64 and possibly MIPS) implement ACQUIRE with a general fence (after the ll/sc). ( and MIPS doesn't use their fancy barriers in Linux ) PowerPC does the full fence for smp_mb__before_spinlock, which leaves ARM64, I'm not sure its correct, but I'm way too tired to think about that now. The TSO archs imply full barriers with all atomic RmW ops and are therefore also good. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, 2 Sep 2016, Paul E. McKenney wrote: > On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote: > > Paul, Peter, and Ingo: > > > > This must have come up before, but I don't know what was decided. > > > > Isn't it often true that a memory barrier is needed before a call to > > wake_up_process()? A typical scenario might look like this: > > > > CPU 0 > > - > > for (;;) { > > set_current_state(TASK_INTERRUPTIBLE); > > if (signal_pending(current)) > > break; > > if (wakeup_flag) > > break; > > schedule(); > > } > > __set_current_state(TASK_RUNNING); > > wakeup_flag = 0; > > > > > > CPU 1 > > - > > wakeup_flag = 1; > > wake_up_process(my_task); > > > > The underlying pattern is: > > > > CPU 0 CPU 1 > > - - > > write current->statewrite wakeup_flag > > smp_mb(); > > read wakeup_flagread my_task->state > > > > where set_current_state() does the write to current->state and > > automatically adds the smp_mb(), and wake_up_process() reads > > my_task->state to see whether the task needs to be woken up. > > > > The kerneldoc for wake_up_process() says that it has no implied memory > > barrier if it doesn't actually wake anything up. And even when it > > does, the implied barrier is only smp_wmb, not smp_mb. > > > > This is the so-called SB (Store Buffer) pattern, which is well known to > > require a full smp_mb on both sides. Since wake_up_process() doesn't > > include smp_mb(), isn't it correct that the caller must add it > > explicitly? > > > > In other words, shouldn't the code for CPU 1 really be: > > > > wakeup_flag = 1; > > smp_mb(); > > wake_up_process(task); > > > > If my reasoning is correct, then why doesn't wake_up_process() include > > this memory barrier automatically, the way set_current_state() does? > > There could be an alternate version (__wake_up_process()) which omits > > the barrier, just like __set_current_state(). > > A common case uses locking, in which case additional memory barriers > inside of the wait/wakeup functions are not needed. Any accesses made > while holding the lock before invoking the wakeup function (e.g., > wake_up()) are guaranteed to be seen after acquiring that same > lock following return from the wait function (e.g., wait_event()). > In this case, adding barriers to the wait and wakeup functions would > just add overhead. > > But yes, this decision does mean that people using the wait/wakeup > functions without locking need to be more careful. Something like > this: > > /* prior accesses. */ > smp_mb(); > wakeup_flag = 1; > wake_up(...); > > And on the other task: > > wait_event(... wakeup_flag == 1 ...); > smp_mb(); > /* The waker's prior accesses will be visible here. */ > > Or am I missing your point? I'm afraid so. The code doesn't use wait_event(), in part because there's no wait_queue (since only one task is involved). But maybe there's another barrier which needs to be fixed. Felipe, can you check to see if received_cbw() is getting called in get_next_command(), and if so, what value it returns? Or is the preceding sleep_thread() the one that never wakes up? It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb(). The reason being that get_next_command() runs outside the protection of the spinlock. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, 2 Sep 2016, Peter Zijlstra wrote: > On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote: > > Paul, Peter, and Ingo: > > > > This must have come up before, but I don't know what was decided. > > > > Isn't it often true that a memory barrier is needed before a call to > > wake_up_process()? A typical scenario might look like this: > > > > CPU 0 > > - > > for (;;) { > > set_current_state(TASK_INTERRUPTIBLE); > > if (signal_pending(current)) > > break; > > if (wakeup_flag) > > break; > > schedule(); > > } > > __set_current_state(TASK_RUNNING); > > wakeup_flag = 0; > > > > > > CPU 1 > > - > > wakeup_flag = 1; > > wake_up_process(my_task); > > > > The underlying pattern is: > > > > CPU 0 CPU 1 > > - - > > write current->statewrite wakeup_flag > > smp_mb(); > > read wakeup_flagread my_task->state > > > > where set_current_state() does the write to current->state and > > automatically adds the smp_mb(), and wake_up_process() reads > > my_task->state to see whether the task needs to be woken up. > > > > The kerneldoc for wake_up_process() says that it has no implied memory > > barrier if it doesn't actually wake anything up. And even when it > > does, the implied barrier is only smp_wmb, not smp_mb. > > > > This is the so-called SB (Store Buffer) pattern, which is well known to > > require a full smp_mb on both sides. Since wake_up_process() doesn't > > include smp_mb(), isn't it correct that the caller must add it > > explicitly? > > > > In other words, shouldn't the code for CPU 1 really be: > > > > wakeup_flag = 1; > > smp_mb(); > > wake_up_process(task); > > > > No, it doesn't need to do that. try_to_wake_up() does the right thing. > > It does: > > smp_mb__before_spinlock(); > raw_spin_lock_irqsave(>pi_lock); > > Now, smp_mb__before_spinlock() is a bit of an odd duck, if you look at > its comment it says: > > /* > * Despite its name it doesn't necessarily has to be a full barrier. > * It should only guarantee that a STORE before the critical section > * can not be reordered with LOADs and STOREs inside this section. > * spin_lock() is the one-way barrier, this LOAD can not escape out > * of the region. So the default implementation simply ensures that > * a STORE can not move into the critical section, smp_wmb() should > * serialize it with another STORE done by spin_lock(). > */ > #ifndef smp_mb__before_spinlock > #define smp_mb__before_spinlock() smp_wmb() > #endif I see. So the kerneldoc for wake_up_process() is misleading at best. > So per default it ends up being: > > WMB > LOCK > > Which is sufficient to order the prior store vs the later load as is > required. Note that a spinlock acquire _must_ imply a store (we need to > mark the lock as taken), therefore the prior store is ordered against > the lock store per the wmb, and since the lock must imply an ACQUIRE > that limits the load. Actually, that's not entirely true (although presumably it works okay for most architectures). In theory, the first write and the WMB can move down after the lock's ACQUIRE. Then the later read could move earlier, before the lock's store, the WMB, and the first write, but still after the ACQUIRE. Thus the later read could be reordered with the first write. In other words, these actions: store wakeup_flag WMB load-ACQUIRE lock store lock read task->state can be reordered to: load-ACQUIRE lock store wakeup_flag WMB store lock read task->state and then to: load-ACQUIRE lock read task->state store wakeup_flag WMB store lock without violating any ordering rules. > Now, PowerPC defines smp_mb__before_spinlock as smp_mb(), and this is > because PowerPC ACQUIRE is a bit of an exception, if you want more > details I'm sure I or Paul can dredge them up :-) PPC can't do the reordering described above, but it does have other weaknesses. Without smp_mb(), the write to wakeup_flag doesn't have to propagate from one CPU to the other before the second CPU tries to read it. Felipe, your tests will show whether my guess was totally off-base. For the new people, Felipe is tracking down a problem that involves exactly the code sequence listed at the top of the email, where we know that the wakeup routine runs but nevertheless the task sleeps. At least, that's what it looks like at the moment. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote: > Paul, Peter, and Ingo: > > This must have come up before, but I don't know what was decided. > > Isn't it often true that a memory barrier is needed before a call to > wake_up_process()? A typical scenario might look like this: > > CPU 0 > - > for (;;) { > set_current_state(TASK_INTERRUPTIBLE); > if (signal_pending(current)) > break; > if (wakeup_flag) > break; > schedule(); > } > __set_current_state(TASK_RUNNING); > wakeup_flag = 0; > > > CPU 1 > - > wakeup_flag = 1; > wake_up_process(my_task); > > The underlying pattern is: > > CPU 0 CPU 1 > - - > write current->statewrite wakeup_flag > smp_mb(); > read wakeup_flagread my_task->state > > where set_current_state() does the write to current->state and > automatically adds the smp_mb(), and wake_up_process() reads > my_task->state to see whether the task needs to be woken up. > > The kerneldoc for wake_up_process() says that it has no implied memory > barrier if it doesn't actually wake anything up. And even when it > does, the implied barrier is only smp_wmb, not smp_mb. > > This is the so-called SB (Store Buffer) pattern, which is well known to > require a full smp_mb on both sides. Since wake_up_process() doesn't > include smp_mb(), isn't it correct that the caller must add it > explicitly? > > In other words, shouldn't the code for CPU 1 really be: > > wakeup_flag = 1; > smp_mb(); > wake_up_process(task); > No, it doesn't need to do that. try_to_wake_up() does the right thing. It does: smp_mb__before_spinlock(); raw_spin_lock_irqsave(>pi_lock); Now, smp_mb__before_spinlock() is a bit of an odd duck, if you look at its comment it says: /* * Despite its name it doesn't necessarily has to be a full barrier. * It should only guarantee that a STORE before the critical section * can not be reordered with LOADs and STOREs inside this section. * spin_lock() is the one-way barrier, this LOAD can not escape out * of the region. So the default implementation simply ensures that * a STORE can not move into the critical section, smp_wmb() should * serialize it with another STORE done by spin_lock(). */ #ifndef smp_mb__before_spinlock #define smp_mb__before_spinlock() smp_wmb() #endif So per default it ends up being: WMB LOCK Which is sufficient to order the prior store vs the later load as is required. Note that a spinlock acquire _must_ imply a store (we need to mark the lock as taken), therefore the prior store is ordered against the lock store per the wmb, and since the lock must imply an ACQUIRE that limits the load. Now, PowerPC defines smp_mb__before_spinlock as smp_mb(), and this is because PowerPC ACQUIRE is a bit of an exception, if you want more details I'm sure I or Paul can dredge them up :-) -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory barrier needed with wake_up_process()?
On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote: > Paul, Peter, and Ingo: > > This must have come up before, but I don't know what was decided. > > Isn't it often true that a memory barrier is needed before a call to > wake_up_process()? A typical scenario might look like this: > > CPU 0 > - > for (;;) { > set_current_state(TASK_INTERRUPTIBLE); > if (signal_pending(current)) > break; > if (wakeup_flag) > break; > schedule(); > } > __set_current_state(TASK_RUNNING); > wakeup_flag = 0; > > > CPU 1 > - > wakeup_flag = 1; > wake_up_process(my_task); > > The underlying pattern is: > > CPU 0 CPU 1 > - - > write current->statewrite wakeup_flag > smp_mb(); > read wakeup_flagread my_task->state > > where set_current_state() does the write to current->state and > automatically adds the smp_mb(), and wake_up_process() reads > my_task->state to see whether the task needs to be woken up. > > The kerneldoc for wake_up_process() says that it has no implied memory > barrier if it doesn't actually wake anything up. And even when it > does, the implied barrier is only smp_wmb, not smp_mb. > > This is the so-called SB (Store Buffer) pattern, which is well known to > require a full smp_mb on both sides. Since wake_up_process() doesn't > include smp_mb(), isn't it correct that the caller must add it > explicitly? > > In other words, shouldn't the code for CPU 1 really be: > > wakeup_flag = 1; > smp_mb(); > wake_up_process(task); > > If my reasoning is correct, then why doesn't wake_up_process() include > this memory barrier automatically, the way set_current_state() does? > There could be an alternate version (__wake_up_process()) which omits > the barrier, just like __set_current_state(). A common case uses locking, in which case additional memory barriers inside of the wait/wakeup functions are not needed. Any accesses made while holding the lock before invoking the wakeup function (e.g., wake_up()) are guaranteed to be seen after acquiring that same lock following return from the wait function (e.g., wait_event()). In this case, adding barriers to the wait and wakeup functions would just add overhead. But yes, this decision does mean that people using the wait/wakeup functions without locking need to be more careful. Something like this: /* prior accesses. */ smp_mb(); wakeup_flag = 1; wake_up(...); And on the other task: wait_event(... wakeup_flag == 1 ...); smp_mb(); /* The waker's prior accesses will be visible here. */ Or am I missing your point? Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Memory barrier needed with wake_up_process()?
Paul, Peter, and Ingo: This must have come up before, but I don't know what was decided. Isn't it often true that a memory barrier is needed before a call to wake_up_process()? A typical scenario might look like this: CPU 0 - for (;;) { set_current_state(TASK_INTERRUPTIBLE); if (signal_pending(current)) break; if (wakeup_flag) break; schedule(); } __set_current_state(TASK_RUNNING); wakeup_flag = 0; CPU 1 - wakeup_flag = 1; wake_up_process(my_task); The underlying pattern is: CPU 0 CPU 1 - - write current->statewrite wakeup_flag smp_mb(); read wakeup_flagread my_task->state where set_current_state() does the write to current->state and automatically adds the smp_mb(), and wake_up_process() reads my_task->state to see whether the task needs to be woken up. The kerneldoc for wake_up_process() says that it has no implied memory barrier if it doesn't actually wake anything up. And even when it does, the implied barrier is only smp_wmb, not smp_mb. This is the so-called SB (Store Buffer) pattern, which is well known to require a full smp_mb on both sides. Since wake_up_process() doesn't include smp_mb(), isn't it correct that the caller must add it explicitly? In other words, shouldn't the code for CPU 1 really be: wakeup_flag = 1; smp_mb(); wake_up_process(task); If my reasoning is correct, then why doesn't wake_up_process() include this memory barrier automatically, the way set_current_state() does? There could be an alternate version (__wake_up_process()) which omits the barrier, just like __set_current_state(). Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html