subject:"Memory barrier needed with wake_up

Re: Memory barrier needed with wake_up_process()?

2017-01-16 Thread Felipe Balbi


Hi,

Felipe Balbi  writes:
> Alan Stern  writes:
>> On Mon, 16 Jan 2017, Felipe Balbi wrote:
>>
>>> Sorry for the long delay, I finally have more information on this. All
>>> this time I was doing something that I never considered to matter: I've
>>> been running host and peripheral on the same machine. Now that I have
>>> tracepoints on xHCI as well, I could see that these 30 seconds of
>>> "nothing" is actuall full of xHCI activity and I can see that for the
>>> duration of these 30 seconds preempt depth on the CPU that (eventually)
>>> queues a request on dwc3, is always > 1 (sometimes 2, most of the time
>>> 1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU
>>> and g_mass_storage is spinning for over 30 seconds at which point
>>> storage.ko (host side class driver) dequeues the request.
>>> 
>>> I'll see if I can capture a fresh trace with both xHCI and dwc3 with
>>> this happening, but probably not today (testing stuff for -rc).
>>
>> Does anything change if the host and peripheral are separate machines?
>
> couldn't reproduce the problem yet ;-)

*yet* is a keyword here. I wouldn't call this problem "done" until I
 successfully run my test case for at least a week.

-- 
balbi
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2017-01-16 Thread Felipe Balbi


Hi,

Alan Stern  writes:
> On Mon, 16 Jan 2017, Felipe Balbi wrote:
>
>> Sorry for the long delay, I finally have more information on this. All
>> this time I was doing something that I never considered to matter: I've
>> been running host and peripheral on the same machine. Now that I have
>> tracepoints on xHCI as well, I could see that these 30 seconds of
>> "nothing" is actuall full of xHCI activity and I can see that for the
>> duration of these 30 seconds preempt depth on the CPU that (eventually)
>> queues a request on dwc3, is always > 1 (sometimes 2, most of the time
>> 1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU
>> and g_mass_storage is spinning for over 30 seconds at which point
>> storage.ko (host side class driver) dequeues the request.
>> 
>> I'll see if I can capture a fresh trace with both xHCI and dwc3 with
>> this happening, but probably not today (testing stuff for -rc).
>
> Does anything change if the host and peripheral are separate machines?

couldn't reproduce the problem yet ;-)

-- 
balbi
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2017-01-16 Thread Alan Stern

On Mon, 16 Jan 2017, Felipe Balbi wrote:

> Sorry for the long delay, I finally have more information on this. All
> this time I was doing something that I never considered to matter: I've
> been running host and peripheral on the same machine. Now that I have
> tracepoints on xHCI as well, I could see that these 30 seconds of
> "nothing" is actuall full of xHCI activity and I can see that for the
> duration of these 30 seconds preempt depth on the CPU that (eventually)
> queues a request on dwc3, is always > 1 (sometimes 2, most of the time
> 1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU
> and g_mass_storage is spinning for over 30 seconds at which point
> storage.ko (host side class driver) dequeues the request.
> 
> I'll see if I can capture a fresh trace with both xHCI and dwc3 with
> this happening, but probably not today (testing stuff for -rc).

Does anything change if the host and peripheral are separate machines?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2017-01-16 Thread Felipe Balbi


Hi,

Alan Stern  writes:
> On Tue, 20 Sep 2016, Felipe Balbi wrote:
>
>> And here's trace output (complete, scroll to bottom). It seems to me
>> like the thread didn't wake up at all. It didn't even try to
>> execute. I'll add some more traces and try to get better information
>> about what's going on.
>> 
>> 
>> 
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 2865/2865   #P:4
>> #
>> #  _-=> irqs-off
>> # / _=> need-resched
>> #| / _---=> hardirq/softirq
>> #|| / _--=> preempt-depth
>> #||| / delay
>> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
>> #  | |   |      | |
>
> Skipping to the end...
>
>>  irq/17-dwc3-2522  [002] d...43.504199: bulk_out_complete: 0, 31/31
>> file-storage-2521  [001] 43.504202: fsg_main_thread: 
>> get_next_command -> 0
>> file-storage-2521  [001] 43.504288: fsg_main_thread: 
>> do_scsi_command -> 0
>> file-storage-2521  [001] 43.504295: fsg_main_thread: 
>> finish_reply -> 0
>> file-storage-2521  [001] 43.504298: fsg_main_thread: send_status 
>> -> 0
>>  irq/17-dwc3-2522  [002] d...43.504347: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504351: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504434: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504438: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504535: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504539: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504618: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504703: bulk_in_complete: 0, 
>> 16384/16384
>>  irq/17-dwc3-2522  [002] d...43.504794: bulk_in_complete: 0, 13/13
>>  irq/17-dwc3-2522  [002] d...43.504797: bulk_out_complete: 0, 31/31
>
> Like you say, it appears that the thread didn't get woken up at all.
> But this is inconsistent with your earlier results.  On Sep. 9, you 
> posted a message that ended with these lines:
>
>>  irq/17-dwc3-3579  [003] d..1 21167.729666: bulk_out_complete: compl: bh 
>> 880111e6aac0 state 1
>> file-storage-3578  [002]  21167.729670: fsg_main_thread: next: bh 
>> 880111e6aac0 state 1
>
> This indicates that in the earlier test, the thread did start running 
> and get_next_command should have returned.
>
> The trace you posted after this one doesn't seem to show anything new, 
> as far as I can tell.
>
> So I still can't tell what's happening.  Maybe the patch below will 
> help.  It concentrates on the critical area.

Sorry for the long delay, I finally have more information on this. All
this time I was doing something that I never considered to matter: I've
been running host and peripheral on the same machine. Now that I have
tracepoints on xHCI as well, I could see that these 30 seconds of
"nothing" is actuall full of xHCI activity and I can see that for the
duration of these 30 seconds preempt depth on the CPU that (eventually)
queues a request on dwc3, is always > 1 (sometimes 2, most of the time
1). My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU
and g_mass_storage is spinning for over 30 seconds at which point
storage.ko (host side class driver) dequeues the request.

I'll see if I can capture a fresh trace with both xHCI and dwc3 with
this happening, but probably not today (testing stuff for -rc).

-- 
balbi


signature.asc
Description: PGP signature

Re: Memory barrier needed with wake_up_process()?

2016-09-20 Thread Alan Stern

On Tue, 20 Sep 2016, Felipe Balbi wrote:

> And here's trace output (complete, scroll to bottom). It seems to me
> like the thread didn't wake up at all. It didn't even try to
> execute. I'll add some more traces and try to get better information
> about what's going on.
> 
> 
> 
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 2865/2865   #P:4
> #
> #  _-=> irqs-off
> # / _=> need-resched
> #| / _---=> hardirq/softirq
> #|| / _--=> preempt-depth
> #||| / delay
> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
> #  | |   |      | |

Skipping to the end...

>  irq/17-dwc3-2522  [002] d...43.504199: bulk_out_complete: 0, 31/31
> file-storage-2521  [001] 43.504202: fsg_main_thread: 
> get_next_command -> 0
> file-storage-2521  [001] 43.504288: fsg_main_thread: 
> do_scsi_command -> 0
> file-storage-2521  [001] 43.504295: fsg_main_thread: finish_reply 
> -> 0
> file-storage-2521  [001] 43.504298: fsg_main_thread: send_status 
> -> 0
>  irq/17-dwc3-2522  [002] d...43.504347: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504351: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504434: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504438: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504535: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504539: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504618: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504703: bulk_in_complete: 0, 
> 16384/16384
>  irq/17-dwc3-2522  [002] d...43.504794: bulk_in_complete: 0, 13/13
>  irq/17-dwc3-2522  [002] d...43.504797: bulk_out_complete: 0, 31/31

Like you say, it appears that the thread didn't get woken up at all.
But this is inconsistent with your earlier results.  On Sep. 9, you 
posted a message that ended with these lines:

>  irq/17-dwc3-3579  [003] d..1 21167.729666: bulk_out_complete: compl: bh 
> 880111e6aac0 state 1
> file-storage-3578  [002]  21167.729670: fsg_main_thread: next: bh 
> 880111e6aac0 state 1

This indicates that in the earlier test, the thread did start running 
and get_next_command should have returned.

The trace you posted after this one doesn't seem to show anything new, 
as far as I can tell.

So I still can't tell what's happening.  Maybe the patch below will 
help.  It concentrates on the critical area.

Alan Stern



Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
===
--- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c
+++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
@@ -402,8 +402,12 @@ static void wakeup_thread(struct fsg_com
smp_wmb();  /* ensure the write of bh->state is complete */
/* Tell the main thread that something has happened */
common->thread_wakeup_needed = 1;
-   if (common->thread_task)
-   wake_up_process(common->thread_task);
+   if (common->thread_task) {
+   if (wake_up_process(common->thread_task))
+   trace_printk("wakeup_thread 1\n");
+   else
+   trace_printk("wakeup_thread 0\n");
+   }
 }
 
 static void raise_exception(struct fsg_common *common, enum fsg_state 
new_state)
@@ -479,6 +483,8 @@ static void bulk_out_complete(struct usb
req->status, req->actual, bh->bulk_out_intended_length);
if (req->status == -ECONNRESET) /* Request was cancelled */
usb_ep_fifo_flush(ep);
+   trace_printk("bulk_out: %d, %u/%u\n", req->status, req->actual,
+   bh->bulk_out_intended_length);
 
/* Hold the lock while we update the request and buffer states */
smp_wmb();
@@ -2204,13 +2210,17 @@ static int get_next_command(struct fsg_c
 
/* Wait for the CBW to arrive */
while (bh->state != BUF_STATE_FULL) {
+   trace_printk("get_next: sleep\n");
rc = sleep_thread(common, true);
+   trace_printk("get_next: sleep -> %d loop %d\n",
+   rc, bh->state != BUF_STATE_FULL);
if (rc)
return rc;
}
smp_rmb();
rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;
bh->state = BUF_STATE_EMPTY;
+   trace_printk("get_next: return %d\n", rc);
 
return rc;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at

Re: Memory barrier needed with wake_up_process()?

2016-09-20 Thread Felipe Balbi


Hi,

Felipe Balbi  writes:
> [ Unknown signature status ]
>
> Hi,
>
> Alan Stern  writes:
>
> [...]
>
>>> $ grep -RnH "raise_exception\|fsg_main_thread" /tmp/trace.txt 
>>> /tmp/trace.txt:111743: irq/17-dwc3-3527  [003] d..164.833078: 
>>> raise_exception: raise_exception 4
>>> /tmp/trace.txt:111745:file-storage-3526  [002] 64.833139: 
>>> fsg_main_thread: get_next_command -> -4
>>> /tmp/trace.txt:111746:file-storage-3526  [002] 64.833140: 
>>> fsg_main_thread: handling exception
>>> /tmp/trace.txt:112950: irq/17-dwc3-3527  [003] d..164.951349: 
>>> raise_exception: raise_exception 4
>>> /tmp/trace.txt:112956:file-storage-3526  [002] 64.951401: 
>>> fsg_main_thread: handling exception
>>> 
>>> Any ideas?
>>
>> I'm afraid not.  The only thing I can think of to try next is complete 
>> tracing of fsg_main_thread, as in the patch below.  It will generate 
>> continuous output as long as the gadget is doing something, but there 
>> doesn't seem to be any way to avoid this.  At least when everything 
>> stops, it should be able to tell us exactly where and why.
>
> tried with your changes plus a trace_printk() added to both
> bulk_out_complete and bulk_in_complete. Here's the diff:

another version of diff and logs. Any ideas of what could be going on?
(full trace compressed and attached)

diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
b/drivers/usb/gadget/function/f_mass_storage.c
index 8f3659b65f53..de0a9ebe7f61 100644
--- a/drivers/usb/gadget/function/f_mass_storage.c
+++ b/drivers/usb/gadget/function/f_mass_storage.c
@@ -395,11 +395,24 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct 
usb_ep *ep)
 /* Caller must hold fsg->lock */
 static void wakeup_thread(struct fsg_common *common)
 {
+   struct task_struct *p = common->thread_task;
+
smp_wmb();  /* ensure the write of bh->state is complete */
/* Tell the main thread that something has happened */
common->thread_wakeup_needed = 1;
-   if (common->thread_task)
-   wake_up_process(common->thread_task);
+
+   trace_printk("wkup needed %d task %p [comm=%s pid=%d prio=%d 
target_cpu=%03d on_rq %d state %lx]\n",
+   common->thread_wakeup_needed, p, p->comm, p->pid,
+   p->prio, task_cpu(p), p->on_rq, p->state);
+
+   if (common->thread_task) {
+   int rc;
+   rc = wake_up_process(common->thread_task);
+   if (rc)
+   trace_printk("thread_task woken up\n");
+   else
+   trace_printk("NOT woken up\n");
+   }
 }
 
 static void raise_exception(struct fsg_common *common, enum fsg_state 
new_state)
@@ -411,6 +424,7 @@ static void raise_exception(struct fsg_common *common, enum 
fsg_state new_state)
 * If a lower-or-equal priority exception is in progress, preempt it
 * and notify the main thread by sending it a signal.
 */
+   trace_printk("raise_exception %d\n", new_state);
spin_lock_irqsave(>lock, flags);
if (common->state <= new_state) {
common->exception_req_tag = common->ep0_req_tag;
@@ -449,6 +463,8 @@ static void bulk_in_complete(struct usb_ep *ep, struct 
usb_request *req)
struct fsg_common   *common = ep->driver_data;
struct fsg_buffhd   *bh = req->context;
 
+   trace_printk("%d, %u/%u\n", req->status, req->actual, req->length);
+
if (req->status || req->actual != req->length)
DBG(common, "%s --> %d, %u/%u\n", __func__,
req->status, req->actual, req->length);
@@ -470,6 +486,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct 
usb_request *req)
struct fsg_buffhd   *bh = req->context;
 
dump_msg(common, "bulk-out", req->buf, req->actual);
+
+   trace_printk("%d, %u/%u\n", req->status, req->actual, 
bh->bulk_out_intended_length);
if (req->status || req->actual != bh->bulk_out_intended_length)
DBG(common, "%s --> %d, %u/%u\n", __func__,
req->status, req->actual, bh->bulk_out_intended_length);
@@ -573,6 +591,10 @@ static void start_transfer(struct fsg_dev *fsg, struct 
usb_ep *ep,
spin_unlock_irq(>common->lock);
 
rc = usb_ep_queue(ep, req, GFP_KERNEL);
+
+   trace_printk("%d: %s: req %u bytes state %d ---> %d\n", __LINE__, 
ep->name,
+   req->length, *state, rc);
+
if (rc == 0)
return;  /* All good, we're done */
 
@@ -2496,6 +2518,7 @@ static void handle_exception(struct fsg_common *common)
 static int fsg_main_thread(void *common_)
 {
struct fsg_common   *common = common_;
+   int rc;
 
/*
 * Allow the thread to be killed by a signal, but set the signal mask
@@ -2519,6 +2542,7 @@ static int fsg_main_thread(void *common_)

Re: Memory barrier needed with wake_up_process()?

2016-09-19 Thread Alan Stern

On Mon, 19 Sep 2016, Felipe Balbi wrote:

> >> file-storage-3578  [002]  21167.727072: fsg_main_thread: next: bh 
> >> 880111e69a80 state 1
> >> file-storage-3578  [002]  21167.729458: fsg_main_thread: next: bh 
> >> 880111e6aac0 state 2
> >>  irq/17-dwc3-3579  [003] d..1 21167.729666: bulk_out_complete: compl: 
> >> bh 880111e6aac0 state 1
> >> file-storage-3578  [002]  21167.729670: fsg_main_thread: next: bh 
> >> 880111e6aac0 state 1
> >
> > And this is where everything stopped?
> 
> yeah, that's everything.
> 
> > This also looks normal.  So the question is what happened when 
> > get_next_command() returned after this?
> >
> > Felipe, maybe the patch below (in place of your current patch) will
> > help.  Since the events that it logs are all supposed to be unusual,
> > you can use printk if you want, but I wrote it with trace_printk.
> 
> I've applied your patch and it wasn't giving me any output, which hinted
> that g_mass_storage wasn't returning any failures. So I enabled dwc3's
> traces to get more data out of it. Here's the final snippet (with
> comments, again). Let me know if you want the entire thing (it's
> ~14MiB).
> 
> > irq/17-dwc3-3527  [003] d..134.215214: dwc3_event: event (00110301): 
> > Link Change [U1]
> > irq/17-dwc3-3527  [003] d..134.215214: dwc3_event: event (4086): 
> > ep1in: Transfer In-Progress
> > irq/17-dwc3-3527  [003] d..134.215214: dwc3_complete_trb: ep1in: 2/255 
> > trb 880084813e70 buf 80808000 size 0 ctrl 0c18 
> > (hlcS:SC:normal)
> > irq/17-dwc3-3527  [003] d..134.215214: dwc3_gadget_giveback: ep1in: req 
> > 88016d046840 length 16384/16384 zsI ==> 0
> > irq/17-dwc3-3527  [003] d...34.215215: usb_gadget_giveback_request: 
> > ep1in: length 16384/16384 sgs 0/0 stream 0 zsI status 0 --> 0
> > irq/17-dwc3-3527  [003] d..134.215218: dwc3_gadget_ep_cmd: ep1in: cmd 
> > 'Update Transfer' [196615] params    --> status: 
> > Successful
> > irq/17-dwc3-3527  [003] d..134.215281: dwc3_event: event (00100301): 
> > Link Change [U0]
> > irq/17-dwc3-3527  [003] d..134.215281: dwc3_event: event (00110301): 
> > Link Change [U1]
> > irq/17-dwc3-3527  [003] d..134.215282: dwc3_event: event (4086): 
> > ep1in: Transfer In-Progress
> > irq/17-dwc3-3527  [003] d..134.215282: dwc3_complete_trb: ep1in: 1/255 
> > trb 880084813e80 buf 8080c000 size 0 ctrl 0c18 
> > (hlcS:SC:normal)
> > irq/17-dwc3-3527  [003] d..134.215282: dwc3_gadget_giveback: ep1in: req 
> > 88016d046f00 length 13/13 zsI ==> 0
> > irq/17-dwc3-3527  [003] d...34.215283: usb_gadget_giveback_request: 
> > ep1in: length 13/13 sgs 0/0 stream 0 zsI status 0 --> 0
> > irq/17-dwc3-3527  [003] d..134.215284: dwc3_event: event (00100301): 
> > Link Change [U0]
> > irq/17-dwc3-3527  [003] d..134.215284: dwc3_event: event (00110301): 
> > Link Change [U1]
> > irq/17-dwc3-3527  [003] d..134.215284: dwc3_event: event (6084): 
> > ep1out: Transfer In-Progress
> > irq/17-dwc3-3527  [003] d..134.215284: dwc3_complete_trb: ep1out: 1/255 
> > trb 880084804320 buf 8080c800 size 993 ctrl 0c18 
> > (hlcS:SC:normal)
> > irq/17-dwc3-3527  [003] d..134.215285: dwc3_gadget_giveback: ep1out: 
> > req 880154205300 length 31/1024 zsI ==> 0
> > irq/17-dwc3-3527  [003] d...34.215285: usb_gadget_giveback_request: 
> > ep1out: length 31/1024 sgs 0/0 stream 0 zsI status 0 --> 0
> 
> completed and gave back CBW.
> 
> > irq/17-dwc3-3527  [003] d..134.215349: dwc3_event: event (00100301): 
> > Link Change [U0]
> > irq/17-dwc3-3527  [003] d..134.215349: dwc3_event: event (00110301): 
> > Link Change [U1]
> > irq/17-dwc3-3527  [003] d..134.225616: dwc3_event: event (00120301): 
> > Link Change [U2]
> > irq/17-dwc3-3527  [003] d...34.225617: usb_gadget_vbus_draw: speed 5/5 
> > state 7 0mA [sg:out_aligned:self-powered:activated:connected] --> -95
> > irq/17-dwc3-3527  [003] d..164.832221: dwc3_event: event (00100301): 
> > Link Change [U0]
> 
> 30 seconds of nothing.
> 
> > irq/17-dwc3-3527  [003] d..164.832240: dwc3_event: event (c040): 
> > ep0out: Transfer Complete
> > irq/17-dwc3-3527  [003] d..164.832243: dwc3_ep0: ep0out: Transfer 
> > Complete: state 'Setup Phase'
> > irq/17-dwc3-3527  [003] d..164.832243: dwc3_ep0: Setup Phase
> > irq/17-dwc3-3527  [003] d..164.832244: dwc3_ctrl_req: bRequestType 00 
> > bRequest 01 wValue 0030 wIndex  wLength 0
> > irq/17-dwc3-3527  [003] d..164.832244: dwc3_ep0: USB_REQ_CLEAR_FEATURE
> > irq/17-dwc3-3527  [003] d..164.832253: dwc3_event: event (20c2): 
> > ep0in: Transfer Not Ready (Not Active)
> > irq/17-dwc3-3527  [003] d..164.832254: dwc3_ep0: ep0in: Transfer Not 
> > Ready (Not Active): state 'Setup Phase'
> > irq/17-dwc3-3527  [003] d..164.832254: dwc3_ep0: Control Status
> > irq/17-dwc3-3527  [003] d..164.832255: dwc3_prepare_trb:

Re: Memory barrier needed with wake_up_process()?

2016-09-19 Thread Felipe Balbi


Hi Alan,

Alan Stern  writes:
> On Fri, 9 Sep 2016, Felipe Balbi wrote:
>
>> Finally :-) Here's the diff I used:
>> 
>> diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
>> b/drivers/usb/gadget/function/f_mass_storage.c
>> index 8f3659b65f53..0716024f6b65 100644
>> --- a/drivers/usb/gadget/function/f_mass_storage.c
>> +++ b/drivers/usb/gadget/function/f_mass_storage.c
>> @@ -481,6 +481,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct 
>> usb_request *req)
>> spin_lock(>lock);
>> bh->outreq_busy = 0;
>> bh->state = BUF_STATE_FULL;
>> +   if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
>> +   trace_printk("compl: bh %p state %d\n", bh, bh->state);
>> wakeup_thread(common);
>> spin_unlock(>lock);
>>  }
>> @@ -2208,6 +2210,7 @@ static int get_next_command(struct fsg_common *common)
>> rc = sleep_thread(common, true);
>> if (rc)
>> return rc;
>> +   trace_printk("next: bh %p state %d\n", bh, bh->state);
>> }
>> smp_rmb();
>> rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;
>> 
>> 
>> And here's trace output:
>> 
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 1002/1002   #P:4
>> #
>> #  _-=> irqs-off
>> # / _=> need-resched
>> #| / _---=> hardirq/softirq
>> #|| / _--=> preempt-depth
>> #||| / delay
>> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
>> #  | |   |      | |
>> file-storage-3578  [000]  21166.789127: fsg_main_thread: next: bh 
>> 880111e69a00 state 2
>> file-storage-3578  [000]  21166.789312: fsg_main_thread: next: bh 
>> 880111e69a00 state 2
>>  irq/17-dwc3-3579  [003] d..1 21166.789395: bulk_out_complete: compl: bh 
>> 880111e69a00 state 1
>> file-storage-3578  [000]  21166.789445: fsg_main_thread: next: bh 
>> 880111e69a00 state 1
>
> Okay, that's normal.  2 = BUF_STATE_BUSY, 1 = BUF_STATE_FULL.  So we get woken
> up a couple of times while the transfer is in progress (probably because some
> earlier buffers have finished transferring), then the CBW transfer completes
> and the buffer is read.
>
> ...
>
>> file-storage-3578  [002]  21167.726827: fsg_main_thread: next: bh 
>> 880111e69a80 state 2
>>  irq/17-dwc3-3579  [003] d..1 21167.727066: bulk_out_complete: compl: bh 
>> 880111e69a80 state 1
>> file-storage-3578  [002]  21167.727072: fsg_main_thread: next: bh 
>> 880111e69a80 state 1
>> file-storage-3578  [002]  21167.729458: fsg_main_thread: next: bh 
>> 880111e6aac0 state 2
>>  irq/17-dwc3-3579  [003] d..1 21167.729666: bulk_out_complete: compl: bh 
>> 880111e6aac0 state 1
>> file-storage-3578  [002]  21167.729670: fsg_main_thread: next: bh 
>> 880111e6aac0 state 1
>
> And this is where everything stopped?

yeah, that's everything.

> This also looks normal.  So the question is what happened when 
> get_next_command() returned after this?
>
> Felipe, maybe the patch below (in place of your current patch) will
> help.  Since the events that it logs are all supposed to be unusual,
> you can use printk if you want, but I wrote it with trace_printk.

I've applied your patch and it wasn't giving me any output, which hinted
that g_mass_storage wasn't returning any failures. So I enabled dwc3's
traces to get more data out of it. Here's the final snippet (with
comments, again). Let me know if you want the entire thing (it's
~14MiB).

> irq/17-dwc3-3527  [003] d..134.215214: dwc3_event: event (00110301): Link 
> Change [U1]
> irq/17-dwc3-3527  [003] d..134.215214: dwc3_event: event (4086): 
> ep1in: Transfer In-Progress
> irq/17-dwc3-3527  [003] d..134.215214: dwc3_complete_trb: ep1in: 2/255 
> trb 880084813e70 buf 80808000 size 0 ctrl 0c18 
> (hlcS:SC:normal)
> irq/17-dwc3-3527  [003] d..134.215214: dwc3_gadget_giveback: ep1in: req 
> 88016d046840 length 16384/16384 zsI ==> 0
> irq/17-dwc3-3527  [003] d...34.215215: usb_gadget_giveback_request: 
> ep1in: length 16384/16384 sgs 0/0 stream 0 zsI status 0 --> 0
> irq/17-dwc3-3527  [003] d..134.215218: dwc3_gadget_ep_cmd: ep1in: cmd 
> 'Update Transfer' [196615] params    --> status: 
> Successful
> irq/17-dwc3-3527  [003] d..134.215281: dwc3_event: event (00100301): Link 
> Change [U0]
> irq/17-dwc3-3527  [003] d..134.215281: dwc3_event: event (00110301): Link 
> Change [U1]
> irq/17-dwc3-3527  [003] d..134.215282: dwc3_event: event (4086): 
> ep1in: Transfer In-Progress
> irq/17-dwc3-3527  [003] d..134.215282: dwc3_complete_trb: ep1in: 1/255 
> trb 880084813e80 buf 8080c000 size 0 ctrl 0c18 
> (hlcS:SC:normal)
>

Re: Memory barrier needed with wake_up_process()?

2016-09-09 Thread Alan Stern

On Fri, 9 Sep 2016, Felipe Balbi wrote:

> Finally :-) Here's the diff I used:
> 
> diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
> b/drivers/usb/gadget/function/f_mass_storage.c
> index 8f3659b65f53..0716024f6b65 100644
> --- a/drivers/usb/gadget/function/f_mass_storage.c
> +++ b/drivers/usb/gadget/function/f_mass_storage.c
> @@ -481,6 +481,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct 
> usb_request *req)
> spin_lock(>lock);
> bh->outreq_busy = 0;
> bh->state = BUF_STATE_FULL;
> +   if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
> +   trace_printk("compl: bh %p state %d\n", bh, bh->state);
> wakeup_thread(common);
> spin_unlock(>lock);
>  }
> @@ -2208,6 +2210,7 @@ static int get_next_command(struct fsg_common *common)
> rc = sleep_thread(common, true);
> if (rc)
> return rc;
> +   trace_printk("next: bh %p state %d\n", bh, bh->state);
> }
> smp_rmb();
> rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;
> 
> 
> And here's trace output:
> 
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 1002/1002   #P:4
> #
> #  _-=> irqs-off
> # / _=> need-resched
> #| / _---=> hardirq/softirq
> #|| / _--=> preempt-depth
> #||| / delay
> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
> #  | |   |      | |
> file-storage-3578  [000]  21166.789127: fsg_main_thread: next: bh 
> 880111e69a00 state 2
> file-storage-3578  [000]  21166.789312: fsg_main_thread: next: bh 
> 880111e69a00 state 2
>  irq/17-dwc3-3579  [003] d..1 21166.789395: bulk_out_complete: compl: bh 
> 880111e69a00 state 1
> file-storage-3578  [000]  21166.789445: fsg_main_thread: next: bh 
> 880111e69a00 state 1

Okay, that's normal.  2 = BUF_STATE_BUSY, 1 = BUF_STATE_FULL.  So we get woken
up a couple of times while the transfer is in progress (probably because some
earlier buffers have finished transferring), then the CBW transfer completes
and the buffer is read.

...

> file-storage-3578  [002]  21167.726827: fsg_main_thread: next: bh 
> 880111e69a80 state 2
>  irq/17-dwc3-3579  [003] d..1 21167.727066: bulk_out_complete: compl: bh 
> 880111e69a80 state 1
> file-storage-3578  [002]  21167.727072: fsg_main_thread: next: bh 
> 880111e69a80 state 1
> file-storage-3578  [002]  21167.729458: fsg_main_thread: next: bh 
> 880111e6aac0 state 2
>  irq/17-dwc3-3579  [003] d..1 21167.729666: bulk_out_complete: compl: bh 
> 880111e6aac0 state 1
> file-storage-3578  [002]  21167.729670: fsg_main_thread: next: bh 
> 880111e6aac0 state 1

And this is where everything stopped?

This also looks normal.  So the question is what happened when 
get_next_command() returned after this?

Felipe, maybe the patch below (in place of your current patch) will
help.  Since the events that it logs are all supposed to be unusual,
you can use printk if you want, but I wrote it with trace_printk.

Alan Stern



Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
===
--- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c
+++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
@@ -415,6 +415,7 @@ static void raise_exception(struct fsg_c
 * If a lower-or-equal priority exception is in progress, preempt it
 * and notify the main thread by sending it a signal.
 */
+   trace_printk("raise_exception %d\n", new_state);
spin_lock_irqsave(>lock, flags);
if (common->state <= new_state) {
common->exception_req_tag = common->ep0_req_tag;
@@ -2495,6 +2496,7 @@ static void handle_exception(struct fsg_
 static int fsg_main_thread(void *common_)
 {
struct fsg_common   *common = common_;
+   int rc;
 
/*
 * Allow the thread to be killed by a signal, but set the signal mask
@@ -2518,6 +2520,7 @@ static int fsg_main_thread(void *common_
/* The main loop */
while (common->state != FSG_STATE_TERMINATED) {
if (exception_in_progress(common) || signal_pending(current)) {
+   trace_printk("handling exception\n");
handle_exception(common);
continue;
}
@@ -2527,24 +2530,38 @@ static int fsg_main_thread(void *common_
continue;
}
 
-   if (get_next_command(common))
+   rc = get_next_command(common);
+   if (rc) {
+   trace_printk("get_next_command -> %d\n", rc);
continue;
+   }

Re: Memory barrier needed with wake_up_process()?

2016-09-09 Thread Felipe Balbi


Hi,

Felipe Balbi  writes:
> Alan Stern  writes:
>> On Tue, 6 Sep 2016, Peter Zijlstra wrote:
>>
>>> On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote:
>>> > On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote:
>>> 
>>> > > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just
>>> > > adds extra overhead which makes the problem much, much less likely to
>>> > > happen. Does that sound plausible to you?
>>> > 
>>> > I did consider that, but I've not sufficiently grokked the code to rule
>>> > out actual fail. So let me stare at this a bit more.
>>> 
>>> OK, so I'm really not seeing it, we've got:
>>> 
>>> while (bh->state != FULL) {
>>> for (;;) {
>>> set_current_state(INTERRUPTIBLE); /* MB after */
>>> if (signal_pending(current))
>>> return -EINTR;
>>> if (common->thread_wakeup_needed)
>>> break;
>>> schedule(); /* MB */
>>> }
>>> __set_current_state(RUNNING);
>>> common->thread_wakeup_needed = 0;
>>> smp_rmb(); /* NOP */
>>> }
>>> 
>>> 
>>> VS.
>>> 
>>> 
>>> spin_lock(>lock); /* MB */
>>> bh->state = FULL;
>>> smp_wmb(); /* NOP */
>>> common->thread_wakeup_needed = 1;
>>> wake_up_process(common->thread_task); /* MB before */
>>> spin_unlock(>lock);
>>> 
>>> 
>>> 
>>> (the MB annotations specific to x86, not true in general)
>>> 
>>> 
>>> If we observe thread_wakeup_needed, we must also observe bh->state.
>>> 
>>> And the sleep/wakeup ordering is also correct, we either see
>>> thread_wakeup_needed and continue, or we see task->state == RUNNING
>>> (from the wakeup) and NO-OP schedule(). The MB from set_current_statE()
>>> then matches with the MB from wake_up_process() to ensure we must see
>>> thead_wakeup_needed.
>>> 
>>> Or, we go sleep, and get woken up, at which point the same happens.
>>> Since the waking CPU gets the task back on its RQ the happens-before
>>> chain includes the waking CPUs state along with the state of the task
>>> itself before it went to sleep.
>>> 
>>> At which point we're back where we started, once we see
>>> thread_wakeup_needed we must then also see bh->state (and all state
>>> prior to that on the waking CPU).
>>> 
>>> 
>>> 
>>> There's enough cruft in the while-sleep loop to force reload bh->state.
>>> 
>>> Load/store tearing cannot be a problem because all values are single
>>> bytes (the variables are multi bytes, but all values used only affect
>>> the LSB).
>>> 
>>> Colour me puzzled.
>>
>> Felipe, can you please try this patch on an unmodified tree?  If the 
>> problem still occurs, what shows up in the kernel log?
>>
>> Alan Stern
>>
>>
>>
>> Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
>> ===
>> --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c
>> +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
>> @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb
>>  spin_lock(>lock);
>>  bh->outreq_busy = 0;
>>  bh->state = BUF_STATE_FULL;
>> +if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
>> +INFO(common, "compl: bh %p state %d\n", bh, bh->state);
>>  wakeup_thread(common);
>>  spin_unlock(>lock);
>>  }
>> @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c
>>  rc = sleep_thread(common, true);
>>  if (rc)
>>  return rc;
>> +INFO(common, "next: bh %p state %d\n", bh, bh->state);
>>  }
>>  smp_rmb();
>>  rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;
>
> I've replace INFO() with trace_printk() (which is what I have been using
> anyway):
>
> diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
> b/drivers/usb/gadget/function/f_mass_storage.c
> index 2505117e88e8..dbc6a380b38b 100644
> --- a/drivers/usb/gadget/function/f_mass_storage.c
> +++ b/drivers/usb/gadget/function/f_mass_storage.c
> @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct 
> usb_request *req)
>   spin_lock(>lock);
>   bh->outreq_busy = 0;
>   bh->state = BUF_STATE_FULL;
> + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
> + trace_printk("compl: bh %p state %d\n", bh, bh->state);
>   wakeup_thread(common);
>   spin_unlock(>lock);
>  }
> @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_common *common)
>   rc = sleep_thread(common, true);
>   if (rc)
>   return rc;
> + trace_printk("next: bh %p state %d\n", bh, bh->state);
>   }
>   smp_rmb();
>   rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;
>
> But I can't reproduce as reliably as before. I'll keep the thing running
> an infinite loop which will stop only when interrupts in UDC (dwc3 in
> this case) stop

Re: Memory barrier needed with wake_up_process()?

2016-09-07 Thread Felipe Balbi


Hi,

Alan Stern  writes:
> On Tue, 6 Sep 2016, Peter Zijlstra wrote:
>
>> On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote:
>> > On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote:
>> 
>> > > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just
>> > > adds extra overhead which makes the problem much, much less likely to
>> > > happen. Does that sound plausible to you?
>> > 
>> > I did consider that, but I've not sufficiently grokked the code to rule
>> > out actual fail. So let me stare at this a bit more.
>> 
>> OK, so I'm really not seeing it, we've got:
>> 
>> while (bh->state != FULL) {
>> for (;;) {
>> set_current_state(INTERRUPTIBLE); /* MB after */
>> if (signal_pending(current))
>> return -EINTR;
>> if (common->thread_wakeup_needed)
>> break;
>> schedule(); /* MB */
>> }
>> __set_current_state(RUNNING);
>> common->thread_wakeup_needed = 0;
>> smp_rmb(); /* NOP */
>> }
>> 
>> 
>> VS.
>> 
>> 
>> spin_lock(>lock); /* MB */
>> bh->state = FULL;
>> smp_wmb(); /* NOP */
>> common->thread_wakeup_needed = 1;
>> wake_up_process(common->thread_task); /* MB before */
>> spin_unlock(>lock);
>> 
>> 
>> 
>> (the MB annotations specific to x86, not true in general)
>> 
>> 
>> If we observe thread_wakeup_needed, we must also observe bh->state.
>> 
>> And the sleep/wakeup ordering is also correct, we either see
>> thread_wakeup_needed and continue, or we see task->state == RUNNING
>> (from the wakeup) and NO-OP schedule(). The MB from set_current_statE()
>> then matches with the MB from wake_up_process() to ensure we must see
>> thead_wakeup_needed.
>> 
>> Or, we go sleep, and get woken up, at which point the same happens.
>> Since the waking CPU gets the task back on its RQ the happens-before
>> chain includes the waking CPUs state along with the state of the task
>> itself before it went to sleep.
>> 
>> At which point we're back where we started, once we see
>> thread_wakeup_needed we must then also see bh->state (and all state
>> prior to that on the waking CPU).
>> 
>> 
>> 
>> There's enough cruft in the while-sleep loop to force reload bh->state.
>> 
>> Load/store tearing cannot be a problem because all values are single
>> bytes (the variables are multi bytes, but all values used only affect
>> the LSB).
>> 
>> Colour me puzzled.
>
> Felipe, can you please try this patch on an unmodified tree?  If the 
> problem still occurs, what shows up in the kernel log?
>
> Alan Stern
>
>
>
> Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
> ===
> --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c
> +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
> @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb
>   spin_lock(>lock);
>   bh->outreq_busy = 0;
>   bh->state = BUF_STATE_FULL;
> + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
> + INFO(common, "compl: bh %p state %d\n", bh, bh->state);
>   wakeup_thread(common);
>   spin_unlock(>lock);
>  }
> @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c
>   rc = sleep_thread(common, true);
>   if (rc)
>   return rc;
> + INFO(common, "next: bh %p state %d\n", bh, bh->state);
>   }
>   smp_rmb();
>   rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;

I've replace INFO() with trace_printk() (which is what I have been using
anyway):

diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
b/drivers/usb/gadget/function/f_mass_storage.c
index 2505117e88e8..dbc6a380b38b 100644
--- a/drivers/usb/gadget/function/f_mass_storage.c
+++ b/drivers/usb/gadget/function/f_mass_storage.c
@@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb_ep *ep, struct 
usb_request *req)
spin_lock(>lock);
bh->outreq_busy = 0;
bh->state = BUF_STATE_FULL;
+   if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
+   trace_printk("compl: bh %p state %d\n", bh, bh->state);
wakeup_thread(common);
spin_unlock(>lock);
 }
@@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_common *common)
rc = sleep_thread(common, true);
if (rc)
return rc;
+   trace_printk("next: bh %p state %d\n", bh, bh->state);
}
smp_rmb();
rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;

But I can't reproduce as reliably as before. I'll keep the thing running
an infinite loop which will stop only when interrupts in UDC (dwc3 in
this case) stop increasing.

-- 
balbi


signature.asc
Description: PGP signature

Re: Memory barrier needed with wake_up_process()?

2016-09-06 Thread Peter Zijlstra

On Tue, Sep 06, 2016 at 10:46:55AM -0400, Alan Stern wrote:

Not knowing where INFO() goes, you should use trace_printk() not
printk(), as the former is strictly per cpu, while the latter is
globally serialized and can hide all these problems.

> Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
> ===
> --- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c
> +++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
> @@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb
>   spin_lock(>lock);
>   bh->outreq_busy = 0;
>   bh->state = BUF_STATE_FULL;
> + if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
> + INFO(common, "compl: bh %p state %d\n", bh, bh->state);
>   wakeup_thread(common);
>   spin_unlock(>lock);
>  }
> @@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c
>   rc = sleep_thread(common, true);
>   if (rc)
>   return rc;
> + INFO(common, "next: bh %p state %d\n", bh, bh->state);
>   }
>   smp_rmb();
>   rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-06 Thread Alan Stern

On Tue, 6 Sep 2016, Peter Zijlstra wrote:

> On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote:
> 
> > > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just
> > > adds extra overhead which makes the problem much, much less likely to
> > > happen. Does that sound plausible to you?
> > 
> > I did consider that, but I've not sufficiently grokked the code to rule
> > out actual fail. So let me stare at this a bit more.
> 
> OK, so I'm really not seeing it, we've got:
> 
> while (bh->state != FULL) {
> for (;;) {
> set_current_state(INTERRUPTIBLE); /* MB after */
> if (signal_pending(current))
> return -EINTR;
> if (common->thread_wakeup_needed)
> break;
> schedule(); /* MB */
> }
> __set_current_state(RUNNING);
> common->thread_wakeup_needed = 0;
> smp_rmb(); /* NOP */
> }
> 
> 
> VS.
> 
> 
> spin_lock(>lock); /* MB */
> bh->state = FULL;
> smp_wmb(); /* NOP */
> common->thread_wakeup_needed = 1;
> wake_up_process(common->thread_task); /* MB before */
> spin_unlock(>lock);
> 
> 
> 
> (the MB annotations specific to x86, not true in general)
> 
> 
> If we observe thread_wakeup_needed, we must also observe bh->state.
> 
> And the sleep/wakeup ordering is also correct, we either see
> thread_wakeup_needed and continue, or we see task->state == RUNNING
> (from the wakeup) and NO-OP schedule(). The MB from set_current_statE()
> then matches with the MB from wake_up_process() to ensure we must see
> thead_wakeup_needed.
> 
> Or, we go sleep, and get woken up, at which point the same happens.
> Since the waking CPU gets the task back on its RQ the happens-before
> chain includes the waking CPUs state along with the state of the task
> itself before it went to sleep.
> 
> At which point we're back where we started, once we see
> thread_wakeup_needed we must then also see bh->state (and all state
> prior to that on the waking CPU).
> 
> 
> 
> There's enough cruft in the while-sleep loop to force reload bh->state.
> 
> Load/store tearing cannot be a problem because all values are single
> bytes (the variables are multi bytes, but all values used only affect
> the LSB).
> 
> Colour me puzzled.

Felipe, can you please try this patch on an unmodified tree?  If the 
problem still occurs, what shows up in the kernel log?

Alan Stern



Index: usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
===
--- usb-4.x.orig/drivers/usb/gadget/function/f_mass_storage.c
+++ usb-4.x/drivers/usb/gadget/function/f_mass_storage.c
@@ -485,6 +485,8 @@ static void bulk_out_complete(struct usb
spin_lock(>lock);
bh->outreq_busy = 0;
bh->state = BUF_STATE_FULL;
+   if (bh->bulk_out_intended_length == US_BULK_CB_WRAP_LEN)
+   INFO(common, "compl: bh %p state %d\n", bh, bh->state);
wakeup_thread(common);
spin_unlock(>lock);
 }
@@ -2207,6 +2209,7 @@ static int get_next_command(struct fsg_c
rc = sleep_thread(common, true);
if (rc)
return rc;
+   INFO(common, "next: bh %p state %d\n", bh, bh->state);
}
smp_rmb();
rc = fsg_is_set(common) ? received_cbw(common->fsg, bh) : -EIO;

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-06 Thread Peter Zijlstra

On Tue, Sep 06, 2016 at 01:49:37PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote:

> > My fear now, however, is that changing smp_[rw]mb() to smp_mb() just
> > adds extra overhead which makes the problem much, much less likely to
> > happen. Does that sound plausible to you?
> 
> I did consider that, but I've not sufficiently grokked the code to rule
> out actual fail. So let me stare at this a bit more.

OK, so I'm really not seeing it, we've got:

while (bh->state != FULL) {
for (;;) {
set_current_state(INTERRUPTIBLE); /* MB after */
if (signal_pending(current))
return -EINTR;
if (common->thread_wakeup_needed)
break;
schedule(); /* MB */
}
__set_current_state(RUNNING);
common->thread_wakeup_needed = 0;
smp_rmb(); /* NOP */
}

VS.

spin_lock(>lock); /* MB */
bh->state = FULL;
smp_wmb(); /* NOP */
common->thread_wakeup_needed = 1;
wake_up_process(common->thread_task); /* MB before */
spin_unlock(>lock);

(the MB annotations specific to x86, not true in general)

If we observe thread_wakeup_needed, we must also observe bh->state.

And the sleep/wakeup ordering is also correct, we either see
thread_wakeup_needed and continue, or we see task->state == RUNNING
(from the wakeup) and NO-OP schedule(). The MB from set_current_statE()
then matches with the MB from wake_up_process() to ensure we must see
thead_wakeup_needed.

Or, we go sleep, and get woken up, at which point the same happens.
Since the waking CPU gets the task back on its RQ the happens-before
chain includes the waking CPUs state along with the state of the task
itself before it went to sleep.

At which point we're back where we started, once we see
thread_wakeup_needed we must then also see bh->state (and all state
prior to that on the waking CPU).

There's enough cruft in the while-sleep loop to force reload bh->state.

Load/store tearing cannot be a problem because all values are single
bytes (the variables are multi bytes, but all values used only affect
the LSB).

Colour me puzzled.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-06 Thread Peter Zijlstra

On Tue, Sep 06, 2016 at 02:43:39PM +0300, Felipe Balbi wrote:

> > Could you confirm that bulk_{in,out}_complete() work on different
> > usb_request structures, and they can not, at any time, get called on the
> > _same_ request?
> 
> usb_requests are allocated for a specific endpoint and USB Device
> Controller (UDC) drivers refuse to queue requests allocated for epX to
> epY, so this really can never happen.

Good, thanks!

> My fear now, however, is that changing smp_[rw]mb() to smp_mb() just
> adds extra overhead which makes the problem much, much less likely to
> happen. Does that sound plausible to you?

I did consider that, but I've not sufficiently grokked the code to rule
out actual fail. So let me stare at this a bit more.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-06 Thread Felipe Balbi


Hi,

Peter Zijlstra  writes:
> On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote:
>> On Mon, 5 Sep 2016, Peter Zijlstra wrote:
>> 
>> > > Actually, seeing it written out like this, one realizes that it really 
>> > > ought to be:
>> > > 
>> > >  CPU 0   CPU 1
>> > > --
>> > > Start DMAHandle DMA-complete irq
>> > > Sleep until bh->statesmp_mb()
>> > >  set bh->state
>> > >  Wake up CPU 0
>> > >  smp_mb()
>> > >  Compute rc based on contents of the DMA buffer
>> > > 
>> > > (Bear in mind also that on some platforms, the I/O operation is carried 
>> > > out by PIO rather than DMA.)
>
>> > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC.
>> > Its only defined to do CPU/CPU interactions.
>> 
>> Suppose the DMA master finishes filling up an input buffer and issues a
>> completion irq to CPU1.  Presumably the data would then be visible to
>> CPU1 if the interrupt handler looked at it.  So if CPU1 executes smp_mb
>> before setting bh->state and waking up CPU0, and if CPU0 executes
>> smp_mb after testing bh->state and before reading the data buffer,
>> wouldn't CPU0 then see the correct data in the buffer?  Even if CPU0 
>> never did go to sleep?
>
> Couple of notes here; I would expect the DMA master to make its stores
> _globally_ visible on 'completion'. Because I'm not sure our smp_mb()
> would help make it globally visible, since its only defined on CPU/CPU
> interactions, not external.
>
> Note that for example ARM has the distinction where smp_mb() uses
> DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST} and mb() uses DSB-SY.
>
> The ISH domain is the Inner-SHarable (IIRC) and only includes CPUs. The
> OSH is Outer-SHarable and adds external agents like DMA (also includes
> CPUs). The DSB-SY thing is even heavier and syncs world or something; I
> always forget these details.
>
>> Or would something more be needed?
>
> The thing is, this is x86 (TSO). Most everything is globally
> ordered/visible and full barriers.
>
> The only reorder allowed by TSO is a later read can happen before a
> prior store. That is, only:
>
>   X = 1;
>   smp_mb();
>   r = Y;
>
> actually needs a full barrier. All the other variants are no-ops.
>
> Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()).
> Which seems to imply DMA is coherent and globally visible.
>
>> > I would very much expect the device IO stuff to order things for us in
>> > this case. "Start DMA" should very much include sufficient fences to
>> > ensure the data under DMA is visible to the DMA engine, this would very
>> > much include things like flushing store buffers and maybe even writeback
>> > caches, depending on platform needs.
>> > 
>> > At the same time, I would expect "Handle DMA-complete irq", even if it
>> > were done with a PIO polling loop, to guarantee ordering against later
>> > operations such that 'complete' really means that.
>> 
>> That's what I would expect too.
>> 
>> Back in the original email thread where the problem was first reported, 
>> Felipe says that the problem appears to be something else.  Here's what 
>> it looks like now, in schematic form:
>> 
>>  CPU0
>>  
>>  get_next_command():
>>while (bh->state != BUF_STATE_EMPTY)
>>  sleep_thread();
>>start an input request for bh
>>while (bh->state != BUF_STATE_FULL)
>>  sleep_thread();
>> 
>> As mentioned above, the input involves DMA and is terminated by an irq.
>> The request's completion handler is bulk_out_complete():
>> 
>>  CPU1
>>  
>>  bulk_out_complete():
>>bh->state = BUF_STATE_FULL;
>>wakeup_thread();
>> 
>> According to Felipe, when CPU0 wakes up and checks bh->state, it sees a 
>> value different from BUF_STATE_FULL.  So it goes back to sleep again 
>> and doesn't make any forward progress.
>> 
>> It's possible that something else is changing bh->state when it 
>> shouldn't.  But if this were the explanation, why would Felipe see that 
>> the problem goes away when he changes the memory barriers in 
>> sleep_thread() and wakeup_thread() to smp_mb()?
>
> Good question, let me stare at the code more.
>
> Could you confirm that bulk_{in,out}_complete() work on different
> usb_request structures, and they can not, at any time, get called on the
> _same_ request?

usb_requests are allocated for a specific endpoint and USB Device
Controller (UDC) drivers refuse to queue requests allocated for epX to
epY, so this really can never happen.

My fear now, however, is that changing smp_[rw]mb() to smp_mb() just
adds extra overhead which makes the problem much, much less likely to
happen. Does that sound plausible to you?

-- 
balbi


signature.asc
Description: PGP signature

Re: Memory barrier needed with wake_up_process()?

2016-09-06 Thread Peter Zijlstra

On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote:
> On Mon, 5 Sep 2016, Peter Zijlstra wrote:
> 
> > > Actually, seeing it written out like this, one realizes that it really 
> > > ought to be:
> > > 
> > >   CPU 0   CPU 1
> > > - -
> > > Start DMA Handle DMA-complete irq
> > > Sleep until bh->state smp_mb()
> > >   set bh->state
> > >   Wake up CPU 0
> > >   smp_mb()
> > >   Compute rc based on contents of the DMA buffer
> > > 
> > > (Bear in mind also that on some platforms, the I/O operation is carried 
> > > out by PIO rather than DMA.)

> > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC.
> > Its only defined to do CPU/CPU interactions.
> 
> Suppose the DMA master finishes filling up an input buffer and issues a
> completion irq to CPU1.  Presumably the data would then be visible to
> CPU1 if the interrupt handler looked at it.  So if CPU1 executes smp_mb
> before setting bh->state and waking up CPU0, and if CPU0 executes
> smp_mb after testing bh->state and before reading the data buffer,
> wouldn't CPU0 then see the correct data in the buffer?  Even if CPU0 
> never did go to sleep?

Couple of notes here; I would expect the DMA master to make its stores
_globally_ visible on 'completion'. Because I'm not sure our smp_mb()
would help make it globally visible, since its only defined on CPU/CPU
interactions, not external.

Note that for example ARM has the distinction where smp_mb() uses
DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST} and mb() uses DSB-SY.

The ISH domain is the Inner-SHarable (IIRC) and only includes CPUs. The
OSH is Outer-SHarable and adds external agents like DMA (also includes
CPUs). The DSB-SY thing is even heavier and syncs world or something; I
always forget these details.

> Or would something more be needed?

The thing is, this is x86 (TSO). Most everything is globally
ordered/visible and full barriers.

The only reorder allowed by TSO is a later read can happen before a
prior store. That is, only:

X = 1;
smp_mb();
r = Y;

actually needs a full barrier. All the other variants are no-ops.

Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()).
Which seems to imply DMA is coherent and globally visible.

> > I would very much expect the device IO stuff to order things for us in
> > this case. "Start DMA" should very much include sufficient fences to
> > ensure the data under DMA is visible to the DMA engine, this would very
> > much include things like flushing store buffers and maybe even writeback
> > caches, depending on platform needs.
> > 
> > At the same time, I would expect "Handle DMA-complete irq", even if it
> > were done with a PIO polling loop, to guarantee ordering against later
> > operations such that 'complete' really means that.
> 
> That's what I would expect too.
> 
> Back in the original email thread where the problem was first reported, 
> Felipe says that the problem appears to be something else.  Here's what 
> it looks like now, in schematic form:
> 
>   CPU0
>   
>   get_next_command():
> while (bh->state != BUF_STATE_EMPTY)
>   sleep_thread();
> start an input request for bh
> while (bh->state != BUF_STATE_FULL)
>   sleep_thread();
> 
> As mentioned above, the input involves DMA and is terminated by an irq.
> The request's completion handler is bulk_out_complete():
> 
>   CPU1
>   
>   bulk_out_complete():
> bh->state = BUF_STATE_FULL;
> wakeup_thread();
> 
> According to Felipe, when CPU0 wakes up and checks bh->state, it sees a 
> value different from BUF_STATE_FULL.  So it goes back to sleep again 
> and doesn't make any forward progress.
> 
> It's possible that something else is changing bh->state when it 
> shouldn't.  But if this were the explanation, why would Felipe see that 
> the problem goes away when he changes the memory barriers in 
> sleep_thread() and wakeup_thread() to smp_mb()?

Good question, let me stare at the code more.

Could you confirm that bulk_{in,out}_complete() work on different
usb_request structures, and they can not, at any time, get called on the
_same_ request?


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-06 Thread Peter Zijlstra

On Mon, Sep 05, 2016 at 10:43:11AM +0100, Will Deacon wrote:
> On Sat, Sep 03, 2016 at 12:16:29AM +0200, Peter Zijlstra wrote:

> > Forgot to Cc Will. Will, does ARM64 need to make smp_mb__before_spinlock
> > smp_mb() too?
> 
> Yes, probably. Just to confirm, the test is something like:
> 
> 
> CPU0
> 
> 
> Wx=1
> smp_mb__before_spinlock()
> LOCK(y)
> Rz=0
> 
> CPU1
> 
> 
> Wz=1
> smp_mb()
> Rx=0
> 
> 
> and that should be forbidden?

Indeed.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-05 Thread Alan Stern

On Mon, 5 Sep 2016, Peter Zijlstra wrote:

> > Actually, seeing it written out like this, one realizes that it really 
> > ought to be:
> > 
> > CPU 0   CPU 1
> > -   -
> > Start DMA   Handle DMA-complete irq
> > Sleep until bh->state   smp_mb()
> > set bh->state
> > Wake up CPU 0
> > smp_mb()
> > Compute rc based on contents of the DMA buffer
> > 
> > (Bear in mind also that on some platforms, the I/O operation is carried 
> > out by PIO rather than DMA.)
> 
> I'm sorry, but I still don't follow. This could be because I seldom
> interact with DMA agents and therefore am not familiar with that stuff.

I haven't seen the details of memory ordering requirements in the
presence of DMA spelled out anywhere.  The documents I have read merely
state that you have to careful to flush caches before doing DMA OUT and
to avoid filling caches before DMA IN is complete.  Neither of those is
an issue here, apparently.

> Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC.
> Its only defined to do CPU/CPU interactions.

Suppose the DMA master finishes filling up an input buffer and issues a
completion irq to CPU1.  Presumably the data would then be visible to
CPU1 if the interrupt handler looked at it.  So if CPU1 executes smp_mb
before setting bh->state and waking up CPU0, and if CPU0 executes
smp_mb after testing bh->state and before reading the data buffer,
wouldn't CPU0 then see the correct data in the buffer?  Even if CPU0 
never did go to sleep?

Or would something more be needed?

> I would very much expect the device IO stuff to order things for us in
> this case. "Start DMA" should very much include sufficient fences to
> ensure the data under DMA is visible to the DMA engine, this would very
> much include things like flushing store buffers and maybe even writeback
> caches, depending on platform needs.
> 
> At the same time, I would expect "Handle DMA-complete irq", even if it
> were done with a PIO polling loop, to guarantee ordering against later
> operations such that 'complete' really means that.

That's what I would expect too.

Back in the original email thread where the problem was first reported, 
Felipe says that the problem appears to be something else.  Here's what 
it looks like now, in schematic form:

CPU0

get_next_command():
  while (bh->state != BUF_STATE_EMPTY)
sleep_thread();
  start an input request for bh
  while (bh->state != BUF_STATE_FULL)
sleep_thread();

As mentioned above, the input involves DMA and is terminated by an irq.
The request's completion handler is bulk_out_complete():

CPU1

bulk_out_complete():
  bh->state = BUF_STATE_FULL;
  wakeup_thread();

According to Felipe, when CPU0 wakes up and checks bh->state, it sees a 
value different from BUF_STATE_FULL.  So it goes back to sleep again 
and doesn't make any forward progress.

It's possible that something else is changing bh->state when it 
shouldn't.  But if this were the explanation, why would Felipe see that 
the problem goes away when he changes the memory barriers in 
sleep_thread() and wakeup_thread() to smp_mb()?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-05 Thread Alan Stern

On Mon, 5 Sep 2016, Peter Zijlstra wrote:

> > You know, I never went through and verified that _all_ the invocations 
> > of sleep_thread() are like that. 
> 
> Well, thing is, they're all inside a loop which checks other conditions
> for forward progress. Therefore the loop inside sleep_thread() is
> pointless. Even if you were to return early, you'd simply loop in the
> outer loop and go back to sleep again.
> 
> > In fact, I wrote the sleep/wakeup 
> > routines _before_ the rest of the code, and I didn't know in advance 
> > exactly how they were going to be called.
> 
> Still seems strange to me, why not use wait-queues for the first cut?
> 
> Only if you find a performance issue with wait-queues, which cannot be
> fixed in the wait-queue proper, then do you do custom thingies.
> 
> Starting with a custom sleeper, just doesn't make sense to me.

I really don't remember.  Felipe says that the ancient history shows
the initial implementation did use a wait-queue, and then it was
changed.  Perhaps I was imitating the structure of
scsi_error_handler().

> > The problem may be that when the thread wakes up (or skips going to 
> > sleep), it needs to see more than just bh->state.  Those other values 
> > it needs are not written by the same CPU that calls wakeup_thread(), 
> > and so to ensure that they are visible that smp_wmb() really ought to 
> > be smp_mb() (and correspondingly in the thread.  That's what Felipe has 
> > been testing.
> 
> So you're saying something like:
> 
> 
>   CPU0CPU1CPU2
> 
>   X = 1   sleep_thread()
>   wakeup_thread()
>   r = X
> 
> But how does CPU1 know to do the wakeup? That is, how are CPU0 and CPU1
> coupled.

As mentioned later on, "CPU0" is actually a DMA master, not another 
CPU.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-05 Thread Will Deacon

On Sat, Sep 03, 2016 at 12:16:29AM +0200, Peter Zijlstra wrote:
> On Sat, Sep 03, 2016 at 12:14:13AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote:
> > > 
> > > Actually, that's not entirely true (although presumably it works okay
> > > for most architectures).
> > 
> > Yeah, all load-store archs (with exception of PowerPC and ARM64 and
> > possibly MIPS) implement ACQUIRE with a general fence (after the ll/sc).
> > 
> > ( and MIPS doesn't use their fancy barriers in Linux )
> > 
> > PowerPC does the full fence for smp_mb__before_spinlock, which leaves
> > ARM64, I'm not sure its correct, but I'm way too tired to think about
> > that now.
> > 
> > The TSO archs imply full barriers with all atomic RmW ops and are
> > therefore also good.
> > 
> 
> Forgot to Cc Will. Will, does ARM64 need to make smp_mb__before_spinlock
> smp_mb() too?

Yes, probably. Just to confirm, the test is something like:


CPU0


Wx=1
smp_mb__before_spinlock()
LOCK(y)
Rz=0

CPU1


Wz=1
smp_mb()
Rx=0


and that should be forbidden?

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-05 Thread Peter Zijlstra

On Sat, Sep 03, 2016 at 10:49:39AM -0400, Alan Stern wrote:
> On Sat, 3 Sep 2016, Alan Stern wrote:
> 
> > In other words, we have:
> > 
> > CPU 0   CPU 1
> > -   -
> > Start DMA   Handle DMA-complete irq
> > Sleep until bh->state   Set bh->state
> > smp_wmb()
> > Wake up CPU 0
> > smp_rmb()
> > Compute rc based on contents
> > of the DMA buffer
> > 
> > This was written many years ago, at a time when I did not fully
> > understand all the details of memory ordering.  Do you agree that both
> > of those barriers should really be smp_mb()?  That's what Felipe has
> > been testing.
> 
> Actually, seeing it written out like this, one realizes that it really 
> ought to be:
> 
>   CPU 0   CPU 1
> - -
> Start DMA Handle DMA-complete irq
> Sleep until bh->state smp_mb()
>   set bh->state
>   Wake up CPU 0
>   smp_mb()
>   Compute rc based on contents of the DMA buffer
> 
> (Bear in mind also that on some platforms, the I/O operation is carried 
> out by PIO rather than DMA.)

I'm sorry, but I still don't follow. This could be because I seldom
interact with DMA agents and therefore am not familiar with that stuff.

Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC.
Its only defined to do CPU/CPU interactions.

I would very much expect the device IO stuff to order things for us in
this case. "Start DMA" should very much include sufficient fences to
ensure the data under DMA is visible to the DMA engine, this would very
much include things like flushing store buffers and maybe even writeback
caches, depending on platform needs.

At the same time, I would expect "Handle DMA-complete irq", even if it
were done with a PIO polling loop, to guarantee ordering against later
operations such that 'complete' really means that.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-05 Thread Peter Zijlstra

On Sat, Sep 03, 2016 at 04:51:07PM +0300, Felipe Balbi wrote:
> > That said, I cannot spot an obvious fail,
> 
> okay, but a fail does exist. Any hints on what extra information I could
> capture to help figuring this one out?

More information on which sleep is not waking woudl help I suppose. That
greatly limits the amount of code one has to stare at.

Also, a better picture of all threads involved in the wakeup. Alan seems
to suggest there's multiple CPUs involved in the wakeup.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-05 Thread Peter Zijlstra

On Sat, Sep 03, 2016 at 10:16:31AM -0400, Alan Stern wrote:

> > Sorry, but that is horrible code. A barrier cannot ensure writes are
> > 'complete', at best they can ensure order between writes (or reads
> > etc..).
> 
> The code is better than the comment.  What I really meant was that the 
> write of bh->state needs to be visible to the thread after it wakes up 
> (or after it checks the wakeup condition and skips going to sleep).

Yeah, I got that.

> > Also, looking at that thing, that common->thread_wakeup_needed variable
> > is 100% redundant. All sleep_thread() invocations are inside a loop of
> > sorts and basically wait for other conditions to become true.
> > 
> > For example:
> > 
> > while (bh->state != BUF_STATE_EMPTY) {
> > rc = sleep_thread(common, false);
> > if (rc)
> > return rc;
> > }
> > 
> > All you care about there is bh->state, _not_
> > common->thread_wakeup_needed.
> 
> You know, I never went through and verified that _all_ the invocations 
> of sleep_thread() are like that. 

Well, thing is, they're all inside a loop which checks other conditions
for forward progress. Therefore the loop inside sleep_thread() is
pointless. Even if you were to return early, you'd simply loop in the
outer loop and go back to sleep again.

> In fact, I wrote the sleep/wakeup 
> routines _before_ the rest of the code, and I didn't know in advance 
> exactly how they were going to be called.

Still seems strange to me, why not use wait-queues for the first cut?

Only if you find a performance issue with wait-queues, which cannot be
fixed in the wait-queue proper, then do you do custom thingies.

Starting with a custom sleeper, just doesn't make sense to me.

> > That said, I cannot spot an obvious fail, but the code can certainly use
> > help.
> 
> The problem may be that when the thread wakes up (or skips going to 
> sleep), it needs to see more than just bh->state.  Those other values 
> it needs are not written by the same CPU that calls wakeup_thread(), 
> and so to ensure that they are visible that smp_wmb() really ought to 
> be smp_mb() (and correspondingly in the thread.  That's what Felipe has 
> been testing.

So you're saying something like:


CPU0CPU1CPU2

X = 1   sleep_thread()
wakeup_thread()
r = X

But how does CPU1 know to do the wakeup? That is, how are CPU0 and CPU1
coupled.


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Alan Stern

On Sat, 3 Sep 2016, Alan Stern wrote:

> In other words, we have:
> 
>   CPU 0   CPU 1
>   -   -
>   Start DMA   Handle DMA-complete irq
>   Sleep until bh->state   Set bh->state
>   smp_wmb()
>   Wake up CPU 0
>   smp_rmb()
>   Compute rc based on contents
>   of the DMA buffer
> 
> This was written many years ago, at a time when I did not fully
> understand all the details of memory ordering.  Do you agree that both
> of those barriers should really be smp_mb()?  That's what Felipe has
> been testing.

Actually, seeing it written out like this, one realizes that it really 
ought to be:

CPU 0   CPU 1
-   -
Start DMA   Handle DMA-complete irq
Sleep until bh->state   smp_mb()
set bh->state
Wake up CPU 0
smp_mb()
Compute rc based on contents of the DMA buffer

(Bear in mind also that on some platforms, the I/O operation is carried 
out by PIO rather than DMA.)

Also, the smp_wmb() in bulk_out_complete() looks unnecessary.  I can't 
remember why I put it there originally.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Alan Stern

On Sat, 3 Sep 2016, Peter Zijlstra wrote:

> On Fri, Sep 02, 2016 at 04:29:19PM -0400, Alan Stern wrote:
> > I'm afraid so.  The code doesn't use wait_event(), in part because
> > there's no wait_queue (since only one task is involved).
> 
> You can use wait_queue fine with just one task, and it would clean up
> the code tremendously.
> 
> You can replace things like the earlier mentioned:
> 
>   while (bh->state != BUF_STATE_EMPTY) {
>   rc = sleep_thread(common, false);
>   if (rc)
>   return rc;
>   }
> 
> with:
> 
>   rc = wait_event_interruptible(>wq, bh->state == 
> BUF_STATE_EMPTY);
>   if (rc)
>   return rc;

If someone wants to devote time and effort to cleaning up the driver, 
that would be a good start.

> > But maybe there's another barrier which needs to be fixed.  Felipe, can
> > you check to see if received_cbw() is getting called in
> > get_next_command(), and if so, what value it returns?  Or is the
> > preceding sleep_thread() the one that never wakes up?
> > 
> > It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb().  
> > The reason being that get_next_command() runs outside the protection of 
> > the spinlock.
> 
> Being somewhat confused by the code, I fail to follow that argument.
> wakeup_thread() is always called under that spinlock(), but since the
> critical section is 2 stores, I fail to see how a smp_mb() can make any
> difference over the smp_wmb() already there.

But sleep_thread() and the code that follows it are _not_ called under
the spinlock.  And the following code examines values that were written
by DMA, not by the CPU calling wakeup_thread().  (Although that CPU
_is_ the one that receives the DMA-completion notice.)

In other words, we have:

CPU 0   CPU 1
-   -
Start DMA   Handle DMA-complete irq
Sleep until bh->state   Set bh->state
smp_wmb()
Wake up CPU 0
smp_rmb()
Compute rc based on contents
of the DMA buffer

This was written many years ago, at a time when I did not fully
understand all the details of memory ordering.  Do you agree that both
of those barriers should really be smp_mb()?  That's what Felipe has
been testing.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Alan Stern

On Sat, 3 Sep 2016, Peter Zijlstra wrote:

> On Sat, Sep 03, 2016 at 09:58:09AM +0300, Felipe Balbi wrote:
> 
> > > What arch are you seeing this on?
> > 
> > x86. Skylake to be exact.
> 
> So it _cannot_ be the thing Alan mentioned. By the simple fact that
> spin_lock() is a full barrier on x86 (every LOCK prefixed instruction
> is).

True, my guess was wrong.

> > The following change survived through the night:
> > 
> > diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
> > b/drivers/usb/gadget/function/f_mass_storage.c
> > index 8f3659b65f53..d31581dd5ce5 100644
> > --- a/drivers/usb/gadget/function/f_mass_storage.c
> > +++ b/drivers/usb/gadget/function/f_mass_storage.c
> > @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct 
> > usb_ep *ep)
> >  /* Caller must hold fsg->lock */
> >  static void wakeup_thread(struct fsg_common *common)
> >  {
> > -   smp_wmb();  /* ensure the write of bh->state is complete */
> > +   smp_mb();   /* ensure the write of bh->state is complete */
> > /* Tell the main thread that something has happened */
> > common->thread_wakeup_needed = 1;
> > if (common->thread_task)
> > @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool 
> > can_freeze)
> > }
> > __set_current_state(TASK_RUNNING);
> > common->thread_wakeup_needed = 0;
> > -   smp_rmb();  /* ensure the latest bh->state is visible */
> > +   smp_mb();   /* ensure the latest bh->state is visible */
> > return rc;
> >  }
> 
> Sorry, but that is horrible code. A barrier cannot ensure writes are
> 'complete', at best they can ensure order between writes (or reads
> etc..).

The code is better than the comment.  What I really meant was that the 
write of bh->state needs to be visible to the thread after it wakes up 
(or after it checks the wakeup condition and skips going to sleep).

> Also, looking at that thing, that common->thread_wakeup_needed variable
> is 100% redundant. All sleep_thread() invocations are inside a loop of
> sorts and basically wait for other conditions to become true.
> 
> For example:
> 
>   while (bh->state != BUF_STATE_EMPTY) {
>   rc = sleep_thread(common, false);
>   if (rc)
>   return rc;
>   }
> 
> All you care about there is bh->state, _not_
> common->thread_wakeup_needed.

You know, I never went through and verified that _all_ the invocations 
of sleep_thread() are like that.  In fact, I wrote the sleep/wakeup 
routines _before_ the rest of the code, and I didn't know in advance 
exactly how they were going to be called.

> That said, I cannot spot an obvious fail, but the code can certainly use
> help.

The problem may be that when the thread wakes up (or skips going to 
sleep), it needs to see more than just bh->state.  Those other values 
it needs are not written by the same CPU that calls wakeup_thread(), 
and so to ensure that they are visible that smp_wmb() really ought to 
be smp_mb() (and correspondingly in the thread.  That's what Felipe has 
been testing.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Felipe Balbi


Hi,

Peter Zijlstra  writes:
> On Sat, Sep 03, 2016 at 09:58:09AM +0300, Felipe Balbi wrote:
>
>> > What arch are you seeing this on?
>> 
>> x86. Skylake to be exact.
>
> So it _cannot_ be the thing Alan mentioned. By the simple fact that
> spin_lock() is a full barrier on x86 (every LOCK prefixed instruction
> is).

I still have this working even after 15 hours of runtime on a test case
that was failing consistently within few minutes. At a minimum smp_mb()
has some side effect which is hiding the actual problem.

>> The following change survived through the night:
>> 
>> diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
>> b/drivers/usb/gadget/function/f_mass_storage.c
>> index 8f3659b65f53..d31581dd5ce5 100644
>> --- a/drivers/usb/gadget/function/f_mass_storage.c
>> +++ b/drivers/usb/gadget/function/f_mass_storage.c
>> @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct 
>> usb_ep *ep)
>>  /* Caller must hold fsg->lock */
>>  static void wakeup_thread(struct fsg_common *common)
>>  {
>> -smp_wmb();  /* ensure the write of bh->state is complete */
>> +smp_mb();   /* ensure the write of bh->state is complete */
>>  /* Tell the main thread that something has happened */
>>  common->thread_wakeup_needed = 1;
>>  if (common->thread_task)
>> @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool 
>> can_freeze)
>>  }
>>  __set_current_state(TASK_RUNNING);
>>  common->thread_wakeup_needed = 0;
>> -smp_rmb();  /* ensure the latest bh->state is visible */
>> +smp_mb();   /* ensure the latest bh->state is visible */
>>  return rc;
>>  }
>
> Sorry, but that is horrible code. A barrier cannot ensure writes are
> 'complete', at best they can ensure order between writes (or reads
> etc..).

not arguing ;-)

> Also, looking at that thing, that common->thread_wakeup_needed variable
> is 100% redundant. All sleep_thread() invocations are inside a loop of
> sorts and basically wait for other conditions to become true.
>
> For example:
>
>   while (bh->state != BUF_STATE_EMPTY) {
>   rc = sleep_thread(common, false);
>   if (rc)
>   return rc;
>   }

right

> All you care about there is bh->state, _not_
> common->thread_wakeup_needed.
>
> That said, I cannot spot an obvious fail,

okay, but a fail does exist. Any hints on what extra information I could
capture to help figuring this one out?

> but the code can certainly use help.

Sure, that can be done for v4.9 (if I have time) or v4.10 merge
window. Meanwhile, we're trying to find a minimal fix for the -rc which
can also be backported to stable, right?

-- 
balbi
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Peter Zijlstra

On Sat, Sep 03, 2016 at 09:58:09AM +0300, Felipe Balbi wrote:

> > What arch are you seeing this on?
> 
> x86. Skylake to be exact.

So it _cannot_ be the thing Alan mentioned. By the simple fact that
spin_lock() is a full barrier on x86 (every LOCK prefixed instruction
is).

> The following change survived through the night:
> 
> diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
> b/drivers/usb/gadget/function/f_mass_storage.c
> index 8f3659b65f53..d31581dd5ce5 100644
> --- a/drivers/usb/gadget/function/f_mass_storage.c
> +++ b/drivers/usb/gadget/function/f_mass_storage.c
> @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct 
> usb_ep *ep)
>  /* Caller must hold fsg->lock */
>  static void wakeup_thread(struct fsg_common *common)
>  {
> - smp_wmb();  /* ensure the write of bh->state is complete */
> + smp_mb();   /* ensure the write of bh->state is complete */
>   /* Tell the main thread that something has happened */
>   common->thread_wakeup_needed = 1;
>   if (common->thread_task)
> @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool 
> can_freeze)
>   }
>   __set_current_state(TASK_RUNNING);
>   common->thread_wakeup_needed = 0;
> - smp_rmb();  /* ensure the latest bh->state is visible */
> + smp_mb();   /* ensure the latest bh->state is visible */
>   return rc;
>  }

Sorry, but that is horrible code. A barrier cannot ensure writes are
'complete', at best they can ensure order between writes (or reads
etc..).

Also, looking at that thing, that common->thread_wakeup_needed variable
is 100% redundant. All sleep_thread() invocations are inside a loop of
sorts and basically wait for other conditions to become true.

For example:

while (bh->state != BUF_STATE_EMPTY) {
rc = sleep_thread(common, false);
if (rc)
return rc;
}

All you care about there is bh->state, _not_
common->thread_wakeup_needed.

That said, I cannot spot an obvious fail, but the code can certainly use
help.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Peter Zijlstra

On Fri, Sep 02, 2016 at 04:29:19PM -0400, Alan Stern wrote:
> I'm afraid so.  The code doesn't use wait_event(), in part because
> there's no wait_queue (since only one task is involved).

You can use wait_queue fine with just one task, and it would clean up
the code tremendously.

You can replace things like the earlier mentioned:

while (bh->state != BUF_STATE_EMPTY) {
rc = sleep_thread(common, false);
if (rc)
return rc;
}

with:

rc = wait_event_interruptible(>wq, bh->state == 
BUF_STATE_EMPTY);
if (rc)
return rc;

> But maybe there's another barrier which needs to be fixed.  Felipe, can
> you check to see if received_cbw() is getting called in
> get_next_command(), and if so, what value it returns?  Or is the
> preceding sleep_thread() the one that never wakes up?
> 
> It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb().  
> The reason being that get_next_command() runs outside the protection of 
> the spinlock.

Being somewhat confused by the code, I fail to follow that argument.
wakeup_thread() is always called under that spinlock(), but since the
critical section is 2 stores, I fail to see how a smp_mb() can make any
difference over the smp_wmb() already there.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Paul E. McKenney

On Fri, Sep 02, 2016 at 04:29:19PM -0400, Alan Stern wrote:
> On Fri, 2 Sep 2016, Paul E. McKenney wrote:
> 
> > On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote:
> > > Paul, Peter, and Ingo:
> > > 
> > > This must have come up before, but I don't know what was decided.
> > > 
> > > Isn't it often true that a memory barrier is needed before a call to 
> > > wake_up_process()?  A typical scenario might look like this:
> > > 
> > >   CPU 0
> > >   -
> > >   for (;;) {
> > >   set_current_state(TASK_INTERRUPTIBLE);
> > >   if (signal_pending(current))
> > >   break;
> > >   if (wakeup_flag)
> > >   break;
> > >   schedule();
> > >   }
> > >   __set_current_state(TASK_RUNNING);
> > >   wakeup_flag = 0;
> > > 
> > > 
> > >   CPU 1
> > >   -
> > >   wakeup_flag = 1;
> > >   wake_up_process(my_task);
> > > 
> > > The underlying pattern is:
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > >   write current->statewrite wakeup_flag
> > >   smp_mb();
> > >   read wakeup_flagread my_task->state
> > > 
> > > where set_current_state() does the write to current->state and 
> > > automatically adds the smp_mb(), and wake_up_process() reads 
> > > my_task->state to see whether the task needs to be woken up.
> > > 
> > > The kerneldoc for wake_up_process() says that it has no implied memory
> > > barrier if it doesn't actually wake anything up.  And even when it
> > > does, the implied barrier is only smp_wmb, not smp_mb.
> > > 
> > > This is the so-called SB (Store Buffer) pattern, which is well known to
> > > require a full smp_mb on both sides.  Since wake_up_process() doesn't
> > > include smp_mb(), isn't it correct that the caller must add it
> > > explicitly?
> > > 
> > > In other words, shouldn't the code for CPU 1 really be:
> > > 
> > >   wakeup_flag = 1;
> > >   smp_mb();
> > >   wake_up_process(task);
> > > 
> > > If my reasoning is correct, then why doesn't wake_up_process() include 
> > > this memory barrier automatically, the way set_current_state() does?  
> > > There could be an alternate version (__wake_up_process()) which omits 
> > > the barrier, just like __set_current_state().
> > 
> > A common case uses locking, in which case additional memory barriers
> > inside of the wait/wakeup functions are not needed.  Any accesses made
> > while holding the lock before invoking the wakeup function (e.g.,
> > wake_up()) are guaranteed to be seen after acquiring that same
> > lock following return from the wait function (e.g., wait_event()).
> > In this case, adding barriers to the wait and wakeup functions would
> > just add overhead.
> > 
> > But yes, this decision does mean that people using the wait/wakeup
> > functions without locking need to be more careful.  Something like
> > this:
> > 
> > /* prior accesses. */
> > smp_mb();
> > wakeup_flag = 1;
> > wake_up(...);
> > 
> > And on the other task:
> > 
> > wait_event(... wakeup_flag == 1 ...);
> > smp_mb();
> > /* The waker's prior accesses will be visible here. */
> > 
> > Or am I missing your point?
> 
> I'm afraid so.  The code doesn't use wait_event(), in part because
> there's no wait_queue (since only one task is involved).

Ah, got it.

The required pattern should be very similar, however.

> But maybe there's another barrier which needs to be fixed.  Felipe, can
> you check to see if received_cbw() is getting called in
> get_next_command(), and if so, what value it returns?  Or is the
> preceding sleep_thread() the one that never wakes up?
> 
> It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb().  
> The reason being that get_next_command() runs outside the protection of 
> the spinlock.

This sounds very likely to me.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-03 Thread Felipe Balbi


hi,

Peter Zijlstra  writes:
> On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote:
>> Felipe, your tests will show whether my guess was totally off-base.  
>> For the new people, Felipe is tracking down a problem that involves
>> exactly the code sequence listed at the top of the email, where we know
>> that the wakeup routine runs but nevertheless the task sleeps.  At
>> least, that's what it looks like at the moment.
>
> What arch are you seeing this on?

x86. Skylake to be exact.

The following change survived through the night:

diff --git a/drivers/usb/gadget/function/f_mass_storage.c 
b/drivers/usb/gadget/function/f_mass_storage.c
index 8f3659b65f53..d31581dd5ce5 100644
--- a/drivers/usb/gadget/function/f_mass_storage.c
+++ b/drivers/usb/gadget/function/f_mass_storage.c
@@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct usb_ep 
*ep)
 /* Caller must hold fsg->lock */
 static void wakeup_thread(struct fsg_common *common)
 {
-   smp_wmb();  /* ensure the write of bh->state is complete */
+   smp_mb();   /* ensure the write of bh->state is complete */
/* Tell the main thread that something has happened */
common->thread_wakeup_needed = 1;
if (common->thread_task)
@@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool 
can_freeze)
}
__set_current_state(TASK_RUNNING);
common->thread_wakeup_needed = 0;
-   smp_rmb();  /* ensure the latest bh->state is visible */
+   smp_mb();   /* ensure the latest bh->state is visible */
return rc;
 }
 

I'll keep it running until Monday at least

-- 
balbi
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-02 Thread Peter Zijlstra

On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote:
> Felipe, your tests will show whether my guess was totally off-base.  
> For the new people, Felipe is tracking down a problem that involves
> exactly the code sequence listed at the top of the email, where we know
> that the wakeup routine runs but nevertheless the task sleeps.  At
> least, that's what it looks like at the moment.

What arch are you seeing this on?
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-02 Thread Peter Zijlstra

On Sat, Sep 03, 2016 at 12:14:13AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote:
> > 
> > Actually, that's not entirely true (although presumably it works okay
> > for most architectures).
> 
> Yeah, all load-store archs (with exception of PowerPC and ARM64 and
> possibly MIPS) implement ACQUIRE with a general fence (after the ll/sc).
> 
> ( and MIPS doesn't use their fancy barriers in Linux )
> 
> PowerPC does the full fence for smp_mb__before_spinlock, which leaves
> ARM64, I'm not sure its correct, but I'm way too tired to think about
> that now.
> 
> The TSO archs imply full barriers with all atomic RmW ops and are
> therefore also good.
> 

Forgot to Cc Will. Will, does ARM64 need to make smp_mb__before_spinlock
smp_mb() too?
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-02 Thread Peter Zijlstra

On Fri, Sep 02, 2016 at 04:16:54PM -0400, Alan Stern wrote:
> 
> Actually, that's not entirely true (although presumably it works okay
> for most architectures).

Yeah, all load-store archs (with exception of PowerPC and ARM64 and
possibly MIPS) implement ACQUIRE with a general fence (after the ll/sc).

( and MIPS doesn't use their fancy barriers in Linux )

PowerPC does the full fence for smp_mb__before_spinlock, which leaves
ARM64, I'm not sure its correct, but I'm way too tired to think about
that now.

The TSO archs imply full barriers with all atomic RmW ops and are
therefore also good.

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-02 Thread Alan Stern

On Fri, 2 Sep 2016, Paul E. McKenney wrote:

> On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote:
> > Paul, Peter, and Ingo:
> > 
> > This must have come up before, but I don't know what was decided.
> > 
> > Isn't it often true that a memory barrier is needed before a call to 
> > wake_up_process()?  A typical scenario might look like this:
> > 
> > CPU 0
> > -
> > for (;;) {
> > set_current_state(TASK_INTERRUPTIBLE);
> > if (signal_pending(current))
> > break;
> > if (wakeup_flag)
> > break;
> > schedule();
> > }
> > __set_current_state(TASK_RUNNING);
> > wakeup_flag = 0;
> > 
> > 
> > CPU 1
> > -
> > wakeup_flag = 1;
> > wake_up_process(my_task);
> > 
> > The underlying pattern is:
> > 
> > CPU 0   CPU 1
> > -   -
> > write current->statewrite wakeup_flag
> > smp_mb();
> > read wakeup_flagread my_task->state
> > 
> > where set_current_state() does the write to current->state and 
> > automatically adds the smp_mb(), and wake_up_process() reads 
> > my_task->state to see whether the task needs to be woken up.
> > 
> > The kerneldoc for wake_up_process() says that it has no implied memory
> > barrier if it doesn't actually wake anything up.  And even when it
> > does, the implied barrier is only smp_wmb, not smp_mb.
> > 
> > This is the so-called SB (Store Buffer) pattern, which is well known to
> > require a full smp_mb on both sides.  Since wake_up_process() doesn't
> > include smp_mb(), isn't it correct that the caller must add it
> > explicitly?
> > 
> > In other words, shouldn't the code for CPU 1 really be:
> > 
> > wakeup_flag = 1;
> > smp_mb();
> > wake_up_process(task);
> > 
> > If my reasoning is correct, then why doesn't wake_up_process() include 
> > this memory barrier automatically, the way set_current_state() does?  
> > There could be an alternate version (__wake_up_process()) which omits 
> > the barrier, just like __set_current_state().
> 
> A common case uses locking, in which case additional memory barriers
> inside of the wait/wakeup functions are not needed.  Any accesses made
> while holding the lock before invoking the wakeup function (e.g.,
> wake_up()) are guaranteed to be seen after acquiring that same
> lock following return from the wait function (e.g., wait_event()).
> In this case, adding barriers to the wait and wakeup functions would
> just add overhead.
> 
> But yes, this decision does mean that people using the wait/wakeup
> functions without locking need to be more careful.  Something like
> this:
> 
>   /* prior accesses. */
>   smp_mb();
>   wakeup_flag = 1;
>   wake_up(...);
> 
> And on the other task:
> 
>   wait_event(... wakeup_flag == 1 ...);
>   smp_mb();
>   /* The waker's prior accesses will be visible here. */
> 
> Or am I missing your point?

I'm afraid so.  The code doesn't use wait_event(), in part because
there's no wait_queue (since only one task is involved).

But maybe there's another barrier which needs to be fixed.  Felipe, can
you check to see if received_cbw() is getting called in
get_next_command(), and if so, what value it returns?  Or is the
preceding sleep_thread() the one that never wakes up?

It could be that the smp_wmb() in wakeup_thread() needs to be smp_mb().  
The reason being that get_next_command() runs outside the protection of 
the spinlock.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-02 Thread Alan Stern

On Fri, 2 Sep 2016, Peter Zijlstra wrote:

> On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote:
> > Paul, Peter, and Ingo:
> > 
> > This must have come up before, but I don't know what was decided.
> > 
> > Isn't it often true that a memory barrier is needed before a call to 
> > wake_up_process()?  A typical scenario might look like this:
> > 
> > CPU 0
> > -
> > for (;;) {
> > set_current_state(TASK_INTERRUPTIBLE);
> > if (signal_pending(current))
> > break;
> > if (wakeup_flag)
> > break;
> > schedule();
> > }
> > __set_current_state(TASK_RUNNING);
> > wakeup_flag = 0;
> > 
> > 
> > CPU 1
> > -
> > wakeup_flag = 1;
> > wake_up_process(my_task);
> > 
> > The underlying pattern is:
> > 
> > CPU 0   CPU 1
> > -   -
> > write current->statewrite wakeup_flag
> > smp_mb();
> > read wakeup_flagread my_task->state
> > 
> > where set_current_state() does the write to current->state and 
> > automatically adds the smp_mb(), and wake_up_process() reads 
> > my_task->state to see whether the task needs to be woken up.
> > 
> > The kerneldoc for wake_up_process() says that it has no implied memory
> > barrier if it doesn't actually wake anything up.  And even when it
> > does, the implied barrier is only smp_wmb, not smp_mb.
> > 
> > This is the so-called SB (Store Buffer) pattern, which is well known to
> > require a full smp_mb on both sides.  Since wake_up_process() doesn't
> > include smp_mb(), isn't it correct that the caller must add it
> > explicitly?
> > 
> > In other words, shouldn't the code for CPU 1 really be:
> > 
> > wakeup_flag = 1;
> > smp_mb();
> > wake_up_process(task);
> > 
> 
> No, it doesn't need to do that. try_to_wake_up() does the right thing.
> 
> It does:
> 
>   smp_mb__before_spinlock();
>   raw_spin_lock_irqsave(>pi_lock);
> 
> Now, smp_mb__before_spinlock() is a bit of an odd duck, if you look at
> its comment it says:
> 
> /*
>  * Despite its name it doesn't necessarily has to be a full barrier.
>  * It should only guarantee that a STORE before the critical section
>  * can not be reordered with LOADs and STOREs inside this section.
>  * spin_lock() is the one-way barrier, this LOAD can not escape out
>  * of the region. So the default implementation simply ensures that
>  * a STORE can not move into the critical section, smp_wmb() should
>  * serialize it with another STORE done by spin_lock().
>  */
> #ifndef smp_mb__before_spinlock
> #define smp_mb__before_spinlock() smp_wmb()
> #endif

I see.  So the kerneldoc for wake_up_process() is misleading at best.

> So per default it ends up being:
> 
>   WMB
>   LOCK
> 
> Which is sufficient to order the prior store vs the later load as is
> required. Note that a spinlock acquire _must_ imply a store (we need to
> mark the lock as taken), therefore the prior store is ordered against
> the lock store per the wmb, and since the lock must imply an ACQUIRE
> that limits the load.

Actually, that's not entirely true (although presumably it works okay
for most architectures).  In theory, the first write and the WMB can
move down after the lock's ACQUIRE.  Then the later read could move
earlier, before the lock's store, the WMB, and the first write, but
still after the ACQUIRE.  Thus the later read could be reordered with
the first write.

In other words, these actions:

store wakeup_flag
WMB
load-ACQUIRE lock
store lock
read task->state

can be reordered to:

load-ACQUIRE lock
store wakeup_flag
WMB
store lock
read task->state

and then to:

load-ACQUIRE lock
read task->state
store wakeup_flag
WMB
store lock

without violating any ordering rules.

> Now, PowerPC defines smp_mb__before_spinlock as smp_mb(), and this is
> because PowerPC ACQUIRE is a bit of an exception, if you want more
> details I'm sure I or Paul can dredge them up :-)

PPC can't do the reordering described above, but it does have other
weaknesses.  Without smp_mb(), the write to wakeup_flag doesn't have to
propagate from one CPU to the other before the second CPU tries to read
it.

Felipe, your tests will show whether my guess was totally off-base.  
For the new people, Felipe is tracking down a problem that involves
exactly the code sequence listed at the top of the email, where we know
that the wakeup routine runs but nevertheless the task sleeps.  At
least, that's what it looks like at the moment.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-02 Thread Peter Zijlstra

On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote:
> Paul, Peter, and Ingo:
> 
> This must have come up before, but I don't know what was decided.
> 
> Isn't it often true that a memory barrier is needed before a call to 
> wake_up_process()?  A typical scenario might look like this:
> 
>   CPU 0
>   -
>   for (;;) {
>   set_current_state(TASK_INTERRUPTIBLE);
>   if (signal_pending(current))
>   break;
>   if (wakeup_flag)
>   break;
>   schedule();
>   }
>   __set_current_state(TASK_RUNNING);
>   wakeup_flag = 0;
> 
> 
>   CPU 1
>   -
>   wakeup_flag = 1;
>   wake_up_process(my_task);
> 
> The underlying pattern is:
> 
>   CPU 0   CPU 1
>   -   -
>   write current->statewrite wakeup_flag
>   smp_mb();
>   read wakeup_flagread my_task->state
> 
> where set_current_state() does the write to current->state and 
> automatically adds the smp_mb(), and wake_up_process() reads 
> my_task->state to see whether the task needs to be woken up.
> 
> The kerneldoc for wake_up_process() says that it has no implied memory
> barrier if it doesn't actually wake anything up.  And even when it
> does, the implied barrier is only smp_wmb, not smp_mb.
> 
> This is the so-called SB (Store Buffer) pattern, which is well known to
> require a full smp_mb on both sides.  Since wake_up_process() doesn't
> include smp_mb(), isn't it correct that the caller must add it
> explicitly?
> 
> In other words, shouldn't the code for CPU 1 really be:
> 
>   wakeup_flag = 1;
>   smp_mb();
>   wake_up_process(task);
> 

No, it doesn't need to do that. try_to_wake_up() does the right thing.

It does:

smp_mb__before_spinlock();
raw_spin_lock_irqsave(>pi_lock);

Now, smp_mb__before_spinlock() is a bit of an odd duck, if you look at
its comment it says:

/*
 * Despite its name it doesn't necessarily has to be a full barrier.
 * It should only guarantee that a STORE before the critical section
 * can not be reordered with LOADs and STOREs inside this section.
 * spin_lock() is the one-way barrier, this LOAD can not escape out
 * of the region. So the default implementation simply ensures that
 * a STORE can not move into the critical section, smp_wmb() should
 * serialize it with another STORE done by spin_lock().
 */
#ifndef smp_mb__before_spinlock
#define smp_mb__before_spinlock()   smp_wmb()
#endif

So per default it ends up being:

WMB
LOCK

Which is sufficient to order the prior store vs the later load as is
required. Note that a spinlock acquire _must_ imply a store (we need to
mark the lock as taken), therefore the prior store is ordered against
the lock store per the wmb, and since the lock must imply an ACQUIRE
that limits the load.

Now, PowerPC defines smp_mb__before_spinlock as smp_mb(), and this is
because PowerPC ACQUIRE is a bit of an exception, if you want more
details I'm sure I or Paul can dredge them up :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Memory barrier needed with wake_up_process()?

2016-09-02 Thread Paul E. McKenney

On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote:
> Paul, Peter, and Ingo:
> 
> This must have come up before, but I don't know what was decided.
> 
> Isn't it often true that a memory barrier is needed before a call to 
> wake_up_process()?  A typical scenario might look like this:
> 
>   CPU 0
>   -
>   for (;;) {
>   set_current_state(TASK_INTERRUPTIBLE);
>   if (signal_pending(current))
>   break;
>   if (wakeup_flag)
>   break;
>   schedule();
>   }
>   __set_current_state(TASK_RUNNING);
>   wakeup_flag = 0;
> 
> 
>   CPU 1
>   -
>   wakeup_flag = 1;
>   wake_up_process(my_task);
> 
> The underlying pattern is:
> 
>   CPU 0   CPU 1
>   -   -
>   write current->statewrite wakeup_flag
>   smp_mb();
>   read wakeup_flagread my_task->state
> 
> where set_current_state() does the write to current->state and 
> automatically adds the smp_mb(), and wake_up_process() reads 
> my_task->state to see whether the task needs to be woken up.
> 
> The kerneldoc for wake_up_process() says that it has no implied memory
> barrier if it doesn't actually wake anything up.  And even when it
> does, the implied barrier is only smp_wmb, not smp_mb.
> 
> This is the so-called SB (Store Buffer) pattern, which is well known to
> require a full smp_mb on both sides.  Since wake_up_process() doesn't
> include smp_mb(), isn't it correct that the caller must add it
> explicitly?
> 
> In other words, shouldn't the code for CPU 1 really be:
> 
>   wakeup_flag = 1;
>   smp_mb();
>   wake_up_process(task);
> 
> If my reasoning is correct, then why doesn't wake_up_process() include 
> this memory barrier automatically, the way set_current_state() does?  
> There could be an alternate version (__wake_up_process()) which omits 
> the barrier, just like __set_current_state().

A common case uses locking, in which case additional memory barriers
inside of the wait/wakeup functions are not needed.  Any accesses made
while holding the lock before invoking the wakeup function (e.g.,
wake_up()) are guaranteed to be seen after acquiring that same
lock following return from the wait function (e.g., wait_event()).
In this case, adding barriers to the wait and wakeup functions would
just add overhead.

But yes, this decision does mean that people using the wait/wakeup
functions without locking need to be more careful.  Something like
this:

/* prior accesses. */
smp_mb();
wakeup_flag = 1;
wake_up(...);

And on the other task:

wait_event(... wakeup_flag == 1 ...);
smp_mb();
/* The waker's prior accesses will be visible here. */

Or am I missing your point?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Memory barrier needed with wake_up_process()?

2016-09-02 Thread Alan Stern

Paul, Peter, and Ingo:

This must have come up before, but I don't know what was decided.

Isn't it often true that a memory barrier is needed before a call to 
wake_up_process()?  A typical scenario might look like this:

CPU 0
-
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (signal_pending(current))
break;
if (wakeup_flag)
break;
schedule();
}
__set_current_state(TASK_RUNNING);
wakeup_flag = 0;


CPU 1
-
wakeup_flag = 1;
wake_up_process(my_task);

The underlying pattern is:

CPU 0   CPU 1
-   -
write current->statewrite wakeup_flag
smp_mb();
read wakeup_flagread my_task->state

where set_current_state() does the write to current->state and 
automatically adds the smp_mb(), and wake_up_process() reads 
my_task->state to see whether the task needs to be woken up.

The kerneldoc for wake_up_process() says that it has no implied memory
barrier if it doesn't actually wake anything up.  And even when it
does, the implied barrier is only smp_wmb, not smp_mb.

This is the so-called SB (Store Buffer) pattern, which is well known to
require a full smp_mb on both sides.  Since wake_up_process() doesn't
include smp_mb(), isn't it correct that the caller must add it
explicitly?

In other words, shouldn't the code for CPU 1 really be:

wakeup_flag = 1;
smp_mb();
wake_up_process(task);

If my reasoning is correct, then why doesn't wake_up_process() include 
this memory barrier automatically, the way set_current_state() does?  
There could be an alternate version (__wake_up_process()) which omits 
the barrier, just like __set_current_state().

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

40 matches

Mail list logo