Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2017-01-02 Thread Baolin Wang
On 2 January 2017 at 22:57, Mathias Nyman  wrote:
> On 27.12.2016 05:07, Baolin Wang wrote:
>>
>> Hi,
>>
>> On 21 December 2016 at 21:00, Mathias Nyman
>>  wrote:
>>>
>>> On 21.12.2016 04:22, Baolin Wang wrote:


 Hi Mathias,

 On 20 December 2016 at 23:13, Mathias Nyman
  wrote:
>
>
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
> timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and  make my
> life
> easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of
> counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
> command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.



 How about applying below patch in your 'timeout_race_fixes' branch?
 [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is
 timeout
 https://lkml.org/lkml/2016/12/13/94

>>>
>>> usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
>>> which will cleanup the command queue. But I need to look into that
>>
>>
>> usb_hc_died() did not call xhci_mem_cleanup() to clean up command
>> queue, it just disconnects all children devices attached on the dying
>> hub. I did not find where it will clean up the command queue when
>> issuing usb_hc_died(). Also before issuing usb_hc_died() in
>> xhci_handle_command_timeout(), we will call
>> xhci_cleanup_command_queue(). I think it should same as in
>> xhci_stop_endpoint_command_watchdog().
>>
>
> You're right, it doesn't call xhci_mem_cleanup.
> Need to look at this after getting first series of fixes to usb-linus

Thanks.

-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2017-01-02 Thread Mathias Nyman

On 27.12.2016 05:07, Baolin Wang wrote:

Hi,

On 21 December 2016 at 21:00, Mathias Nyman
 wrote:

On 21.12.2016 04:22, Baolin Wang wrote:


Hi Mathias,

On 20 December 2016 at 23:13, Mathias Nyman
 wrote:


On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and  make my life
easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of
counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.



How about applying below patch in your 'timeout_race_fixes' branch?
[PATCH] usb: host: xhci: Clean up commands when stop endpoint command is
timeout
https://lkml.org/lkml/2016/12/13/94



usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
which will cleanup the command queue. But I need to look into that


usb_hc_died() did not call xhci_mem_cleanup() to clean up command
queue, it just disconnects all children devices attached on the dying
hub. I did not find where it will clean up the command queue when
issuing usb_hc_died(). Also before issuing usb_hc_died() in
xhci_handle_command_timeout(), we will call
xhci_cleanup_command_queue(). I think it should same as in
xhci_stop_endpoint_command_watchdog().



You're right, it doesn't call xhci_mem_cleanup.
Need to look at this after getting first series of fixes to usb-linus

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-26 Thread Baolin Wang
Hi,

On 21 December 2016 at 21:00, Mathias Nyman
 wrote:
> On 21.12.2016 04:22, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 20 December 2016 at 23:13, Mathias Nyman
>>  wrote:
>>>
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
>>> timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and  make my life
>>> easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of
>>> counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
>>> command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>
>>
>> How about applying below patch in your 'timeout_race_fixes' branch?
>> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is
>> timeout
>> https://lkml.org/lkml/2016/12/13/94
>>
>
> usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
> which will cleanup the command queue. But I need to look into that

usb_hc_died() did not call xhci_mem_cleanup() to clean up command
queue, it just disconnects all children devices attached on the dying
hub. I did not find where it will clean up the command queue when
issuing usb_hc_died(). Also before issuing usb_hc_died() in
xhci_handle_command_timeout(), we will call
xhci_cleanup_command_queue(). I think it should same as in
xhci_stop_endpoint_command_watchdog().

-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-23 Thread Mathias Nyman

On 22.12.2016 03:46, Lu Baolu wrote:

Hi,

On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote:

Mathias Nyman  writes:


We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
is for taking lock for register though, I guess it should be enough just
lock around of read=>write of ->cmd_ring if need lock.

After your patch it should be enough to have the lock only while
reading and writing the cmd_ring register.

If we want a locking fix that applies more easily to older stable
releases before your change then the lock needs to cover set
CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
checking CRR bit.  Otherwise the stop cmd ring interrupt handler may
restart the ring just before we start checing CRR. The stop cmd ring
interrupt will set the CMD_RING_STATE_ABORTED to
CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
handler.

Just for record (no chance to make patch I myself for now, sorry), while
checking locking slightly, I noticed unrelated missing locking.

xhci_cleanup_command_queue()

We are calling it without locking, but we need to lock for accessing list.


Yeah. I can make the patch.



Force updated timeout_race_fixes branch.

I'll be out for a few days during xmas, continue testing after that

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread Lu Baolu
Hi,

On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote:
> Mathias Nyman  writes:
>
>>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>>> is for taking lock for register though, I guess it should be enough just
>>> lock around of read=>write of ->cmd_ring if need lock.
>> After your patch it should be enough to have the lock only while
>> reading and writing the cmd_ring register.
>>
>> If we want a locking fix that applies more easily to older stable
>> releases before your change then the lock needs to cover set
>> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
>> checking CRR bit.  Otherwise the stop cmd ring interrupt handler may
>> restart the ring just before we start checing CRR. The stop cmd ring
>> interrupt will set the CMD_RING_STATE_ABORTED to
>> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
>> handler.
> Just for record (no chance to make patch I myself for now, sorry), while
> checking locking slightly, I noticed unrelated missing locking.
>
>   xhci_cleanup_command_queue()
>
> We are calling it without locking, but we need to lock for accessing list.

Yeah. I can make the patch.

Best regards,
Lu Baolu
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread Lu Baolu
Hi,

On 12/21/2016 08:48 PM, Mathias Nyman wrote:
> On 21.12.2016 08:17, Lu Baolu wrote:
>> Hi Mathias,
>>
>> I have some comments for the implementation of xhci_abort_cmd_ring() below.
>>
>> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git 
>>> timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and  make my life 
>>> easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of 
>>> counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort 
>>> command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>>
>>> Still completely untested, (well it compiles)
>>>
>>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>>> devicecommands to test these, I'll try to set that up.
>>>
>>> All additional testing is welcome, especially if you can trigger timeouts
>>> and races
>>>
>>> -Mathias
>>>
>>>
>>
>> Below is the latest code. I put my comments in line.
>>
>>   322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>>   323 {
>>   324 u64 temp_64;
>>   325 int ret;
>>   326
>>   327 xhci_dbg(xhci, "Abort command ring\n");
>>   328
>>   329 reinit_completion(>cmd_ring_stop_completion);
>>   330
>>   331 temp_64 = xhci_read_64(xhci, >op_regs->cmd_ring);
>>   332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>>   333 >op_regs->cmd_ring);
>>
>> We should hold xhci->lock when we are modifying xhci registers
>> at runtime.
>>
>
> Makes sense, but we need to unlock it before sleeping or waiting for 
> completion.
> I need to look into that in more detail.
>
> But this was an issue already before these changes.
>
>> The retry of setting CMD_RING_ABORT is not necessary according to
>> previous discussion. We have cleaned code for second try in
>> xhci_handle_command_timeout(). Need to clean up here as well.
>>
>
> Yes it can be cleaned up as well, but the two cases are a bit different.
> The cleaned up one was about command ring not starting again after it was 
> stopped.
>
> This second try is a workaround for what we thought was the command ring 
> failing
> to stop in the first place, but is most likely due to the race that OGAWA 
> Hirofumi
> fixed.  It races if the stop command ring interrupt happens between writing 
> the abort
> bit and polling for the ring stopped bit. The interrupt hander may start the 
> command
> ring again, and we would believe we failed to stop it in the first place.
>
> This race could probably be fixed by just extending the lock (and preventing
> interrupts) to cover both writing the abort bit and polling for the command 
> ring
> running bit, as you pointed out here previously.
>
> But then again I really like OGAWA Hiroumi's solution that separates the
> command ring stopping from aborting commands and restarting the ring.
>
> The current way of always restarting the command ring as a response to
> a stop command ring command really limits its usage.
>
> So, with this in mind most reasonable would be to
> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
> 3. remove unnecessary second abort try as a separate patch, send to usb-next
> 4. remove polling for the Command ring running (CRR), waiting for completion
>is enough, if completion times out then we can check CRR. for usb-next
>I'll fix the typos these patches would introduce. Fixing old typos can be 
> done as separate
> patches later.

This is exactly the same as what I am thinking of. I will submit the patches 
later.

Best regards,
Lu Baolu
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread Lu Baolu
Hi,

On 12/21/2016 08:57 PM, Mathias Nyman wrote:
> On 21.12.2016 08:57, Lu Baolu wrote:
>> Hi Mathias,
>>
>> I have some comments for the implementation of
>> xhci_handle_command_timeout() as well.
>>
>> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git 
>>> timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and  make my life 
>>> easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of 
>>> counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort 
>>> command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>>
>>> Still completely untested, (well it compiles)
>>>
>>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>>> devicecommands to test these, I'll try to set that up.
>>>
>>> All additional testing is welcome, especially if you can trigger timeouts
>>> and races
>>>
>>> -Mathias
>>>
>>>
>>
>> I post the code below and add my comments in line.
>>
>> 1276 void xhci_handle_command_timeout(struct work_struct *work)
>> 1277 {
>> 1278 struct xhci_hcd *xhci;
>> 1279 int ret;
>> 1280 unsigned long flags;
>> 1281 u64 hw_ring_state;
>> 1282
>> 1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, 
>> cmd_timer);
>> 1284
>> 1285 spin_lock_irqsave(>lock, flags);
>> 1286
>> 1287 /*
>> 1288  * If timeout work is pending, or current_cmd is NULL, it means 
>> we
>> 1289  * raced with command completion. Command is handled so just 
>> return.
>> 1290  */
>> 1291 if (!xhci->current_cmd || 
>> delayed_work_pending(>cmd_timer)) {
>> 1292 spin_unlock_irqrestore(>lock, flags);
>> 1293 return;
>> 1294 }
>> 1295 /* mark this command to be cancelled */
>> 1296 xhci->current_cmd->status = COMP_CMD_ABORT;
>> 1297
>> 1298 /* Make sure command ring is running before aborting it */
>> 1299 hw_ring_state = xhci_read_64(xhci, >op_regs->cmd_ring);
>> 1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
>> 1301 (hw_ring_state & CMD_RING_RUNNING))  {
>> 1302 /* Prevent new doorbell, and start command abort */
>> 1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
>> 1304 spin_unlock_irqrestore(>lock, flags);
>> 1305 xhci_dbg(xhci, "Command timeout\n");
>> 1306 ret = xhci_abort_cmd_ring(xhci);
>> 1307 if (unlikely(ret == -ESHUTDOWN)) {
>> 1308 xhci_err(xhci, "Abort command ring failed\n");
>> 1309 xhci_cleanup_command_queue(xhci);
>> 1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
>> 1311 xhci_dbg(xhci, "xHCI host controller is 
>> dead.\n");
>> 1312 }
>> 1313 return;
>> 1314 }
>> 1315
>> 1316 /* host removed. Bail out */
>> 1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) {
>> 1318 spin_unlock_irqrestore(>lock, flags);
>> 1319 xhci_dbg(xhci, "host removed, ring start fail?\n");
>> 1320 xhci_cleanup_command_queue(xhci);
>> 1321 return;
>> 1322 }
>>
>> I think this part of code should be moved up to line 1295.
>
> The XHCI_STATE_REMOVING and XHCI_STATE_DYING needs a rework,
> I'm working on that.
>
> Basically we want XHCI_STATE_REMOVING to mean that  all devices are going,
> away and driver will be removed. Don't bother with re-calculating available
> bandwidths after every device removal, but do use xhci hardware to disable
> devices cleanly etc.
>
> XHCI_STATE_DYING should mean hardware is not working/responding. Don't
> bother writing any registers or queuing anything. Just return all
> pending and cancelled URBs, notify core we died, and free all allocated 
> memory.

Okay, thanks for the information.

>
>>
>> 1323
>> 1324 /* command timeout on stopped ring, ring can't be aborted */
>> 1325 xhci_dbg(xhci, "Command timeout on stopped ring\n");
>> 1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
>> 1327 spin_unlock_irqrestore(>lock, flags);
>>
>> This part of code is tricky. I have no idea 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread OGAWA Hirofumi
Mathias Nyman  writes:

>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>> is for taking lock for register though, I guess it should be enough just
>> lock around of read=>write of ->cmd_ring if need lock.
>
> After your patch it should be enough to have the lock only while
> reading and writing the cmd_ring register.
>
> If we want a locking fix that applies more easily to older stable
> releases before your change then the lock needs to cover set
> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
> checking CRR bit.  Otherwise the stop cmd ring interrupt handler may
> restart the ring just before we start checing CRR. The stop cmd ring
> interrupt will set the CMD_RING_STATE_ABORTED to
> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
> handler.

Just for record (no chance to make patch I myself for now, sorry), while
checking locking slightly, I noticed unrelated missing locking.

xhci_cleanup_command_queue()

We are calling it without locking, but we need to lock for accessing list.

Thanks.
-- 
OGAWA Hirofumi 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread Mathias Nyman

On 21.12.2016 16:10, OGAWA Hirofumi wrote:

Mathias Nyman  writes:


Below is the latest code. I put my comments in line.

   322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
   323 {
   324 u64 temp_64;
   325 int ret;
   326
   327 xhci_dbg(xhci, "Abort command ring\n");
   328
   329 reinit_completion(>cmd_ring_stop_completion);
   330
   331 temp_64 = xhci_read_64(xhci, >op_regs->cmd_ring);
   332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
   333 >op_regs->cmd_ring);

We should hold xhci->lock when we are modifying xhci registers
at runtime.



Makes sense, but we need to unlock it before sleeping or waiting for
completion.  I need to look into that in more detail.

But this was an issue already before these changes.


We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
is for taking lock for register though, I guess it should be enough just
lock around of read=>write of ->cmd_ring if need lock.


After your patch it should be enough to have the lock only while
reading and writing the cmd_ring register.

If we want a locking fix that applies more easily to older stable releases 
before
your change then the lock needs to cover set CMD_RING_STATE_ABORT, read cmd_reg,
write cmd_reg and busiloop checking CRR bit.  Otherwise the stop cmd ring 
interrupt
handler may restart the ring just before we start checing CRR. The stop cmd ring
interrupt will set the CMD_RING_STATE_ABORTED to CMD_RING_STATE_RUNNING so
ring will really restart in the interrupt handler.



[Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.]


But then again I really like OGAWA Hiroumi's solution that separates the
command ring stopping from aborting commands and restarting the ring.

The current way of always restarting the command ring as a response to
a stop command ring command really limits its usage.

So, with this in mind most reasonable would be to
1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
3. remove unnecessary second abort try as a separate patch, send to usb-next
4. remove polling for the Command ring running (CRR), waiting for completion
 is enough, if completion times out then we can check CRR. for usb-next


I think we should check both of CRR and even of stop completion. Because
CRR and stop completion was not same time (both can be first).


We can keep both, maybe change the order and do the busylooping-checking after
waiting for completion, but thats a optimization for usb-next sometimes later.

Thanks

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread OGAWA Hirofumi
Mathias Nyman  writes:

>> Below is the latest code. I put my comments in line.
>>
>>   322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>>   323 {
>>   324 u64 temp_64;
>>   325 int ret;
>>   326
>>   327 xhci_dbg(xhci, "Abort command ring\n");
>>   328
>>   329 reinit_completion(>cmd_ring_stop_completion);
>>   330
>>   331 temp_64 = xhci_read_64(xhci, >op_regs->cmd_ring);
>>   332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>>   333 >op_regs->cmd_ring);
>>
>> We should hold xhci->lock when we are modifying xhci registers
>> at runtime.
>>
>
> Makes sense, but we need to unlock it before sleeping or waiting for
> completion.  I need to look into that in more detail.
>
> But this was an issue already before these changes.

We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
is for taking lock for register though, I guess it should be enough just
lock around of read=>write of ->cmd_ring if need lock.

[Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.]

> But then again I really like OGAWA Hiroumi's solution that separates the
> command ring stopping from aborting commands and restarting the ring.
>
> The current way of always restarting the command ring as a response to
> a stop command ring command really limits its usage.
>
> So, with this in mind most reasonable would be to
> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
> 3. remove unnecessary second abort try as a separate patch, send to usb-next
> 4. remove polling for the Command ring running (CRR), waiting for completion
> is enough, if completion times out then we can check CRR. for usb-next

I think we should check both of CRR and even of stop completion. Because
CRR and stop completion was not same time (both can be first).

Thanks.
-- 
OGAWA Hirofumi 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread Mathias Nyman

On 21.12.2016 04:22, Baolin Wang wrote:

Hi Mathias,

On 20 December 2016 at 23:13, Mathias Nyman
 wrote:

On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and  make my life
easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of
counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.


How about applying below patch in your 'timeout_race_fixes' branch?
[PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout
https://lkml.org/lkml/2016/12/13/94



usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
which will cleanup the command queue. But I need to look into that

-Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread Mathias Nyman

On 21.12.2016 08:57, Lu Baolu wrote:

Hi Mathias,

I have some comments for the implementation of
xhci_handle_command_timeout() as well.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:

On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and  make my life easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.

Still completely untested, (well it compiles)

Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.

All additional testing is welcome, especially if you can trigger timeouts
and races

-Mathias




I post the code below and add my comments in line.

1276 void xhci_handle_command_timeout(struct work_struct *work)
1277 {
1278 struct xhci_hcd *xhci;
1279 int ret;
1280 unsigned long flags;
1281 u64 hw_ring_state;
1282
1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, 
cmd_timer);
1284
1285 spin_lock_irqsave(>lock, flags);
1286
1287 /*
1288  * If timeout work is pending, or current_cmd is NULL, it means we
1289  * raced with command completion. Command is handled so just 
return.
1290  */
1291 if (!xhci->current_cmd || delayed_work_pending(>cmd_timer)) {
1292 spin_unlock_irqrestore(>lock, flags);
1293 return;
1294 }
1295 /* mark this command to be cancelled */
1296 xhci->current_cmd->status = COMP_CMD_ABORT;
1297
1298 /* Make sure command ring is running before aborting it */
1299 hw_ring_state = xhci_read_64(xhci, >op_regs->cmd_ring);
1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
1301 (hw_ring_state & CMD_RING_RUNNING))  {
1302 /* Prevent new doorbell, and start command abort */
1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
1304 spin_unlock_irqrestore(>lock, flags);
1305 xhci_dbg(xhci, "Command timeout\n");
1306 ret = xhci_abort_cmd_ring(xhci);
1307 if (unlikely(ret == -ESHUTDOWN)) {
1308 xhci_err(xhci, "Abort command ring failed\n");
1309 xhci_cleanup_command_queue(xhci);
1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
1311 xhci_dbg(xhci, "xHCI host controller is dead.\n");
1312 }
1313 return;
1314 }
1315
1316 /* host removed. Bail out */
1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) {
1318 spin_unlock_irqrestore(>lock, flags);
1319 xhci_dbg(xhci, "host removed, ring start fail?\n");
1320 xhci_cleanup_command_queue(xhci);
1321 return;
1322 }

I think this part of code should be moved up to line 1295.


The XHCI_STATE_REMOVING and XHCI_STATE_DYING needs a rework,
I'm working on that.

Basically we want XHCI_STATE_REMOVING to mean that  all devices are going,
away and driver will be removed. Don't bother with re-calculating available
bandwidths after every device removal, but do use xhci hardware to disable
devices cleanly etc.

XHCI_STATE_DYING should mean hardware is not working/responding. Don't
bother writing any registers or queuing anything. Just return all
pending and cancelled URBs, notify core we died, and free all allocated memory.



1323
1324 /* command timeout on stopped ring, ring can't be aborted */
1325 xhci_dbg(xhci, "Command timeout on stopped ring\n");
1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
1327 spin_unlock_irqrestore(>lock, flags);

This part of code is tricky. I have no idea about in which case should this
code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
here, right?



This isn't changed it these patches.

It will remove the aborted commands and restart the ring. It's useful if we
want to abort a command but command ring was not running. (if for some
unkown reason it was stopped, or forgot to restart.

-Mathias


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-21 Thread Mathias Nyman

On 21.12.2016 08:17, Lu Baolu wrote:

Hi Mathias,

I have some comments for the implementation of xhci_abort_cmd_ring() below.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:

On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and  make my life easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.

Still completely untested, (well it compiles)

Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.

All additional testing is welcome, especially if you can trigger timeouts
and races

-Mathias




Below is the latest code. I put my comments in line.

  322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
  323 {
  324 u64 temp_64;
  325 int ret;
  326
  327 xhci_dbg(xhci, "Abort command ring\n");
  328
  329 reinit_completion(>cmd_ring_stop_completion);
  330
  331 temp_64 = xhci_read_64(xhci, >op_regs->cmd_ring);
  332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
  333 >op_regs->cmd_ring);

We should hold xhci->lock when we are modifying xhci registers
at runtime.



Makes sense, but we need to unlock it before sleeping or waiting for completion.
I need to look into that in more detail.

But this was an issue already before these changes.


The retry of setting CMD_RING_ABORT is not necessary according to
previous discussion. We have cleaned code for second try in
xhci_handle_command_timeout(). Need to clean up here as well.



Yes it can be cleaned up as well, but the two cases are a bit different.
The cleaned up one was about command ring not starting again after it was 
stopped.

This second try is a workaround for what we thought was the command ring failing
to stop in the first place, but is most likely due to the race that OGAWA 
Hirofumi
fixed.  It races if the stop command ring interrupt happens between writing the 
abort
bit and polling for the ring stopped bit. The interrupt hander may start the 
command
ring again, and we would believe we failed to stop it in the first place.

This race could probably be fixed by just extending the lock (and preventing
interrupts) to cover both writing the abort bit and polling for the command ring
running bit, as you pointed out here previously.

But then again I really like OGAWA Hiroumi's solution that separates the
command ring stopping from aborting commands and restarting the ring.

The current way of always restarting the command ring as a response to
a stop command ring command really limits its usage.

So, with this in mind most reasonable would be to
1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
3. remove unnecessary second abort try as a separate patch, send to usb-next
4. remove polling for the Command ring running (CRR), waiting for completion
   is enough, if completion times out then we can check CRR. for usb-next
   
I'll fix the typos these patches would introduce. Fixing old typos can be done as separate

patches later.

-Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-20 Thread Lu Baolu
Hi Mathias,

I have some comments for the implementation of
xhci_handle_command_timeout() as well.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git 
> timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and  make my life 
> easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of 
> counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.
>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races
>
> -Mathias
>
>

I post the code below and add my comments in line.

1276 void xhci_handle_command_timeout(struct work_struct *work)
1277 {
1278 struct xhci_hcd *xhci;
1279 int ret;
1280 unsigned long flags;
1281 u64 hw_ring_state;
1282
1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, 
cmd_timer);
1284
1285 spin_lock_irqsave(>lock, flags);
1286
1287 /*
1288  * If timeout work is pending, or current_cmd is NULL, it means we
1289  * raced with command completion. Command is handled so just 
return.
1290  */
1291 if (!xhci->current_cmd || delayed_work_pending(>cmd_timer)) {
1292 spin_unlock_irqrestore(>lock, flags);
1293 return;
1294 }
1295 /* mark this command to be cancelled */
1296 xhci->current_cmd->status = COMP_CMD_ABORT;
1297
1298 /* Make sure command ring is running before aborting it */
1299 hw_ring_state = xhci_read_64(xhci, >op_regs->cmd_ring);
1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
1301 (hw_ring_state & CMD_RING_RUNNING))  {
1302 /* Prevent new doorbell, and start command abort */
1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
1304 spin_unlock_irqrestore(>lock, flags);
1305 xhci_dbg(xhci, "Command timeout\n");
1306 ret = xhci_abort_cmd_ring(xhci);
1307 if (unlikely(ret == -ESHUTDOWN)) {
1308 xhci_err(xhci, "Abort command ring failed\n");
1309 xhci_cleanup_command_queue(xhci);
1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
1311 xhci_dbg(xhci, "xHCI host controller is dead.\n");
1312 }
1313 return;
1314 }
1315
1316 /* host removed. Bail out */
1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) {
1318 spin_unlock_irqrestore(>lock, flags);
1319 xhci_dbg(xhci, "host removed, ring start fail?\n");
1320 xhci_cleanup_command_queue(xhci);
1321 return;
1322 }

I think this part of code should be moved up to line 1295.

1323
1324 /* command timeout on stopped ring, ring can't be aborted */
1325 xhci_dbg(xhci, "Command timeout on stopped ring\n");
1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
1327 spin_unlock_irqrestore(>lock, flags);

This part of code is tricky. I have no idea about in which case should this
code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
here, right?

1328 return;
1329 }

Best regards,
Lu Baolu
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-20 Thread Lu Baolu
Hi Mathias,

I have some comments for the implementation of xhci_abort_cmd_ring() below.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git 
> timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and  make my life 
> easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of 
> counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.
>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races
>
> -Mathias
>
>

Below is the latest code. I put my comments in line.

 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
 323 {
 324 u64 temp_64;
 325 int ret;
 326
 327 xhci_dbg(xhci, "Abort command ring\n");
 328
 329 reinit_completion(>cmd_ring_stop_completion);
 330
 331 temp_64 = xhci_read_64(xhci, >op_regs->cmd_ring);
 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
 333 >op_regs->cmd_ring);

We should hold xhci->lock when we are modifying xhci registers
at runtime.


 334
 335 /* Section 4.6.1.2 of xHCI 1.0 spec says software should
 336  * time the completion od all xHCI commands, including

s/od/of/g

 337  * the Command Abort operation. If software doesn't see
 338  * CRR negated in a timely manner (e.g. longer than 5
 339  * seconds), then it should assume that the there are
 340  * larger problems with the xHC and assert HCRST.
 341  */
 342 ret = xhci_handshake(>op_regs->cmd_ring,
 343 CMD_RING_RUNNING, 0, 5 * 1000 * 1000);
 344 if (ret < 0) {
 345 /* we are about to kill xhci, give it one more chance */

The retry of setting CMD_RING_ABORT is not necessary according to
previous discussion. We have cleaned code for second try in
xhci_handle_command_timeout(). Need to clean up here as well.

 346 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
 347   >op_regs->cmd_ring);
 348 udelay(1000);
 349 ret = xhci_handshake(>op_regs->cmd_ring,
 350  CMD_RING_RUNNING, 0, 3 * 1000 * 1000);
 351 if (ret < 0) {
 352 xhci_err(xhci, "Stopped the command ring failed, "
 353  "maybe the host is dead\n");
 354 xhci->xhc_state |= XHCI_STATE_DYING;
 355 xhci_halt(xhci);
 356 return -ESHUTDOWN;
 357 }
 358 }
 359 /*
 360  * Writing the CMD_RING_ABORT bit should cause a cmd completion 
event,
 361  * however on some host hw the CMD_RING_RUNNING bit is correctly 
cleared
 362  * but the completion event in never sent. Wait 2 secs (arbitrary

s/in never sent/is never sent/g

 363  * number) to handle those cases after negation of 
CMD_RING_RUNNING.
 364  */

This should be implemented with a quirk bit. It's not common for all host 
controllers.

 365 if (!wait_for_completion_timeout(>cmd_ring_stop_completion,
 366  2 * HZ)) {
 367 xhci_dbg(xhci, "No stop event for abort, ring start 
fail?\n");
 368 xhci_cleanup_command_queue(xhci);
 369 } else {
 370 unsigned long flags;
 371
 372 spin_lock_irqsave(>lock, flags);
 373 xhci_handle_stopped_cmd_ring(xhci, 
xhci_next_queued_cmd(xhci));
 374 spin_unlock_irqrestore(>lock, flags);
 375 }
 376 return 0;
 377 }

Best regards,
Lu Baolu
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-20 Thread Baolin Wang
Hi Mathias,

On 20 December 2016 at 23:13, Mathias Nyman
 wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
> timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and  make my life
> easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of
> counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
> command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.

How about applying below patch in your 'timeout_race_fixes' branch?
[PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout
https://lkml.org/lkml/2016/12/13/94

>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races

I will try to test these patches.

-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-20 Thread Mathias Nyman

On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and  make my life easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.

Still completely untested, (well it compiles)

Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.

All additional testing is welcome, especially if you can trigger timeouts
and races

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Baolin Wang
Hi,

On 20 December 2016 at 15:18, Lu Baolu  wrote:
> Hi,
>
> On 12/20/2016 02:46 PM, Baolin Wang wrote:
>> On 20 December 2016 at 14:39, Lu Baolu  wrote:
>>> Hi,
>>>
>>> On 12/20/2016 02:06 PM, Baolin Wang wrote:
 Hi,

 On 20 December 2016 at 12:29, Lu Baolu  wrote:
> Hi Mathias,
>
> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>> On 19.12.2016 13:34, Baolin Wang wrote:
>>> Hi Mathias,
>>>
>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>  wrote:
 On 13.12.2016 05:21, Baolin Wang wrote:
> Hi Mathias,
>
> On 12 December 2016 at 23:52, Mathias Nyman
>  wrote:
>> On 05.12.2016 09:51, Baolin Wang wrote:
>>> If a command event is found on the event ring during an interrupt,
>>> we need to stop the command timer with del_timer(). Since 
>>> del_timer()
>>> can fail if the timer is running and waiting on the xHCI lock, then
>>> it maybe get the wrong timeout command in 
>>> xhci_handle_command_timeout()
>>> if host fetched a new command and updated the xhci->current_cmd in
>>> handle_cmd_completion(). For this situation, we need a way to signal
>>> to the command timer that everything is fine and it should exit.
>>
>> Ah, right, this could actually happen.
>>
>>> We should introduce a counter (xhci->current_cmd_pending) for the 
>>> number
>>> of pending commands. If we need to cancel the command timer and
>>> del_timer()
>>> succeeds, we decrement the number of pending commands. If 
>>> del_timer()
>>> fails,
>>> we leave the number of pending commands alone.
>>>
>>> For handling timeout command, in xhci_handle_command_timeout() we 
>>> will
>>> check
>>> the counter after decrementing it, if the counter
>>> (xhci->current_cmd_pending)
>>> is 0, which means xhci->current_cmd is the right timeout command. 
>>> If the
>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>> current
>>> timeout command has been handled by host and host has fetched new
>>> command
>>> as
>>> xhci->current_cmd, then just return and wait for new current 
>>> command.
>>
>> A counter like this could work.
>>
>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>> event, this seems to cover both.
>>
>> quick check, case 1: timeout and cmd completion at the same time.
>>
>> cpu1cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>> --completion irq fires---- timer times out at same 
>> time--
>> handle_cmd_completion() handle_cmd_timeout(),)
>> lock(xhci_lock  )   spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> cur_cmd = list_next(), p++ (=2)
>> unlock(xhci_lock)
>>   lock(xhci_lock)
>>   p-- (=1)
>>   if (p > 0), exit
>> OK works
>>
>> case 2: normal timeout case with ABORT+STOP, no race.
>>
>> cpu1cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>>   handle_cmd_timeout()
>>   p-- (P=0), don't exit
>>   mod_timer(), p++ (P=1)
>>   write_abort_bit()
>> handle_cmd_comletion(ABORT)
>> del_timer(), ok, p-- (p = 0)
>> handle_cmd_completion(STOP)
>> del_timer(), fail, (P=0)
>> handle_stopped_cmd_ring()
>> cur_cmd = list_next(), p++ (=1)
>> mod_timer()
>>
>> OK, works, and same for just STOP case, with the only difference that
>> during handle_cmd_completion(STOP) p is decremented (p--)
> Yes, that's the cases what I want to handle, thanks for your explicit
> explanation.
>
 Gave this some more thought over the weekend, and this implementation
 doesn't solve the case when the last command times out and races with 
 the
 completion handler:

 cpu1cpu2

 queue_command(first), p++ (=1)
 --completion irq fires---- timer times out at same 
 time--
 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Lu Baolu
Hi,

On 12/20/2016 02:46 PM, Baolin Wang wrote:
> On 20 December 2016 at 14:39, Lu Baolu  wrote:
>> Hi,
>>
>> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>>> Hi,
>>>
>>> On 20 December 2016 at 12:29, Lu Baolu  wrote:
 Hi Mathias,

 On 12/19/2016 08:13 PM, Mathias Nyman wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:
>> Hi Mathias,
>>
>> On 19 December 2016 at 18:33, Mathias Nyman
>>  wrote:
>>> On 13.12.2016 05:21, Baolin Wang wrote:
 Hi Mathias,

 On 12 December 2016 at 23:52, Mathias Nyman
  wrote:
> On 05.12.2016 09:51, Baolin Wang wrote:
>> If a command event is found on the event ring during an interrupt,
>> we need to stop the command timer with del_timer(). Since del_timer()
>> can fail if the timer is running and waiting on the xHCI lock, then
>> it maybe get the wrong timeout command in 
>> xhci_handle_command_timeout()
>> if host fetched a new command and updated the xhci->current_cmd in
>> handle_cmd_completion(). For this situation, we need a way to signal
>> to the command timer that everything is fine and it should exit.
>
> Ah, right, this could actually happen.
>
>> We should introduce a counter (xhci->current_cmd_pending) for the 
>> number
>> of pending commands. If we need to cancel the command timer and
>> del_timer()
>> succeeds, we decrement the number of pending commands. If del_timer()
>> fails,
>> we leave the number of pending commands alone.
>>
>> For handling timeout command, in xhci_handle_command_timeout() we 
>> will
>> check
>> the counter after decrementing it, if the counter
>> (xhci->current_cmd_pending)
>> is 0, which means xhci->current_cmd is the right timeout command. If 
>> the
>> counter (xhci->current_cmd_pending) is greater than 0, which means
>> current
>> timeout command has been handled by host and host has fetched new
>> command
>> as
>> xhci->current_cmd, then just return and wait for new current command.
>
> A counter like this could work.
>
> Writing the abort bit can generate either ABORT+STOP, or just STOP
> event, this seems to cover both.
>
> quick check, case 1: timeout and cmd completion at the same time.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> --completion irq fires---- timer times out at same 
> time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock  )   spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> cur_cmd = list_next(), p++ (=2)
> unlock(xhci_lock)
>   lock(xhci_lock)
>   p-- (=1)
>   if (p > 0), exit
> OK works
>
> case 2: normal timeout case with ABORT+STOP, no race.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
>   handle_cmd_timeout()
>   p-- (P=0), don't exit
>   mod_timer(), p++ (P=1)
>   write_abort_bit()
> handle_cmd_comletion(ABORT)
> del_timer(), ok, p-- (p = 0)
> handle_cmd_completion(STOP)
> del_timer(), fail, (P=0)
> handle_stopped_cmd_ring()
> cur_cmd = list_next(), p++ (=1)
> mod_timer()
>
> OK, works, and same for just STOP case, with the only difference that
> during handle_cmd_completion(STOP) p is decremented (p--)
 Yes, that's the cases what I want to handle, thanks for your explicit
 explanation.

>>> Gave this some more thought over the weekend, and this implementation
>>> doesn't solve the case when the last command times out and races with 
>>> the
>>> completion handler:
>>>
>>> cpu1cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires---- timer times out at same 
>>> time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock )spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Baolin Wang
On 20 December 2016 at 14:39, Lu Baolu  wrote:
> Hi,
>
> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>> Hi,
>>
>> On 20 December 2016 at 12:29, Lu Baolu  wrote:
>>> Hi Mathias,
>>>
>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
 On 19.12.2016 13:34, Baolin Wang wrote:
> Hi Mathias,
>
> On 19 December 2016 at 18:33, Mathias Nyman
>  wrote:
>> On 13.12.2016 05:21, Baolin Wang wrote:
>>> Hi Mathias,
>>>
>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>  wrote:
 On 05.12.2016 09:51, Baolin Wang wrote:
>
> If a command event is found on the event ring during an interrupt,
> we need to stop the command timer with del_timer(). Since del_timer()
> can fail if the timer is running and waiting on the xHCI lock, then
> it maybe get the wrong timeout command in 
> xhci_handle_command_timeout()
> if host fetched a new command and updated the xhci->current_cmd in
> handle_cmd_completion(). For this situation, we need a way to signal
> to the command timer that everything is fine and it should exit.


 Ah, right, this could actually happen.

>
> We should introduce a counter (xhci->current_cmd_pending) for the 
> number
> of pending commands. If we need to cancel the command timer and
> del_timer()
> succeeds, we decrement the number of pending commands. If del_timer()
> fails,
> we leave the number of pending commands alone.
>
> For handling timeout command, in xhci_handle_command_timeout() we will
> check
> the counter after decrementing it, if the counter
> (xhci->current_cmd_pending)
> is 0, which means xhci->current_cmd is the right timeout command. If 
> the
> counter (xhci->current_cmd_pending) is greater than 0, which means
> current
> timeout command has been handled by host and host has fetched new
> command
> as
> xhci->current_cmd, then just return and wait for new current command.


 A counter like this could work.

 Writing the abort bit can generate either ABORT+STOP, or just STOP
 event, this seems to cover both.

 quick check, case 1: timeout and cmd completion at the same time.

 cpu1cpu2

 queue_command(first), p++ (=1)
 queue_command(more),
 --completion irq fires---- timer times out at same 
 time--
 handle_cmd_completion() handle_cmd_timeout(),)
 lock(xhci_lock  )   spin_on(xhci_lock)
 del_timer() fail, p (=1, nochange)
 cur_cmd = list_next(), p++ (=2)
 unlock(xhci_lock)
   lock(xhci_lock)
   p-- (=1)
   if (p > 0), exit
 OK works

 case 2: normal timeout case with ABORT+STOP, no race.

 cpu1cpu2

 queue_command(first), p++ (=1)
 queue_command(more),
   handle_cmd_timeout()
   p-- (P=0), don't exit
   mod_timer(), p++ (P=1)
   write_abort_bit()
 handle_cmd_comletion(ABORT)
 del_timer(), ok, p-- (p = 0)
 handle_cmd_completion(STOP)
 del_timer(), fail, (P=0)
 handle_stopped_cmd_ring()
 cur_cmd = list_next(), p++ (=1)
 mod_timer()

 OK, works, and same for just STOP case, with the only difference that
 during handle_cmd_completion(STOP) p is decremented (p--)
>>>
>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>> explanation.
>>>
>> Gave this some more thought over the weekend, and this implementation
>> doesn't solve the case when the last command times out and races with the
>> completion handler:
>>
>> cpu1cpu2
>>
>> queue_command(first), p++ (=1)
>> --completion irq fires---- timer times out at same time--
>> handle_cmd_completion() handle_cmd_timeout(),)
>> lock(xhci_lock )spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> no more commands, P (=1, nochange)
>> unlock(xhci_lock)
>>  lock(xhci_lock)
>>  p-- (=0)
>> 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Lu Baolu
Hi,

On 12/20/2016 02:06 PM, Baolin Wang wrote:
> Hi,
>
> On 20 December 2016 at 12:29, Lu Baolu  wrote:
>> Hi Mathias,
>>
>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>> On 19.12.2016 13:34, Baolin Wang wrote:
 Hi Mathias,

 On 19 December 2016 at 18:33, Mathias Nyman
  wrote:
> On 13.12.2016 05:21, Baolin Wang wrote:
>> Hi Mathias,
>>
>> On 12 December 2016 at 23:52, Mathias Nyman
>>  wrote:
>>> On 05.12.2016 09:51, Baolin Wang wrote:

 If a command event is found on the event ring during an interrupt,
 we need to stop the command timer with del_timer(). Since del_timer()
 can fail if the timer is running and waiting on the xHCI lock, then
 it maybe get the wrong timeout command in xhci_handle_command_timeout()
 if host fetched a new command and updated the xhci->current_cmd in
 handle_cmd_completion(). For this situation, we need a way to signal
 to the command timer that everything is fine and it should exit.
>>>
>>>
>>> Ah, right, this could actually happen.
>>>

 We should introduce a counter (xhci->current_cmd_pending) for the 
 number
 of pending commands. If we need to cancel the command timer and
 del_timer()
 succeeds, we decrement the number of pending commands. If del_timer()
 fails,
 we leave the number of pending commands alone.

 For handling timeout command, in xhci_handle_command_timeout() we will
 check
 the counter after decrementing it, if the counter
 (xhci->current_cmd_pending)
 is 0, which means xhci->current_cmd is the right timeout command. If 
 the
 counter (xhci->current_cmd_pending) is greater than 0, which means
 current
 timeout command has been handled by host and host has fetched new
 command
 as
 xhci->current_cmd, then just return and wait for new current command.
>>>
>>>
>>> A counter like this could work.
>>>
>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>> event, this seems to cover both.
>>>
>>> quick check, case 1: timeout and cmd completion at the same time.
>>>
>>> cpu1cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>> --completion irq fires---- timer times out at same 
>>> time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock  )   spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> cur_cmd = list_next(), p++ (=2)
>>> unlock(xhci_lock)
>>>   lock(xhci_lock)
>>>   p-- (=1)
>>>   if (p > 0), exit
>>> OK works
>>>
>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>
>>> cpu1cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>>   handle_cmd_timeout()
>>>   p-- (P=0), don't exit
>>>   mod_timer(), p++ (P=1)
>>>   write_abort_bit()
>>> handle_cmd_comletion(ABORT)
>>> del_timer(), ok, p-- (p = 0)
>>> handle_cmd_completion(STOP)
>>> del_timer(), fail, (P=0)
>>> handle_stopped_cmd_ring()
>>> cur_cmd = list_next(), p++ (=1)
>>> mod_timer()
>>>
>>> OK, works, and same for just STOP case, with the only difference that
>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>
>> Yes, that's the cases what I want to handle, thanks for your explicit
>> explanation.
>>
> Gave this some more thought over the weekend, and this implementation
> doesn't solve the case when the last command times out and races with the
> completion handler:
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> --completion irq fires---- timer times out at same time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock )spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> no more commands, P (=1, nochange)
> unlock(xhci_lock)
>  lock(xhci_lock)
>  p-- (=0)
>  p == 0, continue, even if we 
> should
> not.
>For this we still need to rely 
> on
> checking cur_cmd == NULL in the 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Baolin Wang
Hi,

On 20 December 2016 at 12:29, Lu Baolu  wrote:
> Hi Mathias,
>
> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>> On 19.12.2016 13:34, Baolin Wang wrote:
>>> Hi Mathias,
>>>
>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>  wrote:
 On 13.12.2016 05:21, Baolin Wang wrote:
>
> Hi Mathias,
>
> On 12 December 2016 at 23:52, Mathias Nyman
>  wrote:
>>
>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>
>>>
>>> If a command event is found on the event ring during an interrupt,
>>> we need to stop the command timer with del_timer(). Since del_timer()
>>> can fail if the timer is running and waiting on the xHCI lock, then
>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>> if host fetched a new command and updated the xhci->current_cmd in
>>> handle_cmd_completion(). For this situation, we need a way to signal
>>> to the command timer that everything is fine and it should exit.
>>
>>
>>
>> Ah, right, this could actually happen.
>>
>>>
>>>
>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>> of pending commands. If we need to cancel the command timer and
>>> del_timer()
>>> succeeds, we decrement the number of pending commands. If del_timer()
>>> fails,
>>> we leave the number of pending commands alone.
>>>
>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>> check
>>> the counter after decrementing it, if the counter
>>> (xhci->current_cmd_pending)
>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>> current
>>> timeout command has been handled by host and host has fetched new
>>> command
>>> as
>>> xhci->current_cmd, then just return and wait for new current command.
>>
>>
>>
>> A counter like this could work.
>>
>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>> event, this seems to cover both.
>>
>> quick check, case 1: timeout and cmd completion at the same time.
>>
>> cpu1cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>> --completion irq fires---- timer times out at same time--
>> handle_cmd_completion() handle_cmd_timeout(),)
>> lock(xhci_lock  )   spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> cur_cmd = list_next(), p++ (=2)
>> unlock(xhci_lock)
>>   lock(xhci_lock)
>>   p-- (=1)
>>   if (p > 0), exit
>> OK works
>>
>> case 2: normal timeout case with ABORT+STOP, no race.
>>
>> cpu1cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>>   handle_cmd_timeout()
>>   p-- (P=0), don't exit
>>   mod_timer(), p++ (P=1)
>>   write_abort_bit()
>> handle_cmd_comletion(ABORT)
>> del_timer(), ok, p-- (p = 0)
>> handle_cmd_completion(STOP)
>> del_timer(), fail, (P=0)
>> handle_stopped_cmd_ring()
>> cur_cmd = list_next(), p++ (=1)
>> mod_timer()
>>
>> OK, works, and same for just STOP case, with the only difference that
>> during handle_cmd_completion(STOP) p is decremented (p--)
>
>
> Yes, that's the cases what I want to handle, thanks for your explicit
> explanation.
>

 Gave this some more thought over the weekend, and this implementation
 doesn't solve the case when the last command times out and races with the
 completion handler:

 cpu1cpu2

 queue_command(first), p++ (=1)
 --completion irq fires---- timer times out at same time--
 handle_cmd_completion() handle_cmd_timeout(),)
 lock(xhci_lock )spin_on(xhci_lock)
 del_timer() fail, p (=1, nochange)
 no more commands, P (=1, nochange)
 unlock(xhci_lock)
  lock(xhci_lock)
  p-- (=0)
  p == 0, continue, even if we 
 should
 not.
For this we still need to rely 
 on
 checking cur_cmd == NULL in the timeout function.
 (Baolus patch sets it to NULL if there are no more commands pending)
>>>
>>> As I pointed out in patch 1 of this patchset, 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Lu Baolu
Hi Mathias,

On 12/19/2016 08:13 PM, Mathias Nyman wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:
>> Hi Mathias,
>>
>> On 19 December 2016 at 18:33, Mathias Nyman
>>  wrote:
>>> On 13.12.2016 05:21, Baolin Wang wrote:

 Hi Mathias,

 On 12 December 2016 at 23:52, Mathias Nyman
  wrote:
>
> On 05.12.2016 09:51, Baolin Wang wrote:
>>
>>
>> If a command event is found on the event ring during an interrupt,
>> we need to stop the command timer with del_timer(). Since del_timer()
>> can fail if the timer is running and waiting on the xHCI lock, then
>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>> if host fetched a new command and updated the xhci->current_cmd in
>> handle_cmd_completion(). For this situation, we need a way to signal
>> to the command timer that everything is fine and it should exit.
>
>
>
> Ah, right, this could actually happen.
>
>>
>>
>> We should introduce a counter (xhci->current_cmd_pending) for the number
>> of pending commands. If we need to cancel the command timer and
>> del_timer()
>> succeeds, we decrement the number of pending commands. If del_timer()
>> fails,
>> we leave the number of pending commands alone.
>>
>> For handling timeout command, in xhci_handle_command_timeout() we will
>> check
>> the counter after decrementing it, if the counter
>> (xhci->current_cmd_pending)
>> is 0, which means xhci->current_cmd is the right timeout command. If the
>> counter (xhci->current_cmd_pending) is greater than 0, which means
>> current
>> timeout command has been handled by host and host has fetched new
>> command
>> as
>> xhci->current_cmd, then just return and wait for new current command.
>
>
>
> A counter like this could work.
>
> Writing the abort bit can generate either ABORT+STOP, or just STOP
> event, this seems to cover both.
>
> quick check, case 1: timeout and cmd completion at the same time.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> --completion irq fires---- timer times out at same time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock  )   spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> cur_cmd = list_next(), p++ (=2)
> unlock(xhci_lock)
>   lock(xhci_lock)
>   p-- (=1)
>   if (p > 0), exit
> OK works
>
> case 2: normal timeout case with ABORT+STOP, no race.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
>   handle_cmd_timeout()
>   p-- (P=0), don't exit
>   mod_timer(), p++ (P=1)
>   write_abort_bit()
> handle_cmd_comletion(ABORT)
> del_timer(), ok, p-- (p = 0)
> handle_cmd_completion(STOP)
> del_timer(), fail, (P=0)
> handle_stopped_cmd_ring()
> cur_cmd = list_next(), p++ (=1)
> mod_timer()
>
> OK, works, and same for just STOP case, with the only difference that
> during handle_cmd_completion(STOP) p is decremented (p--)


 Yes, that's the cases what I want to handle, thanks for your explicit
 explanation.

>>>
>>> Gave this some more thought over the weekend, and this implementation
>>> doesn't solve the case when the last command times out and races with the
>>> completion handler:
>>>
>>> cpu1cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires---- timer times out at same time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock )spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)
>>>  lock(xhci_lock)
>>>  p-- (=0)
>>>  p == 0, continue, even if we should
>>> not.
>>>For this we still need to rely on
>>> checking cur_cmd == NULL in the timeout function.
>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>
>> As I pointed out in patch 1 of this patchset, this patchset is based
>> on Lu Baolu's new fix patch:
>> usb: xhci: fix possible wild pointer
>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>
>> After applying Baolu's patch, after decrement the 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Baolin Wang
On 19 December 2016 at 20:13, Mathias Nyman  wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 19 December 2016 at 18:33, Mathias Nyman
>>  wrote:
>>>
>>> On 13.12.2016 05:21, Baolin Wang wrote:


 Hi Mathias,

 On 12 December 2016 at 23:52, Mathias Nyman
  wrote:
>
>
> On 05.12.2016 09:51, Baolin Wang wrote:
>>
>>
>>
>> If a command event is found on the event ring during an interrupt,
>> we need to stop the command timer with del_timer(). Since del_timer()
>> can fail if the timer is running and waiting on the xHCI lock, then
>> it maybe get the wrong timeout command in
>> xhci_handle_command_timeout()
>> if host fetched a new command and updated the xhci->current_cmd in
>> handle_cmd_completion(). For this situation, we need a way to signal
>> to the command timer that everything is fine and it should exit.
>
>
>
>
> Ah, right, this could actually happen.
>
>>
>>
>> We should introduce a counter (xhci->current_cmd_pending) for the
>> number
>> of pending commands. If we need to cancel the command timer and
>> del_timer()
>> succeeds, we decrement the number of pending commands. If del_timer()
>> fails,
>> we leave the number of pending commands alone.
>>
>> For handling timeout command, in xhci_handle_command_timeout() we will
>> check
>> the counter after decrementing it, if the counter
>> (xhci->current_cmd_pending)
>> is 0, which means xhci->current_cmd is the right timeout command. If
>> the
>> counter (xhci->current_cmd_pending) is greater than 0, which means
>> current
>> timeout command has been handled by host and host has fetched new
>> command
>> as
>> xhci->current_cmd, then just return and wait for new current command.
>
>
>
>
> A counter like this could work.
>
> Writing the abort bit can generate either ABORT+STOP, or just STOP
> event, this seems to cover both.
>
> quick check, case 1: timeout and cmd completion at the same time.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> --completion irq fires---- timer times out at same
> time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock  )   spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> cur_cmd = list_next(), p++ (=2)
> unlock(xhci_lock)
>   lock(xhci_lock)
>   p-- (=1)
>   if (p > 0), exit
> OK works
>
> case 2: normal timeout case with ABORT+STOP, no race.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
>   handle_cmd_timeout()
>   p-- (P=0), don't exit
>   mod_timer(), p++ (P=1)
>   write_abort_bit()
> handle_cmd_comletion(ABORT)
> del_timer(), ok, p-- (p = 0)
> handle_cmd_completion(STOP)
> del_timer(), fail, (P=0)
> handle_stopped_cmd_ring()
> cur_cmd = list_next(), p++ (=1)
> mod_timer()
>
> OK, works, and same for just STOP case, with the only difference that
> during handle_cmd_completion(STOP) p is decremented (p--)



 Yes, that's the cases what I want to handle, thanks for your explicit
 explanation.

>>>
>>> Gave this some more thought over the weekend, and this implementation
>>> doesn't solve the case when the last command times out and races with the
>>> completion handler:
>>>
>>> cpu1cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires---- timer times out at same time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock )spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)
>>>  lock(xhci_lock)
>>>  p-- (=0)
>>>  p == 0, continue, even if we
>>> should
>>> not.
>>>For this we still need to rely
>>> on
>>> checking cur_cmd == NULL in the timeout function.
>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>
>>
>> As I pointed out in patch 1 of this patchset, this patchset is based
>> on Lu Baolu's new fix patch:
>> usb: xhci: fix possible wild pointer
>> 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Mathias Nyman

On 19.12.2016 13:34, Baolin Wang wrote:

Hi Mathias,

On 19 December 2016 at 18:33, Mathias Nyman
 wrote:

On 13.12.2016 05:21, Baolin Wang wrote:


Hi Mathias,

On 12 December 2016 at 23:52, Mathias Nyman
 wrote:


On 05.12.2016 09:51, Baolin Wang wrote:



If a command event is found on the event ring during an interrupt,
we need to stop the command timer with del_timer(). Since del_timer()
can fail if the timer is running and waiting on the xHCI lock, then
it maybe get the wrong timeout command in xhci_handle_command_timeout()
if host fetched a new command and updated the xhci->current_cmd in
handle_cmd_completion(). For this situation, we need a way to signal
to the command timer that everything is fine and it should exit.




Ah, right, this could actually happen.




We should introduce a counter (xhci->current_cmd_pending) for the number
of pending commands. If we need to cancel the command timer and
del_timer()
succeeds, we decrement the number of pending commands. If del_timer()
fails,
we leave the number of pending commands alone.

For handling timeout command, in xhci_handle_command_timeout() we will
check
the counter after decrementing it, if the counter
(xhci->current_cmd_pending)
is 0, which means xhci->current_cmd is the right timeout command. If the
counter (xhci->current_cmd_pending) is greater than 0, which means
current
timeout command has been handled by host and host has fetched new
command
as
xhci->current_cmd, then just return and wait for new current command.




A counter like this could work.

Writing the abort bit can generate either ABORT+STOP, or just STOP
event, this seems to cover both.

quick check, case 1: timeout and cmd completion at the same time.

cpu1cpu2

queue_command(first), p++ (=1)
queue_command(more),
--completion irq fires---- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock  )   spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
cur_cmd = list_next(), p++ (=2)
unlock(xhci_lock)
  lock(xhci_lock)
  p-- (=1)
  if (p > 0), exit
OK works

case 2: normal timeout case with ABORT+STOP, no race.

cpu1cpu2

queue_command(first), p++ (=1)
queue_command(more),
  handle_cmd_timeout()
  p-- (P=0), don't exit
  mod_timer(), p++ (P=1)
  write_abort_bit()
handle_cmd_comletion(ABORT)
del_timer(), ok, p-- (p = 0)
handle_cmd_completion(STOP)
del_timer(), fail, (P=0)
handle_stopped_cmd_ring()
cur_cmd = list_next(), p++ (=1)
mod_timer()

OK, works, and same for just STOP case, with the only difference that
during handle_cmd_completion(STOP) p is decremented (p--)



Yes, that's the cases what I want to handle, thanks for your explicit
explanation.



Gave this some more thought over the weekend, and this implementation
doesn't solve the case when the last command times out and races with the
completion handler:

cpu1cpu2

queue_command(first), p++ (=1)
--completion irq fires---- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock )spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
no more commands, P (=1, nochange)
unlock(xhci_lock)
 lock(xhci_lock)
 p-- (=0)
 p == 0, continue, even if we should
not.
   For this we still need to rely on
checking cur_cmd == NULL in the timeout function.
(Baolus patch sets it to NULL if there are no more commands pending)


As I pointed out in patch 1 of this patchset, this patchset is based
on Lu Baolu's new fix patch:
usb: xhci: fix possible wild pointer
https://www.spinics.net/lists/linux-usb/msg150219.html

After applying Baolu's patch, after decrement the counter, we will
check the xhci->cur_command if is NULL. So in this situation:
cpu1cpu2

  queue_command(first), p++ (=1)
  --completion irq fires---- timer times out at same time--
  handle_cmd_completion() handle_cmd_timeout(),)
  lock(xhci_lock )spin_on(xhci_lock)
  del_timer() fail, p (=1, nochange)
  no more commands, P (=1, nochange)
  unlock(xhci_lock)
  lock(xhci_lock)
  p-- (=0)
  no current command, return
  if (!xhci->current_cmd) {
 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Baolin Wang
Hi Mathias,

On 19 December 2016 at 18:33, Mathias Nyman
 wrote:
> On 13.12.2016 05:21, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 12 December 2016 at 23:52, Mathias Nyman
>>  wrote:
>>>
>>> On 05.12.2016 09:51, Baolin Wang wrote:


 If a command event is found on the event ring during an interrupt,
 we need to stop the command timer with del_timer(). Since del_timer()
 can fail if the timer is running and waiting on the xHCI lock, then
 it maybe get the wrong timeout command in xhci_handle_command_timeout()
 if host fetched a new command and updated the xhci->current_cmd in
 handle_cmd_completion(). For this situation, we need a way to signal
 to the command timer that everything is fine and it should exit.
>>>
>>>
>>>
>>> Ah, right, this could actually happen.
>>>


 We should introduce a counter (xhci->current_cmd_pending) for the number
 of pending commands. If we need to cancel the command timer and
 del_timer()
 succeeds, we decrement the number of pending commands. If del_timer()
 fails,
 we leave the number of pending commands alone.

 For handling timeout command, in xhci_handle_command_timeout() we will
 check
 the counter after decrementing it, if the counter
 (xhci->current_cmd_pending)
 is 0, which means xhci->current_cmd is the right timeout command. If the
 counter (xhci->current_cmd_pending) is greater than 0, which means
 current
 timeout command has been handled by host and host has fetched new
 command
 as
 xhci->current_cmd, then just return and wait for new current command.
>>>
>>>
>>>
>>> A counter like this could work.
>>>
>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>> event, this seems to cover both.
>>>
>>> quick check, case 1: timeout and cmd completion at the same time.
>>>
>>> cpu1cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>> --completion irq fires---- timer times out at same time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock  )   spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> cur_cmd = list_next(), p++ (=2)
>>> unlock(xhci_lock)
>>>  lock(xhci_lock)
>>>  p-- (=1)
>>>  if (p > 0), exit
>>> OK works
>>>
>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>
>>> cpu1cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>>  handle_cmd_timeout()
>>>  p-- (P=0), don't exit
>>>  mod_timer(), p++ (P=1)
>>>  write_abort_bit()
>>> handle_cmd_comletion(ABORT)
>>> del_timer(), ok, p-- (p = 0)
>>> handle_cmd_completion(STOP)
>>> del_timer(), fail, (P=0)
>>> handle_stopped_cmd_ring()
>>> cur_cmd = list_next(), p++ (=1)
>>> mod_timer()
>>>
>>> OK, works, and same for just STOP case, with the only difference that
>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>
>>
>> Yes, that's the cases what I want to handle, thanks for your explicit
>> explanation.
>>
>
> Gave this some more thought over the weekend, and this implementation
> doesn't solve the case when the last command times out and races with the
> completion handler:
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> --completion irq fires---- timer times out at same time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock )spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> no more commands, P (=1, nochange)
> unlock(xhci_lock)
> lock(xhci_lock)
> p-- (=0)
> p == 0, continue, even if we should
> not.
>   For this we still need to rely on
> checking cur_cmd == NULL in the timeout function.
> (Baolus patch sets it to NULL if there are no more commands pending)

As I pointed out in patch 1 of this patchset, this patchset is based
on Lu Baolu's new fix patch:
usb: xhci: fix possible wild pointer
https://www.spinics.net/lists/linux-usb/msg150219.html

After applying Baolu's patch, after decrement the counter, we will
check the xhci->cur_command if is NULL. So in this situation:
cpu1cpu2

 queue_command(first), p++ (=1)
 --completion irq fires---- timer times out at same time--
 handle_cmd_completion() handle_cmd_timeout(),)
 lock(xhci_lock )spin_on(xhci_lock)
 del_timer() 

Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-19 Thread Mathias Nyman

On 13.12.2016 05:21, Baolin Wang wrote:

Hi Mathias,

On 12 December 2016 at 23:52, Mathias Nyman
 wrote:

On 05.12.2016 09:51, Baolin Wang wrote:


If a command event is found on the event ring during an interrupt,
we need to stop the command timer with del_timer(). Since del_timer()
can fail if the timer is running and waiting on the xHCI lock, then
it maybe get the wrong timeout command in xhci_handle_command_timeout()
if host fetched a new command and updated the xhci->current_cmd in
handle_cmd_completion(). For this situation, we need a way to signal
to the command timer that everything is fine and it should exit.



Ah, right, this could actually happen.




We should introduce a counter (xhci->current_cmd_pending) for the number
of pending commands. If we need to cancel the command timer and
del_timer()
succeeds, we decrement the number of pending commands. If del_timer()
fails,
we leave the number of pending commands alone.

For handling timeout command, in xhci_handle_command_timeout() we will
check
the counter after decrementing it, if the counter
(xhci->current_cmd_pending)
is 0, which means xhci->current_cmd is the right timeout command. If the
counter (xhci->current_cmd_pending) is greater than 0, which means current
timeout command has been handled by host and host has fetched new command
as
xhci->current_cmd, then just return and wait for new current command.



A counter like this could work.

Writing the abort bit can generate either ABORT+STOP, or just STOP
event, this seems to cover both.

quick check, case 1: timeout and cmd completion at the same time.

cpu1cpu2

queue_command(first), p++ (=1)
queue_command(more),
--completion irq fires---- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock  )   spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
cur_cmd = list_next(), p++ (=2)
unlock(xhci_lock)
 lock(xhci_lock)
 p-- (=1)
 if (p > 0), exit
OK works

case 2: normal timeout case with ABORT+STOP, no race.

cpu1cpu2

queue_command(first), p++ (=1)
queue_command(more),
 handle_cmd_timeout()
 p-- (P=0), don't exit
 mod_timer(), p++ (P=1)
 write_abort_bit()
handle_cmd_comletion(ABORT)
del_timer(), ok, p-- (p = 0)
handle_cmd_completion(STOP)
del_timer(), fail, (P=0)
handle_stopped_cmd_ring()
cur_cmd = list_next(), p++ (=1)
mod_timer()

OK, works, and same for just STOP case, with the only difference that
during handle_cmd_completion(STOP) p is decremented (p--)


Yes, that's the cases what I want to handle, thanks for your explicit
explanation.



Gave this some more thought over the weekend, and this implementation
doesn't solve the case when the last command times out and races with the
completion handler:

cpu1cpu2

queue_command(first), p++ (=1)
--completion irq fires---- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock )spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
no more commands, P (=1, nochange)
unlock(xhci_lock)
lock(xhci_lock)
p-- (=0)
p == 0, continue, even if we should not.
  
For this we still need to rely on checking cur_cmd == NULL in the timeout function.

(Baolus patch sets it to NULL if there are no more commands pending)

And then we could replace the whole counter with a simple check if the timeout 
timer
is pending in the timeout function:

xhci_handle_command_timeout()
lock()
if (!cur_cmd || timer_pending(timeout_timer)) {
unlock();
return;
}

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-12 Thread Baolin Wang
Hi Mathias,

On 12 December 2016 at 23:52, Mathias Nyman
 wrote:
> On 05.12.2016 09:51, Baolin Wang wrote:
>>
>> If a command event is found on the event ring during an interrupt,
>> we need to stop the command timer with del_timer(). Since del_timer()
>> can fail if the timer is running and waiting on the xHCI lock, then
>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>> if host fetched a new command and updated the xhci->current_cmd in
>> handle_cmd_completion(). For this situation, we need a way to signal
>> to the command timer that everything is fine and it should exit.
>
>
> Ah, right, this could actually happen.
>
>>
>>
>> We should introduce a counter (xhci->current_cmd_pending) for the number
>> of pending commands. If we need to cancel the command timer and
>> del_timer()
>> succeeds, we decrement the number of pending commands. If del_timer()
>> fails,
>> we leave the number of pending commands alone.
>>
>> For handling timeout command, in xhci_handle_command_timeout() we will
>> check
>> the counter after decrementing it, if the counter
>> (xhci->current_cmd_pending)
>> is 0, which means xhci->current_cmd is the right timeout command. If the
>> counter (xhci->current_cmd_pending) is greater than 0, which means current
>> timeout command has been handled by host and host has fetched new command
>> as
>> xhci->current_cmd, then just return and wait for new current command.
>
>
> A counter like this could work.
>
> Writing the abort bit can generate either ABORT+STOP, or just STOP
> event, this seems to cover both.
>
> quick check, case 1: timeout and cmd completion at the same time.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> --completion irq fires---- timer times out at same time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock  )   spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> cur_cmd = list_next(), p++ (=2)
> unlock(xhci_lock)
> lock(xhci_lock)
> p-- (=1)
> if (p > 0), exit
> OK works
>
> case 2: normal timeout case with ABORT+STOP, no race.
>
> cpu1cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> handle_cmd_timeout()
> p-- (P=0), don't exit
> mod_timer(), p++ (P=1)
> write_abort_bit()
> handle_cmd_comletion(ABORT)
> del_timer(), ok, p-- (p = 0)
> handle_cmd_completion(STOP)
> del_timer(), fail, (P=0)
> handle_stopped_cmd_ring()
> cur_cmd = list_next(), p++ (=1)
> mod_timer()
>
> OK, works, and same for just STOP case, with the only difference that
> during handle_cmd_completion(STOP) p is decremented (p--)

Yes, that's the cases what I want to handle, thanks for your explicit
explanation.

>
> So unless there is a way to find out if cur_cmd is valid in command timeout
> in command timeout with the help of existing flags and lists this would be a
> working
> solution.
>
> -Mathias
>



-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-12 Thread Mathias Nyman

On 05.12.2016 09:51, Baolin Wang wrote:

If a command event is found on the event ring during an interrupt,
we need to stop the command timer with del_timer(). Since del_timer()
can fail if the timer is running and waiting on the xHCI lock, then
it maybe get the wrong timeout command in xhci_handle_command_timeout()
if host fetched a new command and updated the xhci->current_cmd in
handle_cmd_completion(). For this situation, we need a way to signal
to the command timer that everything is fine and it should exit.


Ah, right, this could actually happen.
 


We should introduce a counter (xhci->current_cmd_pending) for the number
of pending commands. If we need to cancel the command timer and del_timer()
succeeds, we decrement the number of pending commands. If del_timer() fails,
we leave the number of pending commands alone.

For handling timeout command, in xhci_handle_command_timeout() we will check
the counter after decrementing it, if the counter (xhci->current_cmd_pending)
is 0, which means xhci->current_cmd is the right timeout command. If the
counter (xhci->current_cmd_pending) is greater than 0, which means current
timeout command has been handled by host and host has fetched new command as
xhci->current_cmd, then just return and wait for new current command.


A counter like this could work.

Writing the abort bit can generate either ABORT+STOP, or just STOP
event, this seems to cover both.

quick check, case 1: timeout and cmd completion at the same time.

cpu1cpu2

queue_command(first), p++ (=1)
queue_command(more),
--completion irq fires---- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock  )   spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
cur_cmd = list_next(), p++ (=2)
unlock(xhci_lock)   
lock(xhci_lock)
p-- (=1)
if (p > 0), exit
OK works

case 2: normal timeout case with ABORT+STOP, no race.

cpu1cpu2

queue_command(first), p++ (=1)
queue_command(more),
handle_cmd_timeout()
p-- (P=0), don't exit
mod_timer(), p++ (P=1)
write_abort_bit()
handle_cmd_comletion(ABORT)
del_timer(), ok, p-- (p = 0)
handle_cmd_completion(STOP) 
del_timer(), fail, (P=0)
handle_stopped_cmd_ring()
cur_cmd = list_next(), p++ (=1)
mod_timer()

OK, works, and same for just STOP case, with the only difference that
during handle_cmd_completion(STOP) p is decremented (p--)

So unless there is a way to find out if cur_cmd is valid in command timeout
in command timeout with the help of existing flags and lists this would be a 
working
solution.

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] usb: host: xhci: Handle the right timeout command

2016-12-05 Thread Baolin Wang
If a command event is found on the event ring during an interrupt,
we need to stop the command timer with del_timer(). Since del_timer()
can fail if the timer is running and waiting on the xHCI lock, then
it maybe get the wrong timeout command in xhci_handle_command_timeout()
if host fetched a new command and updated the xhci->current_cmd in
handle_cmd_completion(). For this situation, we need a way to signal
to the command timer that everything is fine and it should exit.

We should introduce a counter (xhci->current_cmd_pending) for the number
of pending commands. If we need to cancel the command timer and del_timer()
succeeds, we decrement the number of pending commands. If del_timer() fails,
we leave the number of pending commands alone.

For handling timeout command, in xhci_handle_command_timeout() we will check
the counter after decrementing it, if the counter (xhci->current_cmd_pending)
is 0, which means xhci->current_cmd is the right timeout command. If the
counter (xhci->current_cmd_pending) is greater than 0, which means current
timeout command has been handled by host and host has fetched new command as
xhci->current_cmd, then just return and wait for new current command.

Signed-off-by: Baolin Wang 
---
 drivers/usb/host/xhci-ring.c |   29 -
 drivers/usb/host/xhci.h  |1 +
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 9965a4c..edc9ac2 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1253,6 +1253,7 @@ static void xhci_handle_stopped_cmd_ring(struct xhci_hcd 
*xhci,
if ((xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) &&
!(xhci->xhc_state & XHCI_STATE_DYING)) {
xhci->current_cmd = cur_cmd;
+   xhci->current_cmd_pending++;
mod_timer(>cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
xhci_ring_cmd_db(xhci);
}
@@ -1269,11 +1270,27 @@ void xhci_handle_command_timeout(unsigned long data)
xhci = (struct xhci_hcd *) data;
 
spin_lock_irqsave(>lock, flags);
+   xhci->current_cmd_pending--;
+
if (!xhci->current_cmd) {
spin_unlock_irqrestore(>lock, flags);
return;
}
 
+   /*
+* If the current_cmd_pending is 0, which means current command is
+* timeout.
+*
+* If the current_cmd_pending is greater than 0, which means current
+* timeout command has been handled by host and host has fetched new
+* command as xhci->current_cmd, then just return and wait for new
+* current command.
+*/
+   if (xhci->current_cmd_pending > 0) {
+   spin_unlock_irqrestore(>lock, flags);
+   return;
+   }
+
if (xhci->current_cmd->status == COMP_CMD_ABORT)
second_timeout = true;
xhci->current_cmd->status = COMP_CMD_ABORT;
@@ -1282,6 +1299,8 @@ void xhci_handle_command_timeout(unsigned long data)
hw_ring_state = xhci_read_64(xhci, >op_regs->cmd_ring);
if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
(hw_ring_state & CMD_RING_RUNNING))  {
+   /* Will add command timer again to wait for abort event */
+   xhci->current_cmd_pending++;
spin_unlock_irqrestore(>lock, flags);
xhci_dbg(xhci, "Command timeout\n");
ret = xhci_abort_cmd_ring(xhci);
@@ -1336,7 +1355,13 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
 
cmd = list_entry(xhci->cmd_list.next, struct xhci_command, cmd_list);
 
-   del_timer(>cmd_timer);
+   /*
+* If the command timer is running on another CPU, we don't decrement
+* current_cmd_pending, since we didn't successfully stop the command
+* timer.
+*/
+   if (del_timer(>cmd_timer))
+   xhci->current_cmd_pending--;
 
trace_xhci_cmd_completion(cmd_trb, (struct xhci_generic_trb *) event);
 
@@ -1427,6 +1452,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
if (cmd->cmd_list.next != >cmd_list) {
xhci->current_cmd = list_entry(cmd->cmd_list.next,
   struct xhci_command, cmd_list);
+   xhci->current_cmd_pending++;
mod_timer(>cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
} else if (xhci->current_cmd == cmd) {
xhci->current_cmd = NULL;
@@ -3927,6 +3953,7 @@ static int queue_command(struct xhci_hcd *xhci, struct 
xhci_command *cmd,
if (xhci->cmd_list.next == >cmd_list &&
!timer_pending(>cmd_timer)) {
xhci->current_cmd = cmd;
+   xhci->current_cmd_pending++;
mod_timer(>cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
}
 
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index