Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-04-02 Thread Adrian Hunter
On 2/04/19 10:59 AM, Faiz Abbas wrote:
> Hi Adrian,
> 
> On 26/03/19 1:03 PM, Adrian Hunter wrote:
>> On 18/03/19 11:33 AM, Ulf Hansson wrote:
>>> + Arnd, Grygorii
>>>
>>> On Fri, 15 Feb 2019 at 20:17, Faiz Abbas  wrote:

 sdhci.c has two bottom halves implemented. A threaded_irq for handling
 card insert/remove operations and a tasklet for finishing mmc requests.
 With the addition of external dma support, dmaengine APIs need to
 terminate in non-atomic context before unmapping the dma buffers.

 To facilitate this, remove the finish_tasklet and move the call of
 sdhci_request_done() to the threaded_irq() callback. Also move the
 interrupt result variable to sdhci_host so it can be populated from
 anywhere inside the sdhci_irq handler.

 Signed-off-by: Faiz Abbas 
>>>
>>> Adrian, I think it makes sense to apply this patch, even if there is
>>> a very minor negative impact throughput-wise.
>>>
>>> To me, it doesn't seem like MMC/SD/SDIO has a good justification for
>>> using tasklets, besides the legacy point of view, of course.
>>> Instead, I think we should try to move all mmc hosts into using
>>> threaded IRQs.
>>>
>>> So, what do you think? Can you overlook the throughput drop, so that
>>> we can instead try to recover it on top with other optimizations?
>>
>> I tend to favour good results as expressed here:
>>
>>  https://lkml.org/lkml/2007/6/22/360
>>
>> So I want to do optimization first.
>>
>> But performance is not the only problem with the patch.  Give me a few
>> days and I will see what I can come up with.
>>
> 
> Gentle ping on this.

Working on it :-)


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-04-02 Thread Faiz Abbas
Hi Adrian,

On 26/03/19 1:03 PM, Adrian Hunter wrote:
> On 18/03/19 11:33 AM, Ulf Hansson wrote:
>> + Arnd, Grygorii
>>
>> On Fri, 15 Feb 2019 at 20:17, Faiz Abbas  wrote:
>>>
>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>> With the addition of external dma support, dmaengine APIs need to
>>> terminate in non-atomic context before unmapping the dma buffers.
>>>
>>> To facilitate this, remove the finish_tasklet and move the call of
>>> sdhci_request_done() to the threaded_irq() callback. Also move the
>>> interrupt result variable to sdhci_host so it can be populated from
>>> anywhere inside the sdhci_irq handler.
>>>
>>> Signed-off-by: Faiz Abbas 
>>
>> Adrian, I think it makes sense to apply this patch, even if there is
>> a very minor negative impact throughput-wise.
>>
>> To me, it doesn't seem like MMC/SD/SDIO has a good justification for
>> using tasklets, besides the legacy point of view, of course.
>> Instead, I think we should try to move all mmc hosts into using
>> threaded IRQs.
>>
>> So, what do you think? Can you overlook the throughput drop, so that
>> we can instead try to recover it on top with other optimizations?
> 
> I tend to favour good results as expressed here:
> 
>   https://lkml.org/lkml/2007/6/22/360
> 
> So I want to do optimization first.
> 
> But performance is not the only problem with the patch.  Give me a few
> days and I will see what I can come up with.
> 

Gentle ping on this.

Thanks,
Faiz


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-26 Thread Adrian Hunter
On 18/03/19 11:33 AM, Ulf Hansson wrote:
> + Arnd, Grygorii
> 
> On Fri, 15 Feb 2019 at 20:17, Faiz Abbas  wrote:
>>
>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>> card insert/remove operations and a tasklet for finishing mmc requests.
>> With the addition of external dma support, dmaengine APIs need to
>> terminate in non-atomic context before unmapping the dma buffers.
>>
>> To facilitate this, remove the finish_tasklet and move the call of
>> sdhci_request_done() to the threaded_irq() callback. Also move the
>> interrupt result variable to sdhci_host so it can be populated from
>> anywhere inside the sdhci_irq handler.
>>
>> Signed-off-by: Faiz Abbas 
> 
> Adrian, I think it makes sense to apply this patch, even if there is
> a very minor negative impact throughput-wise.
> 
> To me, it doesn't seem like MMC/SD/SDIO has a good justification for
> using tasklets, besides the legacy point of view, of course.
> Instead, I think we should try to move all mmc hosts into using
> threaded IRQs.
> 
> So, what do you think? Can you overlook the throughput drop, so that
> we can instead try to recover it on top with other optimizations?

I tend to favour good results as expressed here:

https://lkml.org/lkml/2007/6/22/360

So I want to do optimization first.

But performance is not the only problem with the patch.  Give me a few
days and I will see what I can come up with.

> 
> Kind regards
> Uffe
> 
>> ---
>>  drivers/mmc/host/sdhci.c | 43 +++-
>>  drivers/mmc/host/sdhci.h |  2 +-
>>  2 files changed, 17 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>> index eba9bcc92ad3..20ed09b896d7 100644
>> --- a/drivers/mmc/host/sdhci.c
>> +++ b/drivers/mmc/host/sdhci.c
>> @@ -1232,7 +1232,7 @@ static void __sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
>>
>> WARN_ON(i >= SDHCI_MAX_MRQS);
>>
>> -   tasklet_schedule(&host->finish_tasklet);
>> +   host->result = IRQ_WAKE_THREAD;
>>  }
>>
>>  static void sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
>> @@ -2705,14 +2705,6 @@ static bool sdhci_request_done(struct sdhci_host *host)
>> return false;
>>  }
>>
>> -static void sdhci_tasklet_finish(unsigned long param)
>> -{
>> -   struct sdhci_host *host = (struct sdhci_host *)param;
>> -
>> -   while (!sdhci_request_done(host))
>> -   ;
>> -}
>> -
>>  static void sdhci_timeout_timer(struct timer_list *t)
>>  {
>> struct sdhci_host *host;
>> @@ -2995,11 +2987,12 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>>
>>  static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>  {
>> -   irqreturn_t result = IRQ_NONE;
>> struct sdhci_host *host = dev_id;
>> u32 intmask, mask, unexpected = 0;
>> int max_loops = 16;
>>
>> +   host->result = IRQ_NONE;
>> +
>> spin_lock(&host->lock);
>>
>> if (host->runtime_suspended && !sdhci_sdio_irq_enabled(host)) {
>> @@ -3009,7 +3002,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>
>> intmask = sdhci_readl(host, SDHCI_INT_STATUS);
>> if (!intmask || intmask == 0xffffffff) {
>> -   result = IRQ_NONE;
>> +   host->result = IRQ_NONE;
>> goto out;
>> }
>>
>> @@ -3054,7 +3047,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>>
>> host->thread_isr |= intmask & (SDHCI_INT_CARD_INSERT |
>>    SDHCI_INT_CARD_REMOVE);
>> -   result = IRQ_WAKE_THREAD;
>> +   host->result = IRQ_WAKE_THREAD;
>> }
>>
>> if (intmask & SDHCI_INT_CMD_MASK)
>> @@ -3074,7 +3067,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>> (host->ier & SDHCI_INT_CARD_INT)) {
>> sdhci_enable_sdio_irq_nolock(host, false);
>> host->thread_isr |= SDHCI_INT_CARD_INT;
>> -   result = IRQ_WAKE_THREAD;
>> +   host->result = IRQ_WAKE_THREAD;
>> }
>>
>> intmask &= ~(SDHCI_INT_CARD_INSERT | SDHCI_INT_CARD_REMOVE |
>> @@ -3087,8 +3080,8 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>> sdhci_writel(host, intmask, SDHCI_INT_STATUS);
>> }
>>  cont:
>> -   if (result == IRQ_NONE)
>> -   result = IRQ_HANDLED;
>> +   if (host->result == IRQ_NONE)
>> +   host->result = IRQ_HANDLED;
>>
>> intmask = sdhci_readl(host, SDHCI_INT_STATUS);
>> } while (intmask && --max_loops);
>> @@ -3101,7 +3094,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>> sdhci_dumpregs(host);
>> }
>>
>> -   return result;
>> +   return host->result;
>>  }
>>

Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-18 Thread Ulf Hansson
+ Arnd, Grygorii

On Fri, 15 Feb 2019 at 20:17, Faiz Abbas  wrote:
>
> sdhci.c has two bottom halves implemented. A threaded_irq for handling
> card insert/remove operations and a tasklet for finishing mmc requests.
> With the addition of external dma support, dmaengine APIs need to
> terminate in non-atomic context before unmapping the dma buffers.
>
> To facilitate this, remove the finish_tasklet and move the call of
> sdhci_request_done() to the threaded_irq() callback. Also move the
> interrupt result variable to sdhci_host so it can be populated from
> anywhere inside the sdhci_irq handler.
>
> Signed-off-by: Faiz Abbas 

Adrian, I think it makes sense to apply this patch, even if there is
a very minor negative impact throughput-wise.

To me, it doesn't seem like MMC/SD/SDIO has a good justification for
using tasklets, besides the legacy point of view, of course.
Instead, I think we should try to move all mmc hosts into using
threaded IRQs.

So, what do you think? Can you overlook the throughput drop, so that
we can instead try to recover it on top with other optimizations?

Kind regards
Uffe

> ---
>  drivers/mmc/host/sdhci.c | 43 +++-
>  drivers/mmc/host/sdhci.h |  2 +-
>  2 files changed, 17 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index eba9bcc92ad3..20ed09b896d7 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -1232,7 +1232,7 @@ static void __sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
>
> WARN_ON(i >= SDHCI_MAX_MRQS);
>
> -   tasklet_schedule(&host->finish_tasklet);
> +   host->result = IRQ_WAKE_THREAD;
>  }
>
>  static void sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
> @@ -2705,14 +2705,6 @@ static bool sdhci_request_done(struct sdhci_host *host)
> return false;
>  }
>
> -static void sdhci_tasklet_finish(unsigned long param)
> -{
> -   struct sdhci_host *host = (struct sdhci_host *)param;
> -
> -   while (!sdhci_request_done(host))
> -   ;
> -}
> -
>  static void sdhci_timeout_timer(struct timer_list *t)
>  {
> struct sdhci_host *host;
> @@ -2995,11 +2987,12 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
>
>  static irqreturn_t sdhci_irq(int irq, void *dev_id)
>  {
> -   irqreturn_t result = IRQ_NONE;
> struct sdhci_host *host = dev_id;
> u32 intmask, mask, unexpected = 0;
> int max_loops = 16;
>
> +   host->result = IRQ_NONE;
> +
> spin_lock(&host->lock);
>
> if (host->runtime_suspended && !sdhci_sdio_irq_enabled(host)) {
> @@ -3009,7 +3002,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>
> intmask = sdhci_readl(host, SDHCI_INT_STATUS);
> if (!intmask || intmask == 0xffffffff) {
> -   result = IRQ_NONE;
> +   host->result = IRQ_NONE;
> goto out;
> }
>
> @@ -3054,7 +3047,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
>
> host->thread_isr |= intmask & (SDHCI_INT_CARD_INSERT |
>SDHCI_INT_CARD_REMOVE);
> -   result = IRQ_WAKE_THREAD;
> +   host->result = IRQ_WAKE_THREAD;
> }
>
> if (intmask & SDHCI_INT_CMD_MASK)
> @@ -3074,7 +3067,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
> (host->ier & SDHCI_INT_CARD_INT)) {
> sdhci_enable_sdio_irq_nolock(host, false);
> host->thread_isr |= SDHCI_INT_CARD_INT;
> -   result = IRQ_WAKE_THREAD;
> +   host->result = IRQ_WAKE_THREAD;
> }
>
> intmask &= ~(SDHCI_INT_CARD_INSERT | SDHCI_INT_CARD_REMOVE |
> @@ -3087,8 +3080,8 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
> sdhci_writel(host, intmask, SDHCI_INT_STATUS);
> }
>  cont:
> -   if (result == IRQ_NONE)
> -   result = IRQ_HANDLED;
> +   if (host->result == IRQ_NONE)
> +   host->result = IRQ_HANDLED;
>
> intmask = sdhci_readl(host, SDHCI_INT_STATUS);
> } while (intmask && --max_loops);
> @@ -3101,7 +3094,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
> sdhci_dumpregs(host);
> }
>
> -   return result;
> +   return host->result;
>  }
>
>  static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
> @@ -3131,6 +3124,12 @@ static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
> spin_unlock_irqrestore(&host->lock, flags);
> }
>
> +   if (!isr) {
> +   do {
> +   isr = !sdhci_request_done(host);
> +   } while (isr);
> +   }
> +
> return isr ? IRQ_HANDLED : IRQ_NONE;
>  }
>
> @@ -4212,12 

Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-14 Thread Faiz Abbas
Hi,

On 14/03/19 4:45 PM, Grygorii Strashko wrote:
> 
> 
> On 12.03.19 19:30, Rizvi, Mohammad Faiz Abbas wrote:
>> Hi Adrian,
>>
>> On 3/8/2019 7:06 PM, Adrian Hunter wrote:
>>> On 6/03/19 12:00 PM, Faiz Abbas wrote:
 Adrian,

 On 25/02/19 1:47 PM, Adrian Hunter wrote:
> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>> card insert/remove operations and a tasklet for finishing mmc requests.
>> With the addition of external dma support, dmaengine APIs need to
>> terminate in non-atomic context before unmapping the dma buffers.
>>
>> To facilitate this, remove the finish_tasklet and move the call of
>> sdhci_request_done() to the threaded_irq() callback.
>
> The irq thread has a higher latency than the tasklet.  The performance drop
> is measurable on the system I tried:
>
> Before:
>
> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
>
> After:
>
> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
>
> So we only want to resort to the thread for the error case.
>

 Sorry for the late response here, but this is about a 1.6% decrease. I
 tried out the same commands on a dra7xx board here (with about 5
 consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
 will also find a lesser percentage change if you average over multiple
 dd commands.

 Is this really so significant that we have to maintain two different
 bottom halves and keep having difficulty with adding APIs that can sleep?
>>>
>>> It is a performance drop that can be avoided, so it might as well be.
>>> Splitting the success path from the failure path is common for I/O drivers
>>> for similar reasons as here: the success path can be optimized whereas the
>>> failure path potentially needs to sleep.
>>
>> Understood. You wanna keep the success path as fast as possible.
> 
> Sorry, I've not completely followed this series, but I'd like to add my 5 cents.
> 
> It's a good thing to get rid of tasklets, since the RT Linux kernel is actively
> moving towards LKML, and there everything is handled in threads (even networking
> is trying to get rid of softirqs).
> 
> Performance is a pretty relative thing here - just try to run network traffic
> in parallel, and there is no control over it compared to threads: no way to
> assign a priority or pin to a CPU.

There is a 2007 LWN article (https://lwn.net/Articles/239633/) which
talks about removing tasklets altogether. I wonder what happened after that.

Thanks,
Faiz


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-14 Thread Arnd Bergmann
On Tue, Mar 12, 2019 at 6:32 PM Rizvi, Mohammad Faiz Abbas wrote:
> On 3/8/2019 7:06 PM, Adrian Hunter wrote:
> > On 6/03/19 12:00 PM, Faiz Abbas wrote:
> > It is a performance drop that can be avoided, so it might as well be.
> > Splitting the success path from the failure path is common for I/O drivers
> > for similar reasons as here: the success path can be optimized whereas the
> > failure path potentially needs to sleep.
>
> Understood. You wanna keep the success path as fast as possible.

I looked at the sdhci_request_done() function and found that almost all
of it is executed inside of a 'spin_lock_irqsave()', including the potentially
expensive dma_sync_single_for_cpu() calls.

This means there is very little benefit in using the tasklet in the first place;
it could just as well run in the hwirq context that triggered it.

The part that is actually run with interrupts enabled in the tasklet
is mmc_blk_cqe_req_done(), but other drivers also call this from
IRQ context.

   Arnd
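
A minimal sketch of the shape Arnd describes, assuming the locking structure
quoted above (pick_next_finished_mrq() is a made-up placeholder, not a real
sdhci helper): nearly all of the completion work already runs inside one
spin_lock_irqsave() region, and only the final mmc_request_done() call runs
with interrupts enabled again.

/*
 * Sketch only, for illustration of the locking structure discussed above.
 * pick_next_finished_mrq() is hypothetical; it stands in for the mrq
 * selection logic in the real driver.
 */
static bool example_request_done(struct sdhci_host *host)
{
	struct mmc_request *mrq;
	unsigned long flags;

	spin_lock_irqsave(&host->lock, flags);

	mrq = pick_next_finished_mrq(host);	/* hypothetical helper */
	if (!mrq) {
		spin_unlock_irqrestore(&host->lock, flags);
		return true;			/* nothing left to finish */
	}

	/*
	 * In the current code, dma_sync_single_for_cpu()/dma_unmap_sg() and
	 * the rest of the teardown happen here, still under the lock with
	 * interrupts off - which is Arnd's point.
	 */

	spin_unlock_irqrestore(&host->lock, flags);

	/* Only this part runs with interrupts enabled in the tasklet. */
	mmc_request_done(host->mmc, mrq);

	return false;				/* call again for the next mrq */
}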


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-14 Thread Grygorii Strashko



On 12.03.19 19:30, Rizvi, Mohammad Faiz Abbas wrote:
> Hi Adrian,
> 
> On 3/8/2019 7:06 PM, Adrian Hunter wrote:
>> On 6/03/19 12:00 PM, Faiz Abbas wrote:
>>> Adrian,
>>>
>>> On 25/02/19 1:47 PM, Adrian Hunter wrote:
 On 15/02/19 9:20 PM, Faiz Abbas wrote:
> sdhci.c has two bottom halves implemented. A threaded_irq for handling
> card insert/remove operations and a tasklet for finishing mmc requests.
> With the addition of external dma support, dmaengine APIs need to
> terminate in non-atomic context before unmapping the dma buffers.
>
> To facilitate this, remove the finish_tasklet and move the call of
> sdhci_request_done() to the threaded_irq() callback.

 The irq thread has a higher latency than the tasklet.  The performance drop
 is measurable on the system I tried:

 Before:

 # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
 1+0 records in
 1+0 records out
 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s

 After:

 # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
 1+0 records in
 1+0 records out
 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s

 So we only want to resort to the thread for the error case.

>>>
>>> Sorry for the late response here, but this is about a 1.6% decrease. I
>>> tried out the same commands on a dra7xx board here (with about 5
>>> consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
>>> will also find a lesser percentage change if you average over multiple
>>> dd commands.
>>>
>>> Is this really so significant that we have to maintain two different
>>> bottom halves and keep having difficulty with adding APIs that can sleep?
>>
>> It is a performance drop that can be avoided, so it might as well be.
>> Splitting the success path from the failure path is common for I/O drivers
>> for similar reasons as here: the success path can be optimized whereas the
>> failure path potentially needs to sleep.
> 
> Understood. You wanna keep the success path as fast as possible.

Sorry, I've not completely followed this series, but I'd like to add my 5 cents.

It's a good thing to get rid of tasklets, since the RT Linux kernel is actively
moving towards LKML, and there everything is handled in threads (even networking
is trying to get rid of softirqs).

Performance is a pretty relative thing here - just try to run network traffic
in parallel, and there is no control over it compared to threads: no way to
assign a priority or pin to a CPU.


-- 
Best regards,
grygorii


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-12 Thread Rizvi, Mohammad Faiz Abbas
Hi Adrian,

On 3/8/2019 7:06 PM, Adrian Hunter wrote:
> On 6/03/19 12:00 PM, Faiz Abbas wrote:
>> Adrian,
>>
>> On 25/02/19 1:47 PM, Adrian Hunter wrote:
>>> On 15/02/19 9:20 PM, Faiz Abbas wrote:
 sdhci.c has two bottom halves implemented. A threaded_irq for handling
 card insert/remove operations and a tasklet for finishing mmc requests.
 With the addition of external dma support, dmaengine APIs need to
 terminate in non-atomic context before unmapping the dma buffers.

 To facilitate this, remove the finish_tasklet and move the call of
 sdhci_request_done() to the threaded_irq() callback.
>>>
>>> The irq thread has a higher latency than the tasklet.  The performance drop
>>> is measurable on the system I tried:
>>>
>>> Before:
>>>
>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>> 1+0 records in
>>> 1+0 records out
>>> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
>>>
>>> After:
>>>
>>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>>> 1+0 records in
>>> 1+0 records out
>>> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
>>>
>>> So we only want to resort to the thread for the error case.
>>>
>>
>> Sorry for the late response here, but this is about a 1.6% decrease. I
>> tried out the same commands on a dra7xx board here (with about 5
>> consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
>> will also find a lesser percentage change if you average over multiple
>> dd commands.
>>
>> Is this really so significant that we have to maintain two different
>> bottom halves and keep having difficulty with adding APIs that can sleep?
> 
> It is a performance drop that can be avoided, so it might as well be.
> Splitting the success path from the failure path is common for I/O drivers
> for similar reasons as here: the success path can be optimized whereas the
> failure path potentially needs to sleep.

Understood. You wanna keep the success path as fast as possible.
> 
>>
>> Also I am not sure how to implement only the error handling part in the
>> threaded_irq. We need to enter sdhci_request_done() and get the current
>> mrq before we can check for error conditions like I've done in patch 2:
>>
>> /* Terminate and synchronize dma in case of an error */
>> if (data && (mrq->cmd->error || data->error) &&
>> host->use_external_dma) {
>>  struct dma_chan *chan = sdhci_external_dma_channel(host, data);
>>  dmaengine_terminate_sync(chan);
>> }
>>
>> On a related note, do we really need to protect everything in
>> sdhci_request_done() with spinlocks?
>>  In patch 2 I have only removed lock
>> for the terminate_sync() parts that I added but the whole
>> dma_unmap/dma_sync parts should be left unprotected IMO.
> 
> As it is written, synchronization is needed to stop the same mrq being
> finished twice.
> 
> I suggest doing the dmaengine_terminate_sync() before calling
> sdhci_request_done().  Perhaps you could record the channel that needs to be
> sync'd and then do:
> 
>   struct dma_chan *chan = READ_ONCE(host->sync_chan);
> 
>   if (chan) {
>   dmaengine_terminate_sync(chan);
>   host->sync_chan = NULL;
>   }
> 

Will try this out and send next version.

Thanks,
Faiz


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-08 Thread Adrian Hunter
On 6/03/19 12:00 PM, Faiz Abbas wrote:
> Adrian,
> 
> On 25/02/19 1:47 PM, Adrian Hunter wrote:
>> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>>> card insert/remove operations and a tasklet for finishing mmc requests.
>>> With the addition of external dma support, dmaengine APIs need to
>>> terminate in non-atomic context before unmapping the dma buffers.
>>>
>>> To facilitate this, remove the finish_tasklet and move the call of
>>> sdhci_request_done() to the threaded_irq() callback.
>>
>> The irq thread has a higher latency than the tasklet.  The performance drop
>> is measurable on the system I tried:
>>
>> Before:
>>
>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>> 1+0 records in
>> 1+0 records out
>> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
>>
>> After:
>>
>> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
>> 1+0 records in
>> 1+0 records out
>> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
>>
>> So we only want to resort to the thread for the error case.
>>
> 
> Sorry for the late response here, but this is about a 1.6% decrease. I
> tried out the same commands on a dra7xx board here (with about 5
> consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
> will also find a lesser percentage change if you average over multiple
> dd commands.
> 
> Is this really so significant that we have to maintain two different
> bottom halves and keep having difficulty with adding APIs that can sleep?

It is a performance drop that can be avoided, so it might as well be.
Splitting the success path from the failure path is common for I/O drivers
for similar reasons as here: the success path can be optimized whereas the
failure path potentially needs to sleep.

> 
> Also I am not sure how to implement only the error handling part in the
> threaded_irq. We need to enter sdhci_request_done() and get the current
> mrq before we can check for error conditions like I've done in patch 2:
> 
> /* Terminate and synchronize dma in case of an error */
> if (data && (mrq->cmd->error || data->error) &&
> host->use_external_dma) {
>   struct dma_chan *chan = sdhci_external_dma_channel(host, data);
>   dmaengine_terminate_sync(chan);
> }
> 
> On a related note, do we really need to protect everything in
> sdhci_request_done() with spinlocks?
>  In patch 2 I have only removed lock
> for the terminate_sync() parts that I added but the whole
> dma_unmap/dma_sync parts should be left unprotected IMO.

As it is written, synchronization is needed to stop the same mrq being
finished twice.

I suggest doing the dmaengine_terminate_sync() before calling
sdhci_request_done().  Perhaps you could record the channel that needs to be
sync'd and then do:

struct dma_chan *chan = READ_ONCE(host->sync_chan);

if (chan) {
dmaengine_terminate_sync(chan);
host->sync_chan = NULL;
}
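
For completeness, the producer side of that suggestion might look roughly like
the sketch below. sync_chan is the proposed (not yet existing) sdhci_host member
from the snippet above, and host->result is the field added by patch 1; this is
only an illustration, not code from the series.

/* Hypothetical error-path helper: remember which channel still needs a
 * dmaengine_terminate_sync(), so the IRQ thread can do it before it calls
 * sdhci_request_done().
 */
static void example_mark_chan_for_sync(struct sdhci_host *host,
				       struct dma_chan *chan)
{
	/* Called where the error is detected, still in atomic context. */
	WRITE_ONCE(host->sync_chan, chan);
	host->result = IRQ_WAKE_THREAD;		/* make sure the thread runs */
}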


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-03-06 Thread Faiz Abbas
Adrian,

On 25/02/19 1:47 PM, Adrian Hunter wrote:
> On 15/02/19 9:20 PM, Faiz Abbas wrote:
>> sdhci.c has two bottom halves implemented. A threaded_irq for handling
>> card insert/remove operations and a tasklet for finishing mmc requests.
>> With the addition of external dma support, dmaengine APIs need to
>> terminate in non-atomic context before unmapping the dma buffers.
>>
>> To facilitate this, remove the finish_tasklet and move the call of
>> sdhci_request_done() to the threaded_irq() callback.
> 
> The irq thread has a higher latency than the tasklet.  The performance drop
> is measurable on the system I tried:
> 
> Before:
> 
> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s
> 
> After:
> 
> # dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
> 1+0 records in
> 1+0 records out
> 1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s
> 
> So we only want to resort to the thread for the error case.
> 

Sorry for the late response here, but this is about a 1.6% decrease. I
tried out the same commands on a dra7xx board here (with about 5
consecutive dd of 1GB) and the average decrease was 0.3%. I believe you
will also find a lesser percentage change if you average over multiple
dd commands.

Is this really so significant that we have to maintain two different
bottom halves and keep having difficulty with adding APIs that can sleep?

Also I am not sure how to implement only the error handling part in the
threaded_irq. We need to enter sdhci_request_done() and get the current
mrq before we can check for error conditions like I've done in patch 2:

/* Terminate and synchronize dma in case of an error */
if (data && (mrq->cmd->error || data->error) &&
host->use_external_dma) {
struct dma_chan *chan = sdhci_external_dma_channel(host, data);
dmaengine_terminate_sync(chan);
}

On a related note, do we really need to protect everything in
sdhci_request_done() with spinlocks? In patch 2 I have only removed lock
for the terminate_sync() parts that I added but the whole
dma_unmap/dma_sync parts should be left unprotected IMO.

Thanks,
Faiz
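
One way to look at the constraint being discussed: dmaengine_terminate_sync()
may sleep and so cannot run under host->lock or in a tasklet, while the
dmaengine API also offers a split form - dmaengine_terminate_async() plus
dmaengine_synchronize() - that separates the atomic and sleeping halves. The
sketch below only illustrates that split; it is not what patch 2 does.

/* Illustration of the two-step dmaengine teardown, not patch 2 itself. */
static void example_teardown_dma(struct sdhci_host *host, struct dma_chan *chan)
{
	unsigned long flags;

	spin_lock_irqsave(&host->lock, flags);
	dmaengine_terminate_async(chan);	/* atomic-safe: stop the channel */
	spin_unlock_irqrestore(&host->lock, flags);

	dmaengine_synchronize(chan);		/* may sleep: wait for callbacks */

	/* dma_unmap_sg()/dma_sync_*() of the request's buffers can follow here. */
}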


Re: [PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-02-25 Thread Adrian Hunter
On 15/02/19 9:20 PM, Faiz Abbas wrote:
> sdhci.c has two bottom halves implemented. A threaded_irq for handling
> card insert/remove operations and a tasklet for finishing mmc requests.
> With the addition of external dma support, dmaengine APIs need to
> terminate in non-atomic context before unmapping the dma buffers.
> 
> To facilitate this, remove the finish_tasklet and move the call of
> sdhci_request_done() to the threaded_irq() callback.

The irq thread has a higher latency than the tasklet.  The performance drop
is measurable on the system I tried:

Before:

# dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 4.44502 s, 242 MB/s

After:

# dd if=/dev/mmcblk1 of=/dev/null bs=1G count=1 &
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 4.50898 s, 238 MB/s

So we only want to resort to the thread for the error case.
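
(For reference, those dd figures work out to (242 - 238) / 242 ≈ 1.65% lower
throughput, or (4.50898 - 4.44502) / 4.44502 ≈ 1.4% more elapsed time - the
"about 1.6%" decrease Faiz quotes in his reply.)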


[PATCH v2 1/8] mmc: sdhci: Get rid of finish_tasklet

2019-02-15 Thread Faiz Abbas
sdhci.c has two bottom halves implemented. A threaded_irq for handling
card insert/remove operations and a tasklet for finishing mmc requests.
With the addition of external dma support, dmaengine APIs need to
terminate in non-atomic context before unmapping the dma buffers.

To facilitate this, remove the finish_tasklet and move the call of
sdhci_request_done() to the threaded_irq() callback. Also move the
interrupt result variable to sdhci_host so it can be populated from
anywhere inside the sdhci_irq handler.

Signed-off-by: Faiz Abbas 
---
 drivers/mmc/host/sdhci.c | 43 +++-
 drivers/mmc/host/sdhci.h |  2 +-
 2 files changed, 17 insertions(+), 28 deletions(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index eba9bcc92ad3..20ed09b896d7 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -1232,7 +1232,7 @@ static void __sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
 
WARN_ON(i >= SDHCI_MAX_MRQS);
 
-   tasklet_schedule(&host->finish_tasklet);
+   host->result = IRQ_WAKE_THREAD;
 }
 
 static void sdhci_finish_mrq(struct sdhci_host *host, struct mmc_request *mrq)
@@ -2705,14 +2705,6 @@ static bool sdhci_request_done(struct sdhci_host *host)
return false;
 }
 
-static void sdhci_tasklet_finish(unsigned long param)
-{
-   struct sdhci_host *host = (struct sdhci_host *)param;
-
-   while (!sdhci_request_done(host))
-   ;
-}
-
 static void sdhci_timeout_timer(struct timer_list *t)
 {
struct sdhci_host *host;
@@ -2995,11 +2987,12 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
 
 static irqreturn_t sdhci_irq(int irq, void *dev_id)
 {
-   irqreturn_t result = IRQ_NONE;
struct sdhci_host *host = dev_id;
u32 intmask, mask, unexpected = 0;
int max_loops = 16;
 
+   host->result = IRQ_NONE;
+
spin_lock(&host->lock);
 
if (host->runtime_suspended && !sdhci_sdio_irq_enabled(host)) {
@@ -3009,7 +3002,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
 
intmask = sdhci_readl(host, SDHCI_INT_STATUS);
if (!intmask || intmask == 0xffffffff) {
-   result = IRQ_NONE;
+   host->result = IRQ_NONE;
goto out;
}
 
@@ -3054,7 +3047,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
 
host->thread_isr |= intmask & (SDHCI_INT_CARD_INSERT |
   SDHCI_INT_CARD_REMOVE);
-   result = IRQ_WAKE_THREAD;
+   host->result = IRQ_WAKE_THREAD;
}
 
if (intmask & SDHCI_INT_CMD_MASK)
@@ -3074,7 +3067,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
(host->ier & SDHCI_INT_CARD_INT)) {
sdhci_enable_sdio_irq_nolock(host, false);
host->thread_isr |= SDHCI_INT_CARD_INT;
-   result = IRQ_WAKE_THREAD;
+   host->result = IRQ_WAKE_THREAD;
}
 
intmask &= ~(SDHCI_INT_CARD_INSERT | SDHCI_INT_CARD_REMOVE |
@@ -3087,8 +3080,8 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
sdhci_writel(host, intmask, SDHCI_INT_STATUS);
}
 cont:
-   if (result == IRQ_NONE)
-   result = IRQ_HANDLED;
+   if (host->result == IRQ_NONE)
+   host->result = IRQ_HANDLED;
 
intmask = sdhci_readl(host, SDHCI_INT_STATUS);
} while (intmask && --max_loops);
@@ -3101,7 +3094,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
sdhci_dumpregs(host);
}
 
-   return result;
+   return host->result;
 }
 
 static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
@@ -3131,6 +3124,12 @@ static irqreturn_t sdhci_thread_irq(int irq, void *dev_id)
spin_unlock_irqrestore(&host->lock, flags);
}
 
+   if (!isr) {
+   do {
+   isr = !sdhci_request_done(host);
+   } while (isr);
+   }
+
return isr ? IRQ_HANDLED : IRQ_NONE;
 }
 
@@ -4212,12 +4211,6 @@ int __sdhci_add_host(struct sdhci_host *host)
struct mmc_host *mmc = host->mmc;
int ret;
 
-   /*
-* Init tasklets.
-*/
-   tasklet_init(&host->finish_tasklet,
-   sdhci_tasklet_finish, (unsigned long)host);
-
timer_setup(&host->timer, sdhci_timeout_timer, 0);
timer_setup(&host->data_timer, sdhci_timeout_data_timer, 0);
 
@@ -4230,7 +4223,7 @@ int __sdhci_add_host(struct sdhci_host *host)
if (ret) {
pr_err("%s: Failed to request IRQ %d: %d\n",
   mmc_hostname(mmc), host->irq, ret);
-   goto untasklet;
+   return ret;
}
 
ret = sdhci_led_register(host);
@@ -4263,8 +4256,6 @@ int __sdhci_add_host(struct sdhci_host
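
For readers following the diff, the two halves are tied together by the
existing IRQ registration in __sdhci_add_host() (the pr_err() above is its
failure branch). Roughly, and paraphrasing rather than quoting the file:

/* The hard-IRQ handler (sdhci_irq) returns IRQ_WAKE_THREAD whenever the
 * slow work - previously the finish_tasklet, now sdhci_request_done() -
 * should run in the threaded handler (sdhci_thread_irq).
 */
ret = request_threaded_irq(host->irq, sdhci_irq, sdhci_thread_irq,
			   IRQF_SHARED, mmc_hostname(mmc), host);
if (ret) {
	pr_err("%s: Failed to request IRQ %d: %d\n",
	       mmc_hostname(mmc), host->irq, ret);
	return ret;
}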