subject:"Re\: solo6010 modprobe lockup since e1ceb25a \(v4.3 regression\)"

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-10-24 Thread Andrey Utkin

On Mon, Oct 24, 2016 at 05:32:33PM -0200, Mauro Carvalho Chehab wrote:
> Em Wed, 28 Sep 2016 07:21:44 +0200
> khal...@piap.pl (Krzysztof Hałasa) escreveu:
> 
> > Andrey Utkin  writes:
> > 
> > > Lockup happens only on 6010. In provided log you can see that 6110
> > > passes just fine right before 6010. Also if 6010 PCI ID is removed from
> > > solo6x10 driver's devices list, the freeze doesn't happen.  
> > 
> > Probably explains why I don't see lockups :-)
> > 
> > I will have a look.
> 
> Any news on this? Should the patch be applied or not? If not, are there
> any other patch to fix this regression?

Actual patch is

Subject: [PATCH v2] media: solo6x10: fix lockup by avoiding delayed register 
write
Message-Id: <20161022153436.12076-1-andrey.ut...@corp.bluecherry.net>
Date: Sat, 22 Oct 2016 16:34:36 +0100

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-10-24 Thread Andrey Utkin

On Mon, Oct 24, 2016 at 05:32:33PM -0200, Mauro Carvalho Chehab wrote:
> Em Wed, 28 Sep 2016 07:21:44 +0200
> khal...@piap.pl (Krzysztof Hałasa) escreveu:
> 
> > Andrey Utkin  writes:
> > 
> > > Lockup happens only on 6010. In provided log you can see that 6110
> > > passes just fine right before 6010. Also if 6010 PCI ID is removed from
> > > solo6x10 driver's devices list, the freeze doesn't happen.  
> > 
> > Probably explains why I don't see lockups :-)
> > 
> > I will have a look.
> 
> Any news on this? Should the patch be applied or not? If not, are there
> any other patch to fix this regression?

Actual patch is

Subject: [PATCH v2] media: solo6x10: fix lockup by avoiding delayed register 
write
Message-Id: <20161022153436.12076-1-andrey.ut...@corp.bluecherry.net>
Date: Sat, 22 Oct 2016 16:34:36 +0100

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-10-24 Thread Mauro Carvalho Chehab

Em Wed, 28 Sep 2016 07:21:44 +0200
khal...@piap.pl (Krzysztof Hałasa) escreveu:

> Andrey Utkin  writes:
> 
> > Lockup happens only on 6010. In provided log you can see that 6110
> > passes just fine right before 6010. Also if 6010 PCI ID is removed from
> > solo6x10 driver's devices list, the freeze doesn't happen.  
> 
> Probably explains why I don't see lockups :-)
> 
> I will have a look.

Any news on this? Should the patch be applied or not? If not, are there
any other patch to fix this regression?

Thanks,
Mauro

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-10-24 Thread Mauro Carvalho Chehab

Em Wed, 28 Sep 2016 07:21:44 +0200
khal...@piap.pl (Krzysztof Hałasa) escreveu:

> Andrey Utkin  writes:
> 
> > Lockup happens only on 6010. In provided log you can see that 6110
> > passes just fine right before 6010. Also if 6010 PCI ID is removed from
> > solo6x10 driver's devices list, the freeze doesn't happen.  
> 
> Probably explains why I don't see lockups :-)
> 
> I will have a look.

Any news on this? Should the patch be applied or not? If not, are there
any other patch to fix this regression?

Thanks,
Mauro

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Krzysztof Hałasa

Andrey Utkin  writes:

> Lockup happens only on 6010. In provided log you can see that 6110
> passes just fine right before 6010. Also if 6010 PCI ID is removed from
> solo6x10 driver's devices list, the freeze doesn't happen.

Probably explains why I don't see lockups :-)

I will have a look.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Krzysztof Hałasa

Andrey Utkin  writes:

> Lockup happens only on 6010. In provided log you can see that 6110
> passes just fine right before 6010. Also if 6010 PCI ID is removed from
> solo6x10 driver's devices list, the freeze doesn't happen.

Probably explains why I don't see lockups :-)

I will have a look.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Andrey Utkin

On Tue, Sep 27, 2016 at 01:33:49PM +0200, Krzysztof Hałasa wrote:
> Thanks. I can see you have quite a set of video devices there.
> I will see what I can do with this.

Yeah, I have got also 4-chip tw5864 board here :)
Bluecherry decided to switch to it because they are available for retail
purchase, unlike solo* which must be ordered in large batch. It was huge
reverse-engineering effort to make it work, though, and there are still
issues with H.264 encoding functionality, and audio functionality is not
done yet.

> BTW does the lookup occur on SOLO6010, 6110, or both?

Lockup happens only on 6010. In provided log you can see that 6110
passes just fine right before 6010. Also if 6010 PCI ID is removed from
solo6x10 driver's devices list, the freeze doesn't happen.

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Andrey Utkin

On Tue, Sep 27, 2016 at 01:33:49PM +0200, Krzysztof Hałasa wrote:
> Thanks. I can see you have quite a set of video devices there.
> I will see what I can do with this.

Yeah, I have got also 4-chip tw5864 board here :)
Bluecherry decided to switch to it because they are available for retail
purchase, unlike solo* which must be ordered in large batch. It was huge
reverse-engineering effort to make it work, though, and there are still
issues with H.264 encoding functionality, and audio functionality is not
done yet.

> BTW does the lookup occur on SOLO6010, 6110, or both?

Lockup happens only on 6010. In provided log you can see that 6110
passes just fine right before 6010. Also if 6010 PCI ID is removed from
solo6x10 driver's devices list, the freeze doesn't happen.

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Krzysztof Hałasa

Andrey Utkin  writes:

>> Can you share some details about the machine you are experiencing the
>> problems on? CPU, chipset? I'd try to see if I can recreate the problem.
>
> See solo.txt.gz attached.

Thanks. I can see you have quite a set of video devices there.
I will see what I can do with this.

BTW does the lookup occur on SOLO6010, 6110, or both?
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Krzysztof Hałasa

Andrey Utkin  writes:

>> Can you share some details about the machine you are experiencing the
>> problems on? CPU, chipset? I'd try to see if I can recreate the problem.
>
> See solo.txt.gz attached.

Thanks. I can see you have quite a set of video devices there.
I will see what I can do with this.

BTW does the lookup occur on SOLO6010, 6110, or both?
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Andrey Utkin

On Tue, Sep 27, 2016 at 07:27:53AM +0200, Krzysztof Hałasa wrote:
> Andrey Utkin  writes:
> 
> >> Does (only) adding the
> >> 
> >>pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> >> 
> >> in solo_reg_write() help?
> >
> > Yes.
> > I have posted a patch with this change few days ago, I thought you have
> > noticed it.
> 
> Well, I think you haven't sent me a copy. Anyway, it would be great to
> determine where exactly writes need a flush. Adding it everywhere is a
> bit suboptimal, one would think.

Oh, I'm terribly sorry, I really meant to send you a copy.
Actual posting is:
lkml.kernel.org/r/20160922000331.4193-1-andrey.ut...@corp.bluecherry.net

> Can you share some details about the machine you are experiencing the
> problems on? CPU, chipset? I'd try to see if I can recreate the problem.

See solo.txt.gz attached.

> Alternatively, you could investigate yourself - at first you could put
> pci_read_config_word() at the end of subroutines (including return
> statements) using solo_reg_write(). And in that solo_p2m_dma_desc(),
> before wait_for_completion_timeout(). Then eliminate them using some
> sort of binary search to see which ones are required.

Sorry, but I've got no time for this long-lasting debug session right
now, and except for this issue, users enjoy their mainline kernel
driver. So I'd just fix that in mainline kernels as quickly as possible.
Now I'm even considering submitting that to longterm 4.4 branch.

solo.txt.gz
Description: Binary data

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-27 Thread Andrey Utkin

On Tue, Sep 27, 2016 at 07:27:53AM +0200, Krzysztof Hałasa wrote:
> Andrey Utkin  writes:
> 
> >> Does (only) adding the
> >> 
> >>pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> >> 
> >> in solo_reg_write() help?
> >
> > Yes.
> > I have posted a patch with this change few days ago, I thought you have
> > noticed it.
> 
> Well, I think you haven't sent me a copy. Anyway, it would be great to
> determine where exactly writes need a flush. Adding it everywhere is a
> bit suboptimal, one would think.

Oh, I'm terribly sorry, I really meant to send you a copy.
Actual posting is:
lkml.kernel.org/r/20160922000331.4193-1-andrey.ut...@corp.bluecherry.net

> Can you share some details about the machine you are experiencing the
> problems on? CPU, chipset? I'd try to see if I can recreate the problem.

See solo.txt.gz attached.

> Alternatively, you could investigate yourself - at first you could put
> pci_read_config_word() at the end of subroutines (including return
> statements) using solo_reg_write(). And in that solo_p2m_dma_desc(),
> before wait_for_completion_timeout(). Then eliminate them using some
> sort of binary search to see which ones are required.

Sorry, but I've got no time for this long-lasting debug session right
now, and except for this issue, users enjoy their mainline kernel
driver. So I'd just fix that in mainline kernels as quickly as possible.
Now I'm even considering submitting that to longterm 4.4 branch.

solo.txt.gz
Description: Binary data

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-26 Thread Krzysztof Hałasa

Andrey Utkin  writes:

>> Does (only) adding the
>> 
>>  pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
>> 
>> in solo_reg_write() help?
>
> Yes.
> I have posted a patch with this change few days ago, I thought you have
> noticed it.

Well, I think you haven't sent me a copy. Anyway, it would be great to
determine where exactly writes need a flush. Adding it everywhere is a
bit suboptimal, one would think.

Can you share some details about the machine you are experiencing the
problems on? CPU, chipset? I'd try to see if I can recreate the problem.

Alternatively, you could investigate yourself - at first you could put
pci_read_config_word() at the end of subroutines (including return
statements) using solo_reg_write(). And in that solo_p2m_dma_desc(),
before wait_for_completion_timeout(). Then eliminate them using some
sort of binary search to see which ones are required.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-26 Thread Krzysztof Hałasa

Andrey Utkin  writes:

>> Does (only) adding the
>> 
>>  pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
>> 
>> in solo_reg_write() help?
>
> Yes.
> I have posted a patch with this change few days ago, I thought you have
> noticed it.

Well, I think you haven't sent me a copy. Anyway, it would be great to
determine where exactly writes need a flush. Adding it everywhere is a
bit suboptimal, one would think.

Can you share some details about the machine you are experiencing the
problems on? CPU, chipset? I'd try to see if I can recreate the problem.

Alternatively, you could investigate yourself - at first you could put
pci_read_config_word() at the end of subroutines (including return
statements) using solo_reg_write(). And in that solo_p2m_dma_desc(),
before wait_for_completion_timeout(). Then eliminate them using some
sort of binary search to see which ones are required.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-26 Thread Andrey Utkin

On Mon, Sep 26, 2016 at 07:38:05AM +0200, Krzysztof Hałasa wrote:
> Andrey Utkin  writes:
> 
> > On Thu, Sep 22, 2016 at 10:51:37AM +0200, Krzysztof Hałasa wrote:
> >> I wonder if the following fixes the problem (completely untested).
> >
> > I have given this a run, and it still hangs.
> 
> Does (only) adding the
> 
>   pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> 
> in solo_reg_write() help?

Yes.
I have posted a patch with this change few days ago, I thought you have
noticed it.

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-26 Thread Andrey Utkin

On Mon, Sep 26, 2016 at 07:38:05AM +0200, Krzysztof Hałasa wrote:
> Andrey Utkin  writes:
> 
> > On Thu, Sep 22, 2016 at 10:51:37AM +0200, Krzysztof Hałasa wrote:
> >> I wonder if the following fixes the problem (completely untested).
> >
> > I have given this a run, and it still hangs.
> 
> Does (only) adding the
> 
>   pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> 
> in solo_reg_write() help?

Yes.
I have posted a patch with this change few days ago, I thought you have
noticed it.

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-25 Thread Krzysztof Hałasa

Andrey Utkin  writes:

> On Thu, Sep 22, 2016 at 10:51:37AM +0200, Krzysztof Hałasa wrote:
>> I wonder if the following fixes the problem (completely untested).
>
> I have given this a run, and it still hangs.

Does (only) adding the

pci_read_config_word(solo_dev->pdev, PCI_STATUS, );

in solo_reg_write() help?
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-25 Thread Krzysztof Hałasa

Andrey Utkin  writes:

> On Thu, Sep 22, 2016 at 10:51:37AM +0200, Krzysztof Hałasa wrote:
>> I wonder if the following fixes the problem (completely untested).
>
> I have given this a run, and it still hangs.

Does (only) adding the

pci_read_config_word(solo_dev->pdev, PCI_STATUS, );

in solo_reg_write() help?
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-22 Thread Andrey Utkin

On Thu, Sep 22, 2016 at 10:51:37AM +0200, Krzysztof Hałasa wrote:
> I wonder if the following fixes the problem (completely untested).

I have given this a run, and it still hangs.

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-22 Thread Andrey Utkin

On Thu, Sep 22, 2016 at 10:51:37AM +0200, Krzysztof Hałasa wrote:
> I wonder if the following fixes the problem (completely untested).

I have given this a run, and it still hangs.

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-22 Thread Krzysztof Hałasa

Andrey Utkin  writes:

> It happens in solo_disp_init at uploading default motion thresholds
> array.
>
> I've got a prints trace with solo6010-fix-lockup branch
> https://github.com/bluecherrydvr/linux/tree/solo6010-fix-lockup/drivers/media/pci/solo6x10
> the trace itself in jpg:
> https://decent.im:5281/upload/3793f393-e285-4514-83dd-bf08d1c8b4a2/e7ad898b-515b-4522-86a9-553daaeb0860.jpg

solo_motion_config() uses BM DMA and thus generates IRQ, this may be
indeed the ISR problem. BTW the IRQ debugging ("kernel hacking") should
catch it.
OTOH programming the DMA can be guilty as well.

I wonder if the following fixes the problem (completely untested).

diff --git a/drivers/media/pci/solo6x10/solo6x10-core.c 
b/drivers/media/pci/solo6x10/solo6x10-core.c
index f50d072..2d4900e 100644
--- a/drivers/media/pci/solo6x10/solo6x10-core.c
+++ b/drivers/media/pci/solo6x10/solo6x10-core.c
@@ -99,6 +99,7 @@ static irqreturn_t solo_isr(int irq, void *data)
 {
struct solo_dev *solo_dev = data;
u32 status;
+   u16 tmp;
int i;

status = solo_reg_read(solo_dev, SOLO_IRQ_STAT);
@@ -129,6 +130,7 @@ static irqreturn_t solo_isr(int irq, void *data)
if (status & SOLO_IRQ_G723)
solo_g723_isr(solo_dev);

+   pci_read_config_word(solo_dev->pdev, PCI_STATUS, ) // flush write 
to SOLO_IRQ_STAT
return IRQ_HANDLED;
 }

diff --git a/drivers/media/pci/solo6x10/solo6x10-p2m.c 
b/drivers/media/pci/solo6x10/solo6x10-p2m.c
index 07c4e07..8a51d45 100644
--- a/drivers/media/pci/solo6x10/solo6x10-p2m.c
+++ b/drivers/media/pci/solo6x10/solo6x10-p2m.c
@@ -70,6 +70,7 @@ int solo_p2m_dma_desc(struct solo_dev *solo_dev,
unsigned int config = 0;
int ret = 0;
int p2m_id = 0;
+   u16 tmp;

/* Get next ID. According to Softlogic, 6110 has problems on !=0 P2M */
if (solo_dev->type != SOLO_DEV_6110 && multi_p2m) {
@@ -111,6 +112,7 @@ int solo_p2m_dma_desc(struct solo_dev *solo_dev,
   desc[1].ctrl);
}

+   pci_read_config_word(solo_dev->pdev, PCI_STATUS, ); // flush writes
timeout = wait_for_completion_timeout(_dev->completion,
  solo_dev->p2m_jiffies);

> Indeed, targeted fixing would be more reasonable than making register
> r/w routines follow blocking fashion. But the driver is already complete
> and was known to be working, and I seems all places in code assume the
> blocking fashion of reg r/w, and changing that assumption may lead to
> covert bugs anywhere else, not just at probing, which may be hard to
> nail down.

The driver code doesn't have to assume anything about posted writes -
except at very specific places (as explained by Alan).

Normally, a CPU write to a register doesn't have to be flushed right
away. It would be much slower, especially if used extensively. Nobody
does anything alike since the end of the ISA bus.
The driver (and the card) can still see all operations in correct
order, in both cases.

The potential problem is a write being held in a buffer (and not making
it to the actual hardware). This may happen in ISR since the actual
write is deactivates the physical IRQ line. Otherwise the ISR terminates
and is immediately requested again - though this second call should
bring the IRQ down by reading the register (thus flushing the write
buffer) - so, while not very effective, it shouldn't lock up (but it's
a real bug worth fixing).

Also, I imagine a write to the DMA registers can be posted and the DMA
may not start in time. This shouldn't end in a lock up, either. Perhaps
a different bug is involved.

The other thing is BM DMA (card->RAM). All DMA transfers (initiated by
the card) are completed with an IRQ (either with success or failure).
This is potentially a problem as well, though it has nothing to do with
the patch in question. I guess the SOLO reads some descriptors or
something, and such writes are flushed this way.

> For now, I'll try setting pci_read_config_word() back instead of full
> revert. Does it need to be just in reg_write? No need for it in
> reg_read, right?

Sure, reg_read() doesn't write to the device.

It the patch doesn't fix the problem, what CPU and chipset are used by
the computer which exhibits the issue? Perhaps I have something similar
here and can reproduce it.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-22 Thread Krzysztof Hałasa

Andrey Utkin  writes:

> It happens in solo_disp_init at uploading default motion thresholds
> array.
>
> I've got a prints trace with solo6010-fix-lockup branch
> https://github.com/bluecherrydvr/linux/tree/solo6010-fix-lockup/drivers/media/pci/solo6x10
> the trace itself in jpg:
> https://decent.im:5281/upload/3793f393-e285-4514-83dd-bf08d1c8b4a2/e7ad898b-515b-4522-86a9-553daaeb0860.jpg

solo_motion_config() uses BM DMA and thus generates IRQ, this may be
indeed the ISR problem. BTW the IRQ debugging ("kernel hacking") should
catch it.
OTOH programming the DMA can be guilty as well.

I wonder if the following fixes the problem (completely untested).

diff --git a/drivers/media/pci/solo6x10/solo6x10-core.c 
b/drivers/media/pci/solo6x10/solo6x10-core.c
index f50d072..2d4900e 100644
--- a/drivers/media/pci/solo6x10/solo6x10-core.c
+++ b/drivers/media/pci/solo6x10/solo6x10-core.c
@@ -99,6 +99,7 @@ static irqreturn_t solo_isr(int irq, void *data)
 {
struct solo_dev *solo_dev = data;
u32 status;
+   u16 tmp;
int i;

status = solo_reg_read(solo_dev, SOLO_IRQ_STAT);
@@ -129,6 +130,7 @@ static irqreturn_t solo_isr(int irq, void *data)
if (status & SOLO_IRQ_G723)
solo_g723_isr(solo_dev);

+   pci_read_config_word(solo_dev->pdev, PCI_STATUS, ) // flush write 
to SOLO_IRQ_STAT
return IRQ_HANDLED;
 }

diff --git a/drivers/media/pci/solo6x10/solo6x10-p2m.c 
b/drivers/media/pci/solo6x10/solo6x10-p2m.c
index 07c4e07..8a51d45 100644
--- a/drivers/media/pci/solo6x10/solo6x10-p2m.c
+++ b/drivers/media/pci/solo6x10/solo6x10-p2m.c
@@ -70,6 +70,7 @@ int solo_p2m_dma_desc(struct solo_dev *solo_dev,
unsigned int config = 0;
int ret = 0;
int p2m_id = 0;
+   u16 tmp;

/* Get next ID. According to Softlogic, 6110 has problems on !=0 P2M */
if (solo_dev->type != SOLO_DEV_6110 && multi_p2m) {
@@ -111,6 +112,7 @@ int solo_p2m_dma_desc(struct solo_dev *solo_dev,
   desc[1].ctrl);
}

+   pci_read_config_word(solo_dev->pdev, PCI_STATUS, ); // flush writes
timeout = wait_for_completion_timeout(_dev->completion,
  solo_dev->p2m_jiffies);

> Indeed, targeted fixing would be more reasonable than making register
> r/w routines follow blocking fashion. But the driver is already complete
> and was known to be working, and I seems all places in code assume the
> blocking fashion of reg r/w, and changing that assumption may lead to
> covert bugs anywhere else, not just at probing, which may be hard to
> nail down.

The driver code doesn't have to assume anything about posted writes -
except at very specific places (as explained by Alan).

Normally, a CPU write to a register doesn't have to be flushed right
away. It would be much slower, especially if used extensively. Nobody
does anything alike since the end of the ISA bus.
The driver (and the card) can still see all operations in correct
order, in both cases.

The potential problem is a write being held in a buffer (and not making
it to the actual hardware). This may happen in ISR since the actual
write is deactivates the physical IRQ line. Otherwise the ISR terminates
and is immediately requested again - though this second call should
bring the IRQ down by reading the register (thus flushing the write
buffer) - so, while not very effective, it shouldn't lock up (but it's
a real bug worth fixing).

Also, I imagine a write to the DMA registers can be posted and the DMA
may not start in time. This shouldn't end in a lock up, either. Perhaps
a different bug is involved.

The other thing is BM DMA (card->RAM). All DMA transfers (initiated by
the card) are completed with an IRQ (either with success or failure).
This is potentially a problem as well, though it has nothing to do with
the patch in question. I guess the SOLO reads some descriptors or
something, and such writes are flushed this way.

> For now, I'll try setting pci_read_config_word() back instead of full
> revert. Does it need to be just in reg_write? No need for it in
> reg_read, right?

Sure, reg_read() doesn't write to the device.

It the patch doesn't fix the problem, what CPU and chipset are used by
the computer which exhibits the issue? Perhaps I have something similar
here and can reproduce it.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-21 Thread Andrey Utkin

On Wed, Sep 21, 2016 at 03:16:57PM +0200, Krzysztof Hałasa wrote:
> Hans Verkuil  writes:
> 
> > That was probably the reason for the pci_read_config_word in the reg_write
> > code. Try putting that back (and just that).
> 
> Yes. I guess a single pci_read_config_word() would suffice.
> 
> Though it would obviously be much better to identify the place in the
> driver which needs to have the write buffers flushed, and add a read()
> just there.
> 
> The interrupt handler maybe (e.g. just before the return IRQ_HANDLED)?
> 
> OTOH this may be some sort of timing problem, I mean the faster code may
> put too much stress on the SOLO chip.
> 
> Doesn't happen here so I can't test the cure.

It happens in solo_disp_init at uploading default motion thresholds
array.

I've got a prints trace with solo6010-fix-lockup branch
https://github.com/bluecherrydvr/linux/tree/solo6010-fix-lockup/drivers/media/pci/solo6x10
the trace itself in jpg:
https://decent.im:5281/upload/3793f393-e285-4514-83dd-bf08d1c8b4a2/e7ad898b-515b-4522-86a9-553daaeb0860.jpg

Indeed, targeted fixing would be more reasonable than making register
r/w routines follow blocking fashion. But the driver is already complete
and was known to be working, and I seems all places in code assume the
blocking fashion of reg r/w, and changing that assumption may lead to
covert bugs anywhere else, not just at probing, which may be hard to
nail down.

For now, I'll try setting pci_read_config_word() back instead of full
revert. Does it need to be just in reg_write? No need for it in
reg_read, right?

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-21 Thread Andrey Utkin

On Wed, Sep 21, 2016 at 03:16:57PM +0200, Krzysztof Hałasa wrote:
> Hans Verkuil  writes:
> 
> > That was probably the reason for the pci_read_config_word in the reg_write
> > code. Try putting that back (and just that).
> 
> Yes. I guess a single pci_read_config_word() would suffice.
> 
> Though it would obviously be much better to identify the place in the
> driver which needs to have the write buffers flushed, and add a read()
> just there.
> 
> The interrupt handler maybe (e.g. just before the return IRQ_HANDLED)?
> 
> OTOH this may be some sort of timing problem, I mean the faster code may
> put too much stress on the SOLO chip.
> 
> Doesn't happen here so I can't test the cure.

It happens in solo_disp_init at uploading default motion thresholds
array.

I've got a prints trace with solo6010-fix-lockup branch
https://github.com/bluecherrydvr/linux/tree/solo6010-fix-lockup/drivers/media/pci/solo6x10
the trace itself in jpg:
https://decent.im:5281/upload/3793f393-e285-4514-83dd-bf08d1c8b4a2/e7ad898b-515b-4522-86a9-553daaeb0860.jpg

Indeed, targeted fixing would be more reasonable than making register
r/w routines follow blocking fashion. But the driver is already complete
and was known to be working, and I seems all places in code assume the
blocking fashion of reg r/w, and changing that assumption may lead to
covert bugs anywhere else, not just at probing, which may be hard to
nail down.

For now, I'll try setting pci_read_config_word() back instead of full
revert. Does it need to be just in reg_write? No need for it in
reg_read, right?

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-21 Thread Krzysztof Hałasa

Hans Verkuil  writes:

> That was probably the reason for the pci_read_config_word in the reg_write
> code. Try putting that back (and just that).

Yes. I guess a single pci_read_config_word() would suffice.

Though it would obviously be much better to identify the place in the
driver which needs to have the write buffers flushed, and add a read()
just there.

The interrupt handler maybe (e.g. just before the return IRQ_HANDLED)?

OTOH this may be some sort of timing problem, I mean the faster code may
put too much stress on the SOLO chip.

Doesn't happen here so I can't test the cure.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-21 Thread Krzysztof Hałasa

Hans Verkuil  writes:

> That was probably the reason for the pci_read_config_word in the reg_write
> code. Try putting that back (and just that).

Yes. I guess a single pci_read_config_word() would suffice.

Though it would obviously be much better to identify the place in the
driver which needs to have the write buffers flushed, and add a read()
just there.

The interrupt handler maybe (e.g. just before the return IRQ_HANDLED)?

OTOH this may be some sort of timing problem, I mean the faster code may
put too much stress on the SOLO chip.

Doesn't happen here so I can't test the cure.
-- 
Krzysztof Halasa

Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread One Thousand Gnomes

On Thu, 15 Sep 2016 16:19:52 +0300
Andrey Utkin  wrote:

> On Thu, Sep 15, 2016 at 03:15:53PM +0200, Hans Verkuil wrote:
> > It could be related to the fact that a PCI write may be delayed unless
> > it is followed by a read (see also the comments in 
> > drivers/media/pci/ivtv/ivtv-driver.h).  
> 
> Thanks for explanation!
> 
> > That was probably the reason for the pci_read_config_word in the reg_write
> > code. Try putting that back (and just that).  
> 
> In this case reg_write becomes not atomic, thus spinlock would be
> required again here, right?

No - PCI writes are ordered but may not complete until the next read or
config access. That ordering isn't affected by things like spin locking
as it is a property of the bus.

Usually this only matters in obscure cases - timing is one of them, and
the other is when freeing memory because writes are posted both ways so
you need to write to stop the transfer, read to ensure the transfer has
completed and then free the target memory.

Alan

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread One Thousand Gnomes

On Thu, 15 Sep 2016 16:19:52 +0300
Andrey Utkin  wrote:

> On Thu, Sep 15, 2016 at 03:15:53PM +0200, Hans Verkuil wrote:
> > It could be related to the fact that a PCI write may be delayed unless
> > it is followed by a read (see also the comments in 
> > drivers/media/pci/ivtv/ivtv-driver.h).  
> 
> Thanks for explanation!
> 
> > That was probably the reason for the pci_read_config_word in the reg_write
> > code. Try putting that back (and just that).  
> 
> In this case reg_write becomes not atomic, thus spinlock would be
> required again here, right?

No - PCI writes are ordered but may not complete until the next read or
config access. That ordering isn't affected by things like spin locking
as it is a property of the bus.

Usually this only matters in obscure cases - timing is one of them, and
the other is when freeing memory because writes are posted both ways so
you need to write to stop the transfer, read to ensure the transfer has
completed and then free the target memory.

Alan

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread Hans Verkuil

On 09/15/2016 03:19 PM, Andrey Utkin wrote:
> On Thu, Sep 15, 2016 at 03:15:53PM +0200, Hans Verkuil wrote:
>> It could be related to the fact that a PCI write may be delayed unless
>> it is followed by a read (see also the comments in 
>> drivers/media/pci/ivtv/ivtv-driver.h).
> 
> Thanks for explanation!
> 
>> That was probably the reason for the pci_read_config_word in the reg_write
>> code. Try putting that back (and just that).
> 
> In this case reg_write becomes not atomic, thus spinlock would be
> required again here, right?

That depends on whether you can have calls to this function in parallel.

But I get the feeling that it might be easier to just revert the patch.

Regards,

Hans

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread Hans Verkuil

On 09/15/2016 03:19 PM, Andrey Utkin wrote:
> On Thu, Sep 15, 2016 at 03:15:53PM +0200, Hans Verkuil wrote:
>> It could be related to the fact that a PCI write may be delayed unless
>> it is followed by a read (see also the comments in 
>> drivers/media/pci/ivtv/ivtv-driver.h).
> 
> Thanks for explanation!
> 
>> That was probably the reason for the pci_read_config_word in the reg_write
>> code. Try putting that back (and just that).
> 
> In this case reg_write becomes not atomic, thus spinlock would be
> required again here, right?

That depends on whether you can have calls to this function in parallel.

But I get the feeling that it might be easier to just revert the patch.

Regards,

Hans

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread Andrey Utkin

On Thu, Sep 15, 2016 at 03:15:53PM +0200, Hans Verkuil wrote:
> It could be related to the fact that a PCI write may be delayed unless
> it is followed by a read (see also the comments in 
> drivers/media/pci/ivtv/ivtv-driver.h).

Thanks for explanation!

> That was probably the reason for the pci_read_config_word in the reg_write
> code. Try putting that back (and just that).

In this case reg_write becomes not atomic, thus spinlock would be
required again here, right?

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread Andrey Utkin

On Thu, Sep 15, 2016 at 03:15:53PM +0200, Hans Verkuil wrote:
> It could be related to the fact that a PCI write may be delayed unless
> it is followed by a read (see also the comments in 
> drivers/media/pci/ivtv/ivtv-driver.h).

Thanks for explanation!

> That was probably the reason for the pci_read_config_word in the reg_write
> code. Try putting that back (and just that).

In this case reg_write becomes not atomic, thus spinlock would be
required again here, right?

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread Hans Verkuil

It could be related to the fact that a PCI write may be delayed unless
it is followed by a read (see also the comments in 
drivers/media/pci/ivtv/ivtv-driver.h).

That was probably the reason for the pci_read_config_word in the reg_write
code. Try putting that back (and just that).

Regards,

Hans

On 09/15/2016 03:04 PM, Andrey Utkin wrote:
> Hi Krzysztof,
> 
> Me and one more solo6010 board user experience machine lockup when
> solo6x10 module is loaded on kernel series starting with 4.3 (despite
> solo6110 board probes just fine on all kernels). That is, 3.16, 3.18,
> 4.1 and 4.2 are tested and fine, and 4.3, 4.4, and others up to current
> linux-next are bad.
> So regression slipped in between 4.2 and 4.3. The diff between
> stable/linux-4.2.y and ...-4.3.y (which were tested) is not large, and
> my suspect fell on ripoff of register writing procedures complexity,
> which was introduced in e1ceb25a (see below). Reversion of that fixes
> lockup.  However, if, on top of reversion of e1ceb25a, i drop barrier
> stuff and pci_read_config... (see
> https://github.com/bluecherrydvr/linux/commit/d59aaf3), leaving the
> spinlock stuff, it locks up again.  This is a matter in which I'm not
> quite qualified, so I have no idea what that code copes with and why
> this workaround works for solo6010.  For now I think I'll tell the
> customer to use kernel with e1ceb25a reverted, but for upstream fix, I'm
> interested in more in-depth investigation. I'll be able to provide dmesg
> logs a bit later.
> 
> The breaking commit is quoted below.
> 
> commit e1ceb25a1569ce5b61b9c496dd32d038ba8cb936
> Author: Krzysztof Hałasa 
> Date:   Mon Jun 8 10:42:24 2015 -0300
> 
> [media] SOLO6x10: remove unneeded register locking and barriers
> 
> readl() and writel() are atomic, we don't need the spin lock.
> Also, flushing posted write buffer isn't required. Especially on read :-)
> 
> Signed-off-by: Krzysztof Ha?asa 
> Signed-off-by: Hans Verkuil 
> Signed-off-by: Mauro Carvalho Chehab 
> 
> diff --git a/drivers/media/pci/solo6x10/solo6x10-core.c 
> b/drivers/media/pci/solo6x10/solo6x10-core.c
> index 84627e6..9c948b1 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-core.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-core.c
> @@ -483,7 +483,6 @@ static int solo_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>   solo_dev->type = id->driver_data;
>   solo_dev->pdev = pdev;
> - spin_lock_init(_dev->reg_io_lock);
>   ret = v4l2_device_register(>dev, _dev->v4l2_dev);
>   if (ret)
>   goto fail_probe;
> diff --git a/drivers/media/pci/solo6x10/solo6x10.h 
> b/drivers/media/pci/solo6x10/solo6x10.h
> index 1ca54b0..27423d7 100644
> --- a/drivers/media/pci/solo6x10/solo6x10.h
> +++ b/drivers/media/pci/solo6x10/solo6x10.h
> @@ -199,7 +199,6 @@ struct solo_dev {
>   int nr_ext;
>   u32 irq_mask;
>   u32 motion_mask;
> - spinlock_t  reg_io_lock;
>   struct v4l2_device  v4l2_dev;
>  
>   /* tw28xx accounting */
> @@ -281,36 +280,13 @@ struct solo_dev {
>  
>  static inline u32 solo_reg_read(struct solo_dev *solo_dev, int reg)
>  {
> - unsigned long flags;
> - u32 ret;
> - u16 val;
> -
> - spin_lock_irqsave(_dev->reg_io_lock, flags);
> -
> - ret = readl(solo_dev->reg_base + reg);
> - rmb();
> - pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> - rmb();
> -
> - spin_unlock_irqrestore(_dev->reg_io_lock, flags);
> -
> - return ret;
> + return readl(solo_dev->reg_base + reg);
>  }
>  
>  static inline void solo_reg_write(struct solo_dev *solo_dev, int reg,
> u32 data)
>  {
> - unsigned long flags;
> - u16 val;
> -
> - spin_lock_irqsave(_dev->reg_io_lock, flags);
> -
>   writel(data, solo_dev->reg_base + reg);
> - wmb();
> - pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> - rmb();
> -
> - spin_unlock_irqrestore(_dev->reg_io_lock, flags);
>  }
>  
>  static inline void solo_irq_on(struct solo_dev *dev, u32 mask)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

2016-09-15 Thread Hans Verkuil

It could be related to the fact that a PCI write may be delayed unless
it is followed by a read (see also the comments in 
drivers/media/pci/ivtv/ivtv-driver.h).

That was probably the reason for the pci_read_config_word in the reg_write
code. Try putting that back (and just that).

Regards,

Hans

On 09/15/2016 03:04 PM, Andrey Utkin wrote:
> Hi Krzysztof,
> 
> Me and one more solo6010 board user experience machine lockup when
> solo6x10 module is loaded on kernel series starting with 4.3 (despite
> solo6110 board probes just fine on all kernels). That is, 3.16, 3.18,
> 4.1 and 4.2 are tested and fine, and 4.3, 4.4, and others up to current
> linux-next are bad.
> So regression slipped in between 4.2 and 4.3. The diff between
> stable/linux-4.2.y and ...-4.3.y (which were tested) is not large, and
> my suspect fell on ripoff of register writing procedures complexity,
> which was introduced in e1ceb25a (see below). Reversion of that fixes
> lockup.  However, if, on top of reversion of e1ceb25a, i drop barrier
> stuff and pci_read_config... (see
> https://github.com/bluecherrydvr/linux/commit/d59aaf3), leaving the
> spinlock stuff, it locks up again.  This is a matter in which I'm not
> quite qualified, so I have no idea what that code copes with and why
> this workaround works for solo6010.  For now I think I'll tell the
> customer to use kernel with e1ceb25a reverted, but for upstream fix, I'm
> interested in more in-depth investigation. I'll be able to provide dmesg
> logs a bit later.
> 
> The breaking commit is quoted below.
> 
> commit e1ceb25a1569ce5b61b9c496dd32d038ba8cb936
> Author: Krzysztof Hałasa 
> Date:   Mon Jun 8 10:42:24 2015 -0300
> 
> [media] SOLO6x10: remove unneeded register locking and barriers
> 
> readl() and writel() are atomic, we don't need the spin lock.
> Also, flushing posted write buffer isn't required. Especially on read :-)
> 
> Signed-off-by: Krzysztof Ha?asa 
> Signed-off-by: Hans Verkuil 
> Signed-off-by: Mauro Carvalho Chehab 
> 
> diff --git a/drivers/media/pci/solo6x10/solo6x10-core.c 
> b/drivers/media/pci/solo6x10/solo6x10-core.c
> index 84627e6..9c948b1 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-core.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-core.c
> @@ -483,7 +483,6 @@ static int solo_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>   solo_dev->type = id->driver_data;
>   solo_dev->pdev = pdev;
> - spin_lock_init(_dev->reg_io_lock);
>   ret = v4l2_device_register(>dev, _dev->v4l2_dev);
>   if (ret)
>   goto fail_probe;
> diff --git a/drivers/media/pci/solo6x10/solo6x10.h 
> b/drivers/media/pci/solo6x10/solo6x10.h
> index 1ca54b0..27423d7 100644
> --- a/drivers/media/pci/solo6x10/solo6x10.h
> +++ b/drivers/media/pci/solo6x10/solo6x10.h
> @@ -199,7 +199,6 @@ struct solo_dev {
>   int nr_ext;
>   u32 irq_mask;
>   u32 motion_mask;
> - spinlock_t  reg_io_lock;
>   struct v4l2_device  v4l2_dev;
>  
>   /* tw28xx accounting */
> @@ -281,36 +280,13 @@ struct solo_dev {
>  
>  static inline u32 solo_reg_read(struct solo_dev *solo_dev, int reg)
>  {
> - unsigned long flags;
> - u32 ret;
> - u16 val;
> -
> - spin_lock_irqsave(_dev->reg_io_lock, flags);
> -
> - ret = readl(solo_dev->reg_base + reg);
> - rmb();
> - pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> - rmb();
> -
> - spin_unlock_irqrestore(_dev->reg_io_lock, flags);
> -
> - return ret;
> + return readl(solo_dev->reg_base + reg);
>  }
>  
>  static inline void solo_reg_write(struct solo_dev *solo_dev, int reg,
> u32 data)
>  {
> - unsigned long flags;
> - u16 val;
> -
> - spin_lock_irqsave(_dev->reg_io_lock, flags);
> -
>   writel(data, solo_dev->reg_base + reg);
> - wmb();
> - pci_read_config_word(solo_dev->pdev, PCI_STATUS, );
> - rmb();
> -
> - spin_unlock_irqrestore(_dev->reg_io_lock, flags);
>  }
>  
>  static inline void solo_irq_on(struct solo_dev *dev, u32 mask)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

Re: solo6010 modprobe lockup since e1ceb25a (v4.3 regression)

34 matches

Site Navigation

Mail list logo

Footer information