Re: [Bug] Reproducible data corruption on i5-3340M: Please continue your great work! :-)

2013-08-16 Thread Tony Luck
On Thu, Aug 15, 2013 at 5:33 PM, Linus Torvalds wrote:
> I'll probably delay committing it until tomorrow, in the hope that
> somebody using one of the other architectures will at least ack that
> it compiles. I'm re-attaching the patch (with the two "logn" -> "long"
> fixes) just to encourage that. Hint hint, everybody..

I see I'm too late to supply an Ack for the commit, because it is already in.
But just for completeness' sake - all my ia64 configs build OK, and the couple
that get boot tested still appear to be working too.

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bug] Reproducible data corruption on i5-3340M: Please continue your great work! :-)

2013-08-16 Thread Peter Zijlstra
On Fri, Aug 16, 2013 at 01:00:31PM +0200, Michal Hocko wrote:

> I was thinking about teaching __tlb_remove_page to update the range
> automatically from the given address.

The mmu_gather unification stuff I had did it differently still:

  http://permalink.gmane.org/gmane.linux.kernel.mm/81287

That said, I do like Linus' approach. The only thing I haven't
considered is whether it does the right thing for tile and mips-r4k, which
have 'special' rules for VM_HUGETLB. Although I don't think it changes
those archs enough to break anything.

I should find some time to finally finish that series :/


Re: [Bug] Reproducible data corruption on i5-3340M: Please continue your great work! :-)

2013-08-16 Thread Michal Hocko
On Thu 15-08-13 17:33:28, Linus Torvalds wrote:
> On Thu, Aug 15, 2013 at 4:05 PM, Ben Tebulin wrote:
> >
> >> Ben, please test. I'm worried that the problem you see is something
> >> even more fundamentally wrong with the whole "oops, must flush in the
> >> middle" logic, but I'm _hoping_ this fixes it.
> >
> > It's gone.
> >
> > Really!
> >
> > I git-fsck'ed successfully around 30 times in a row.
> > And even all the other things still seem to work ;-)
> 
> Goodie. I think I'm just going to commit it (with the speling fixes
> for other architectures) asap. It's bigger than I'd like, but it's a
> lot simpler than the alternatives of trying to figure out exactly
> which call chain got things wrong with the previous confusing model.

I was thinking about teaching __tlb_remove_page to update the range
automatically from the given address.
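
A minimal sketch of that idea, for illustration only: this is not code
from the patch or from the kernel. The names gather_sketch, gather_init,
gather_remove_page and gather_has_range, and the 4096-byte page size, are
assumptions made up for this example. Each recorded page widens a
[start, end) range, so callers never have to set the range up by hand.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size, illustration only */

/* Hypothetical stand-in for the range fields of a TLB gather structure. */
struct gather_sketch {
    unsigned long start; /* lowest address recorded so far */
    unsigned long end;   /* one past the highest byte recorded so far */
};

/* Begin with an empty range: start > end means "nothing to flush". */
static void gather_init(struct gather_sketch *g)
{
    g->start = ULONG_MAX;
    g->end = 0;
}

/* Record a page and widen the range so that it covers that page. */
static void gather_remove_page(struct gather_sketch *g, unsigned long addr)
{
    /* Guard against overflow when computing the end of the page. */
    assert(addr <= ULONG_MAX - SKETCH_PAGE_SIZE);

    if (addr < g->start)
        g->start = addr;
    if (addr + SKETCH_PAGE_SIZE > g->end)
        g->end = addr + SKETCH_PAGE_SIZE;
}

/* Nonzero once at least one page has been recorded. */
static int gather_has_range(const struct gather_sketch *g)
{
    return g->start < g->end;
}
```

Because the range starts out empty and only ever grows, a flush in this
sketch can check gather_has_range() first and skip the work when nothing
was recorded.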

But your patch looks good to me as well.

Feel free to add
Reviewed-by: Michal Hocko mho...@suse.cz

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [Bug] Reproducible data corruption on i5-3340M: Please continue your great work! :-)

2013-08-16 Thread richard -rw- weinberger
On Fri, Aug 16, 2013 at 2:33 AM, Linus Torvalds wrote:
> I'll probably delay committing it until tomorrow, in the hope that
> somebody using one of the other architectures will at least ack that
> it compiles. I'm re-attaching the patch (with the two "logn" -> "long"
> fixes) just to encourage that. Hint hint, everybody..

/me tested arch/um, so far everything looks good. :-)

-- 
Thanks,
//richard


Re: [Bug] Reproducible data corruption on i5-3340M: Please continue your great work! :-)

2013-08-16 Thread Stephen Rothwell
Hi Linus,

On Thu, 15 Aug 2013 17:33:28 -0700 Linus Torvalds wrote:
>
> I'll probably delay committing it until tomorrow, in the hope that
> somebody using one of the other architectures will at least ack that
> it compiles. I'm re-attaching the patch (with the two "logn" -> "long"
> fixes) just to encourage that. Hint hint, everybody..

I built all the (major) PowerPC defconfigs, allnoconfig and allmodconfig
and they built as well as they did before this patch (i.e. some failed
for other reasons).  I have not done any boot testing on PowerPC. 

-- 
Cheers,
Stephen Rothwell s...@canb.auug.org.au


Re: [Bug] Reproducible data corruption on i5-3340M: Please continue your great work! :-)

2013-08-15 Thread Linus Torvalds
On Thu, Aug 15, 2013 at 4:05 PM, Ben Tebulin wrote:
>
>> Ben, please test. I'm worried that the problem you see is something
>> even more fundamentally wrong with the whole "oops, must flush in the
>> middle" logic, but I'm _hoping_ this fixes it.
>
> It's gone.
>
> Really!
>
> I git-fsck'ed successfully around 30 times in a row.
> And even all the other things still seem to work ;-)

Goodie. I think I'm just going to commit it (with the speling fixes
for other architectures) asap. It's bigger than I'd like, but it's a
lot simpler than the alternatives of trying to figure out exactly
which call chain got things wrong with the previous confusing model.

Thanks for bisecting and testing.

> Honestly I have to confess that I'm deeply impressed how this finally
> worked out: I just threw a particular, innocent-looking commit hash and
> nothing more into the round.

Being able to bisect the exact commit that introduced the bad behavior
is a *very* powerful debugging aid, and in fact the smaller and more
innocent-looking the bisected commit is, the easier it generally is to
then say "ok, it must be related to this one particular issue". So the
bisection really pinpointed the area. After that it was just a matter
of reading the source code and seeing what looked suspicious.

I'll probably delay committing it until tomorrow, in the hope that
somebody using one of the other architectures will at least ack that
it compiles. I'm re-attaching the patch (with the two "logn" -> "long"
fixes) just to encourage that. Hint hint, everybody..

   Linus


Re: [Bug] Reproducible data corruption on i5-3340M: Please continue your great work! :-)

2013-08-15 Thread Ben Tebulin
On 15.08.2013 20:00, Linus Torvalds wrote:
> Ok, so I've slept on it, and here's my current thinking.
> [...]  

Many thoughts which, as a user, I am unable to follow  ;-)

> This patch tries to fix the interface instead of trying to patch up
> the individual places that *should* set the range some particular way
> [...]
> This patch is against current git, so to apply you need to have
> that commit e6c495a96ce0 cherry-picked to older kernels first.

I took a shot based on 3.9.11 + e6c495a96ce0. The reason I don't
simply use the current git master is that, for some reason, my
linux-image-*.deb packages have grown to 750MB and larger since 3.10.y,
and I have no clue at all why or what to do about it.

The patch failed to apply. Due to my outstanding incompetence I resorted
to applying it onto master, cherry-picking that back and trying to resolve
the remaining conflicts correctly.

>  - I have no idea whether this will fix the problem Ben sees, but I
> feel happier about the code, because now any place that forgets to set
> up start/end will work just fine, because they are always valid. 

Simpler code? Resilient API? Happy people? Great!

> Ben, please test. I'm worried that the problem you see is something 
> even more fundamentally wrong with the whole "oops, must flush in the
> middle" logic, but I'm _hoping_ this fixes it.

It's gone.

Really!

I git-fsck'ed successfully around 30 times in a row.
And even all the other things still seem to work ;-)

Honestly I have to confess that I'm deeply impressed how this finally
worked out: I just threw a particular, innocent-looking commit hash and
nothing more into the round. And while I was still unsure whether this
might be a plain user space issue, only 24h later I received an 11kb
kernel patch (with blatant typos in it !1! *g* ) that apparently solves my
issue.

/me happy now, too! :)

- Ben

