Re: [PATCH] md/raid0: add config parameters to specify zone layout

2020-04-30 Thread John Stoffel
> "Song" == Song Liu  writes:

Song> Hi Jason,
>> On Apr 27, 2020, at 2:10 PM, Jason Baron  wrote:
>> 
>> 
>> 
>> On 4/25/20 12:31 AM, Coly Li wrote:
>>> On 2020/3/26 23:28, Jason Baron wrote:
 Let's add some CONFIG_* options to directly configure the raid0 layout
 if you know in advance how your raid0 array was created. This can be
 simpler than having to manage module or kernel command-line parameters.
 
>>> 
>>> Hi Jason,
>>> 
>>> If the people who compiling the kernel is not the end users, the
>>> communication gap has potential risk to make users to use a different
>>> layout for existing raid0 array after a kernel upgrade.
>>> 
>>> If this patch goes into upstream, it is very probably such risky
>>> situation may happen.
>>> 
>>> The purpose of adding default_layout is to let *end user* to be aware of
>>> they layout when they use difference sizes component disks to assemble
>>> the raid0 array, and make decision which layout algorithm should be
>>> used. Such situation cannot be decided in kernel compiling time.
>> 
>> I agree that in general it may not be known at compile time. Thus,
>> I've left the default as RAID0_LAYOUT_NONE. However, there are
>> use-cases where it is known at compile-time which layout is needed.
>> In our use-case, we knew that we didn't have any pre-3.14 raid0
>> arrays. Thus, we can safely set RAID0_ALT_MULTIZONE_LAYOUT. So
>> this is a simpler configuration for us than setting module or command
>> line parameters.

Song> I would echo Coly's concern that CONFIG_ option could make it risky. 
Song> If the overhead of maintaining extra command line parameter, I would
Song> recommend you carry a private patch for this change. For upstream, it
Song> is better NOT to carry the default in CONFIG_.

I agree as well.  Just because you have a known base, doesn't mean
that others wouldn't be hit with this problem.

John



Re: [dm-devel] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-05-02 Thread John Stoffel
> "Mike" == Mike Snitzer  writes:

Mike> On Tue, May 01 2018 at  8:36pm -0400,
Mike> Andrew Morton  wrote:

>> On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka 
>>  wrote:
>> 
>> > 
>> > 
>> > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> > 
>> > > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
>> > > > 
>> > > > 
>> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> > > > 
>> > > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>> > > > > 
>> > > > > > Fixing __vmalloc code 
>> > > > > > is easy and it doesn't require cooperation with maintainers.
>> > > > > 
>> > > > > But it is a hack against the intention of the scope api.
>> > > > 
>> > > > It is not!
>> > > 
>> > > This discussion simply doesn't make much sense it seems. The scope API
>> > > is to document the scope of the reclaim recursion critical section. That
>> > > certainly is not a utility function like vmalloc.
>> > 
>> > That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
>> > developer) from converting the code to the scope API. You make nonsensical 
>> > excuses.
>> > 
>> 
>> Fun thread!
>> 
>> Winding back to the original problem, I'd state it as
>> 
>> - Caller uses kvmalloc() but passes the address into vmalloc-naive
>> DMA API and
>> 
>> - Caller uses kvmalloc() but passes the address into kfree()
>> 
>> Yes?

Mike> I think so.

>> If so, then...
>> 
>> Is there a way in which, in the kvmalloc-called-kmalloc path, we can
>> tag the slab-allocated memory with a "this memory was allocated with
>> kvmalloc()" flag?  I *think* there's extra per-object storage available
>> with suitable slab/slub debugging options?  Perhaps we could steal one
>> bit from the redzone, dunno.
>> 
>> If so then we can
>> 
>> a) set that flag in kvmalloc() if the kmalloc() call succeeded
>> 
>> b) check for that flag in the DMA code, WARN if it is set.
>> 
>> c) in kvfree(), clear that flag before calling kfree()
>> 
>> d) in kfree(), check for that flag and go WARN() if set.
>> 
>> So both potential bugs are detected all the time, dependent upon
>> CONFIG_SLUB_DEBUG (and perhaps other slub config options).

Mike> Thanks Andrew, definitely the most sane proposal I've seen to resolve
Mike> this.

Cuts to the heart of the issue I think, and seems pretty sane.  Should
the WARN be rate limited as well?

John


Re: [dm-devel] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-05-02 Thread John Stoffel
> "Mike" == Mike Snitzer  writes:

Mike> On Tue, May 01 2018 at  8:36pm -0400,
Mike> Andrew Morton  wrote:

>> On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka 
>>  wrote:
>> 
>> > 
>> > 
>> > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> > 
>> > > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
>> > > > 
>> > > > 
>> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> > > > 
>> > > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>> > > > > 
>> > > > > > Fixing __vmalloc code 
>> > > > > > is easy and it doesn't require cooperation with maintainers.
>> > > > > 
>> > > > > But it is a hack against the intention of the scope api.
>> > > > 
>> > > > It is not!
>> > > 
>> > > This discussion simply doesn't make much sense it seems. The scope API
>> > > is to document the scope of the reclaim recursion critical section. That
>> > > certainly is not a utility function like vmalloc.
>> > 
>> > That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
>> > developer) from converting the code to the scope API. You make nonsensical 
>> > excuses.
>> > 
>> 
>> Fun thread!
>> 
>> Winding back to the original problem, I'd state it as
>> 
>> - Caller uses kvmalloc() but passes the address into vmalloc-naive
>> DMA API and
>> 
>> - Caller uses kvmalloc() but passes the address into kfree()
>> 
>> Yes?

Mike> I think so.

>> If so, then...
>> 
>> Is there a way in which, in the kvmalloc-called-kmalloc path, we can
>> tag the slab-allocated memory with a "this memory was allocated with
>> kvmalloc()" flag?  I *think* there's extra per-object storage available
>> with suitable slab/slub debugging options?  Perhaps we could steal one
>> bit from the redzone, dunno.
>> 
>> If so then we can
>> 
>> a) set that flag in kvmalloc() if the kmalloc() call succeeded
>> 
>> b) check for that flag in the DMA code, WARN if it is set.
>> 
>> c) in kvfree(), clear that flag before calling kfree()
>> 
>> d) in kfree(), check for that flag and go WARN() if set.
>> 
>> So both potential bugs are detected all the time, dependent upon
>> CONFIG_SLUB_DEBUG (and perhaps other slub config options).

Mike> Thanks Andrew, definitely the most sane proposal I've seen to resolve
Mike> this.

Cuts to the heart of the issue I think, and seems pretty sane.  Should
the WARN be rate limited as well?

John


Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-05-02 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka <mpato...@redhat.com> writes:

Mikulas> On Mon, 30 Apr 2018, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka <mpato...@redhat.com> writes:
>> 
Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:
>> 
Mikulas> I see your point - and I think the misunderstanding is this.
>> 
>> Thanks.
>> 
Mikulas> This patch is not really helping people to debug existing crashes. It 
is 
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get 
a 
Mikulas> page that suggests to turn this option on" - "you turn it on and solve 
the 
Mikulas> crash".
>> 
Mikulas> What this patch really does is that - it makes the kernel deliberately 
Mikulas> crash in a situation when the code violates the specification, but it 
Mikulas> would not crash otherwise or it would crash very rarely. It helps to 
Mikulas> detect specification violations.
>> 
Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy 
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or 
report 
Mikulas> it. How is the user or developer supposed to learn about this option, 
if 
Mikulas> he gets no crash at all?
>> 
>> So why do we make this a KConfig option at all?

Mikulas> Because other people see the KConfig option (so, they may enable it) 
and 
Mikulas> they don't see the kernel parameter (so, they won't enable it).

Mikulas> Close your eyes and say how many kernel parameters do you remember :-)

>> Just turn it on and let it rip.

Mikulas> I can't test if all the networking drivers use kvmalloc properly, 
because 
Mikulas> I don't have the hardware. You can't test it neither. No one has all 
the 
Mikulas> hardware that is supported by Linux.

Mikulas> Driver issues can only be tested by a mass of users. And if the users 
Mikulas> don't know about the debugging option, they won't enable it.

>> >> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
>> >> tells me *nothing* about why I should pick one or the other, as an
>> >> example.

Mikulas> BTW. You can enable slub debugging either with CONFIG_SLUB_DEBUG_ON or 
Mikulas> with the kernel parameter "slub_debug" - and most users who compile 
their 
Mikulas> own kernel use CONFIG_SLUB_DEBUG_ON - just because it is visible.

You miss my point, which is that there's no explanation of what the
difference is between SLAB and SLUB and which I should choose.  The
same goes here.  If the KConfig option doesn't give useful info, it's
useless.

>> Now I also think that Linus has the right idea to not just sprinkle 
>> BUG_ONs into the code, just dump and oops and keep going if you can.  
>> If it's a filesystem or a device, turn it read only so that people 
>> notice right away.

Mikulas> This vmalloc fallback is similar to
Mikulas> CONFIG_DEBUG_KOBJECT_RELEASE.  CONFIG_DEBUG_KOBJECT_RELEASE
Mikulas> changes the behavior of kobject_put in order to cause
Mikulas> deliberate crashes (that wouldn't happen otherwise) in
Mikulas> drivers that misuse kobject_put. In the same sense, we want
Mikulas> to cause deliberate crashes (that wouldn't happen otherwise)
Mikulas> in drivers that misuse kvmalloc.

Mikulas> The crashes will only happen in debugging kernels, not in
Mikulas> production kernels.

Says you.  What about people or distros that enable it
unconditionally?  They're going to get all kinds of reports and then
turn it off again.  Crashing the system isn't the answer here.  


Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-05-02 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Mon, 30 Apr 2018, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka  writes:
>> 
Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:
>> 
Mikulas> I see your point - and I think the misunderstanding is this.
>> 
>> Thanks.
>> 
Mikulas> This patch is not really helping people to debug existing crashes. It 
is 
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get 
a 
Mikulas> page that suggests to turn this option on" - "you turn it on and solve 
the 
Mikulas> crash".
>> 
Mikulas> What this patch really does is that - it makes the kernel deliberately 
Mikulas> crash in a situation when the code violates the specification, but it 
Mikulas> would not crash otherwise or it would crash very rarely. It helps to 
Mikulas> detect specification violations.
>> 
Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy 
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or 
report 
Mikulas> it. How is the user or developer supposed to learn about this option, 
if 
Mikulas> he gets no crash at all?
>> 
>> So why do we make this a KConfig option at all?

Mikulas> Because other people see the KConfig option (so, they may enable it) 
and 
Mikulas> they don't see the kernel parameter (so, they won't enable it).

Mikulas> Close your eyes and say how many kernel parameters do you remember :-)

>> Just turn it on and let it rip.

Mikulas> I can't test if all the networking drivers use kvmalloc properly, 
because 
Mikulas> I don't have the hardware. You can't test it neither. No one has all 
the 
Mikulas> hardware that is supported by Linux.

Mikulas> Driver issues can only be tested by a mass of users. And if the users 
Mikulas> don't know about the debugging option, they won't enable it.

>> >> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
>> >> tells me *nothing* about why I should pick one or the other, as an
>> >> example.

Mikulas> BTW. You can enable slub debugging either with CONFIG_SLUB_DEBUG_ON or 
Mikulas> with the kernel parameter "slub_debug" - and most users who compile 
their 
Mikulas> own kernel use CONFIG_SLUB_DEBUG_ON - just because it is visible.

You miss my point, which is that there's no explanation of what the
difference is between SLAB and SLUB and which I should choose.  The
same goes here.  If the KConfig option doesn't give useful info, it's
useless.

>> Now I also think that Linus has the right idea to not just sprinkle 
>> BUG_ONs into the code, just dump and oops and keep going if you can.  
>> If it's a filesystem or a device, turn it read only so that people 
>> notice right away.

Mikulas> This vmalloc fallback is similar to
Mikulas> CONFIG_DEBUG_KOBJECT_RELEASE.  CONFIG_DEBUG_KOBJECT_RELEASE
Mikulas> changes the behavior of kobject_put in order to cause
Mikulas> deliberate crashes (that wouldn't happen otherwise) in
Mikulas> drivers that misuse kobject_put. In the same sense, we want
Mikulas> to cause deliberate crashes (that wouldn't happen otherwise)
Mikulas> in drivers that misuse kvmalloc.

Mikulas> The crashes will only happen in debugging kernels, not in
Mikulas> production kernels.

Says you.  What about people or distros that enable it
unconditionally?  They're going to get all kinds of reports and then
turn it off again.  Crashing the system isn't the answer here.  


Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-30 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka <mpato...@redhat.com> writes:

Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:

>> >>>>> "James" == James Bottomley <james.bottom...@hansenpartnership.com> 
>> >>>>> writes:
>> 
James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options.  The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX.  Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.
>> 
>> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
>> tells me *nothing* about why I should pick one or the other, as an
>> example.
>> 
>> John

Mikulas> I see your point - and I think the misunderstanding is this.

Thanks.

Mikulas> This patch is not really helping people to debug existing crashes. It 
is 
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get 
a 
Mikulas> page that suggests to turn this option on" - "you turn it on and solve 
the 
Mikulas> crash".

Mikulas> What this patch really does is that - it makes the kernel deliberately 
Mikulas> crash in a situation when the code violates the specification, but it 
Mikulas> would not crash otherwise or it would crash very rarely. It helps to 
Mikulas> detect specification violations.

Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy 
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or 
report 
Mikulas> it. How is the user or developer supposed to learn about this option, 
if 
Mikulas> he gets no crash at all?

So why do we make this a KConfig option at all?  Just turn it on and
let it rip.  Now I also think that Linus has the right idea to not
just sprinkle BUG_ONs into the code, just dump and oops and keep going
if you can.  If it's a filesystem or a device, turn it read only so
that people notice right away.



Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-30 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:

>> >>>>> "James" == James Bottomley  
>> >>>>> writes:
>> 
James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options.  The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX.  Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.
>> 
>> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
>> tells me *nothing* about why I should pick one or the other, as an
>> example.
>> 
>> John

Mikulas> I see your point - and I think the misunderstanding is this.

Thanks.

Mikulas> This patch is not really helping people to debug existing crashes. It 
is 
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get 
a 
Mikulas> page that suggests to turn this option on" - "you turn it on and solve 
the 
Mikulas> crash".

Mikulas> What this patch really does is that - it makes the kernel deliberately 
Mikulas> crash in a situation when the code violates the specification, but it 
Mikulas> would not crash otherwise or it would crash very rarely. It helps to 
Mikulas> detect specification violations.

Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy 
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or 
report 
Mikulas> it. How is the user or developer supposed to learn about this option, 
if 
Mikulas> he gets no crash at all?

So why do we make this a KConfig option at all?  Just turn it on and
let it rip.  Now I also think that Linus has the right idea to not
just sprinkle BUG_ONs into the code, just dump and oops and keep going
if you can.  If it's a filesystem or a device, turn it read only so
that people notice right away.



Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread John Stoffel
> "James" == James Bottomley  writes:

James> On Wed, 2018-04-25 at 19:00 -0400, Mikulas Patocka wrote:
>> 
>> On Wed, 25 Apr 2018, James Bottomley wrote:
>> 
>> > > > Do we really need the new config option?  This could just be
>> > > > manually  tunable via fault injection IIUC.
>> > > 
>> > > We do, because we want to enable it in RHEL and Fedora debugging
>> > > kernels, so that it will be tested by the users.
>> > > 
>> > > The users won't use some extra magic kernel options or debugfs
>> files.
>> > 
>> > If it can be enabled via a tunable, then the distro can turn it on
>> > without the user having to do anything.  If you want to present the
>> > user with a different boot option, you can (just have the tunable
>> set
>> > on the command line), but being tunable driven means that you don't
>> > have to choose that option, you could automatically enable it under
>> a
>> > range of circumstances.  I think most sane distributions would want
>> > that flexibility.
>> > 
>> > Kconfig proliferation, conversely, is a bit of a nightmare from
>> both
>> > the user and the tester's point of view, so we're trying to avoid
>> it
>> > unless absolutely necessary.
>> > 
>> > James
>> 
>> BTW. even developers who compile their own kernel should have this
>> enabled by a CONFIG option - because if the developer sees the option
>> when browsing through menuconfig, he may enable it. If he doesn't see
>> the option, he won't even know that such an option exists.

James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options.  The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX.  Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.

I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
tells me *nothing* about why I should pick one or the other, as an
example.

John


Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread John Stoffel
> "James" == James Bottomley  writes:

James> On Wed, 2018-04-25 at 19:00 -0400, Mikulas Patocka wrote:
>> 
>> On Wed, 25 Apr 2018, James Bottomley wrote:
>> 
>> > > > Do we really need the new config option?  This could just be
>> > > > manually  tunable via fault injection IIUC.
>> > > 
>> > > We do, because we want to enable it in RHEL and Fedora debugging
>> > > kernels, so that it will be tested by the users.
>> > > 
>> > > The users won't use some extra magic kernel options or debugfs
>> files.
>> > 
>> > If it can be enabled via a tunable, then the distro can turn it on
>> > without the user having to do anything.  If you want to present the
>> > user with a different boot option, you can (just have the tunable
>> set
>> > on the command line), but being tunable driven means that you don't
>> > have to choose that option, you could automatically enable it under
>> a
>> > range of circumstances.  I think most sane distributions would want
>> > that flexibility.
>> > 
>> > Kconfig proliferation, conversely, is a bit of a nightmare from
>> both
>> > the user and the tester's point of view, so we're trying to avoid
>> it
>> > unless absolutely necessary.
>> > 
>> > James
>> 
>> BTW. even developers who compile their own kernel should have this
>> enabled by a CONFIG option - because if the developer sees the option
>> when browsing through menuconfig, he may enable it. If he doesn't see
>> the option, he won't even know that such an option exists.

James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options.  The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX.  Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.

I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
tells me *nothing* about why I should pick one or the other, as an
example.

John


Re: [PATCH 0/12 v3] Writeback improvements

2017-09-28 Thread John Stoffel
On Wed, Sep 27, 2017 at 02:13:47PM -0600, Jens Axboe wrote:
> We've had some issues with writeback in presence of memory reclaim
> at Facebook, and this patch set attempts to fix it up. The real
> functional change for that issue is patch 10. The rest are cleanups,
> as well as the removal of doing non-range cyclic writeback. The users
> of that was sync_inodes_sb() and wakeup_flusher_threads(), both of
> which writeback all of the dirty pages.

So does this patch set make things faster?  Less bursty?  Does it make
writeout take longer, but with less spikes?  What is the performance
impact of this change?   I hate to be a pain, but this just smacks of
arm waving and I'm sure FB doesn't make changes without data... :-)

> The basic idea is that we have callers that call
> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> everything'. For memory reclaim situations, we can end up queuing
> a TON of these kinds of writeback units. This can cause softlockups
> and further memory issues, since we allocate huge amounts of
> struct wb_writeback_work to handle this writeback. Handle this
> situation more gracefully.

Do you push back on the callers or slow them down?  Why do we even
allow callers to flush everything?

John


Re: [PATCH 0/12 v3] Writeback improvements

2017-09-28 Thread John Stoffel
On Wed, Sep 27, 2017 at 02:13:47PM -0600, Jens Axboe wrote:
> We've had some issues with writeback in presence of memory reclaim
> at Facebook, and this patch set attempts to fix it up. The real
> functional change for that issue is patch 10. The rest are cleanups,
> as well as the removal of doing non-range cyclic writeback. The users
> of that was sync_inodes_sb() and wakeup_flusher_threads(), both of
> which writeback all of the dirty pages.

So does this patch set make things faster?  Less bursty?  Does it make
writeout take longer, but with less spikes?  What is the performance
impact of this change?   I hate to be a pain, but this just smacks of
arm waving and I'm sure FB doesn't make changes without data... :-)

> The basic idea is that we have callers that call
> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> everything'. For memory reclaim situations, we can end up queuing
> a TON of these kinds of writeback units. This can cause softlockups
> and further memory issues, since we allocate huge amounts of
> struct wb_writeback_work to handle this writeback. Handle this
> situation more gracefully.

Do you push back on the callers or slow them down?  Why do we even
allow callers to flush everything?

John


Re: [PATCH 0/6] More graceful flusher thread memory reclaim wakeup

2017-09-20 Thread John Stoffel
On Tue, Sep 19, 2017 at 01:53:01PM -0600, Jens Axboe wrote:
> We've had some issues with writeback in presence of memory reclaim
> at Facebook, and this patch set attempts to fix it up. The real
> functional change is the last patch in the series, the first 5 are
> prep and cleanup patches.
> 
> The basic idea is that we have callers that call
> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> everything'. For memory reclaim situations, we can end up queuing
> a TON of these kinds of writeback units. This can cause softlockups
> and further memory issues, since we allocate huge amounts of
> struct wb_writeback_work to handle this writeback. Handle this
> situation more gracefully.

This looks nice, but do you have any numbers to show how this improves
things?  I read the patches, but I'm not strong enough to comment on
them at all.  But I am interested in how this improves writeback under
pressure, if at all.

John


Re: [PATCH 0/6] More graceful flusher thread memory reclaim wakeup

2017-09-20 Thread John Stoffel
On Tue, Sep 19, 2017 at 01:53:01PM -0600, Jens Axboe wrote:
> We've had some issues with writeback in presence of memory reclaim
> at Facebook, and this patch set attempts to fix it up. The real
> functional change is the last patch in the series, the first 5 are
> prep and cleanup patches.
> 
> The basic idea is that we have callers that call
> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> everything'. For memory reclaim situations, we can end up queuing
> a TON of these kinds of writeback units. This can cause softlockups
> and further memory issues, since we allocate huge amounts of
> struct wb_writeback_work to handle this writeback. Handle this
> situation more gracefully.

This looks nice, but do you have any numbers to show how this improves
things?  I read the patches, but I'm not strong enough to comment on
them at all.  But I am interested in how this improves writeback under
pressure, if at all.

John


Re: 4.4-rc3, KVM, br0 and instant hang

2015-12-05 Thread John Stoffel
>>>>> "Jens" == Jens Axboe  writes:

Jens> On 12/05/2015 10:31 AM, John Stoffel wrote:
>>>>>>> "John" == John Stoffel  writes:
>> 
John> On Fri, Dec 04, 2015 at 11:28:33PM -0500, John Stoffel wrote:
>>>> 
>> On my most recent bootup, I thought it was ok, since the VMs worked
>> for a while (10 minutes) and I was starting to re-compile the kernel
>> again to make more modules compiled in.  No luck, I got the following
>> crash dump (partial) on my netconsole box.
>> 
>> [ 1434.266524] [ cut here ]
>> [ 1434.266643] WARNING: CPU: 2 PID: 179 at block/blk-merge.c:435 
>> blk_rq_map_sg+0x2d9/0x2eb()

Jens> This is fixed in current -git, as of a few days ago.

Thanks!  I'll try that out and see how stable it is.  I assume it
doesn't matter if it's a deadline or cfq scheduler?

In any case, I see the pull in Linus' tree and I'm building a kernel
now to play with.

Thanks!
John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.4-rc3, KVM, br0 and instant hang

2015-12-05 Thread John Stoffel
>>>>> "John" == John Stoffel  writes:

John> On Fri, Dec 04, 2015 at 11:28:33PM -0500, John Stoffel wrote:
>> 
>> Hi all,

>> Anyway, if I try to boot up anything past the 4.2.6 kernel, the system
>> locks up pretty quickly with an oops message that scrolls off the
>> screen too far.  I've got some pictures which I'll attach in a bit,
>> maybe they'll help.  So at first I thought it was something to do with
>> bad kworker threads, or SCSI or SATA interactions, but as I tried to
>> configure Netconsole to log to my beaglebone black SBC, I found out
>> that if I compiled and installed 4.4-rc3, started the bridge up (br0),
>> even started KVM, but did NOT start my VMs, the system was stable.

I've now figured out that I can disable all my VMs from autostart, and
the system will come up properly.  Then I can setup netconsole to use
the br0 interface, do an  "echo t > sysrq" to confirm it's working,
and start up the VMs.

On my most recent bootup, I thought it was ok, since the VMs worked
for a while (10 minutes) and I was starting to re-compile the kernel
again to make more modules compiled in.  No luck, I got the following
crash dump (partial) on my netconsole box.

[ 1434.266524] [ cut here ]
[ 1434.266643] WARNING: CPU: 2 PID: 179 at block/blk-merge.c:435 
blk_rq_map_sg+0x2d9/0x2eb()
[ 1434.266739] Modules linked in: vhost_net vhost macvtap macvlan tun 
binfmt_misc cpufreq_stats cpuf
req_powersave cpufreq_conservative cpufreq_userspace loop snd_pcm_oss 
snd_mixer_oss snd_pcm snd_time
r snd soundcore pcspkr serio_raw edac_mce_amd k10temp edac_core sp5100_tco 
i2c_piix4 asus_atk0110 wm
i shpchp evdev acpi_cpufreq netconsole configfs dm_mod raid1 usbhid md_mod
[ 1434.267691] CPU: 2 PID: 179 Comm: kworker/2:1H Not tainted 4.4.0-rc3 #3
[ 1434.267754] Hardware name: System manufacturer System Product Name/M4A88TD-V 
EVO/USB3, BIOS 1401
   06/11/2010
   [ 1434.267851] Workqueue: kblockd cfq_kick_queue
   [ 1434.267927]   88040ba57b78 812ded80 

   [ 1434.268103]  88040ba57bb0 81071184 812c4cba 
88034aecee60
   [ 1434.268270]   0002 88040bd4b7c8 
88040ba57bc0
   [ 1434.268440] Call Trace:
   [ 1434.268501]  [] dump_stack+0x44/0x55
   [ 1434.268565]  [] warn_slowpath_common+0x95/0xae
   [ 1434.268628]  [] ? blk_rq_map_sg+0x2d9/0x2eb
   [ 1434.268688]  [] warn_slowpath_null+0x15/0x17
   [ 1434.268749]  [] blk_rq_map_sg+0x2d9/0x2eb
   [ 1434.268814]  [] scsi_init_sgtable+0x3f/0x63
   [ 1434.268876]  [] scsi_init_io+0x47/0x1ab
   [ 1434.268937]  [] sd_init_command+0x3e5/0xba6
   [ 1434.268997]  [] ? scsi_host_alloc_command+0x48/0xb0
   [ 1434.269060]  [] scsi_setup_cmnd+0x86/0x109
   [ 1434.269123]  [] scsi_prep_fn+0xa7/0x139
   [ 1434.269185]  [] blk_peek_request+0x169/0x1de
   [ 1434.269246]  [] scsi_request_fn+0x26/0x2a2
   [ 1434.269308]  [] ? __switch_to+0x1e9/0x3f1
   [ 1434.269372]  [] __blk_run_queue_uncond+0x22/0x2b
   [ 1434.269433]  [] __blk_run_queue+0x14/0x16
   [ 1434.269494]  [] cfq_kick_queue+0x2a/0x3a
   [ 1434.269554]  [] process_one_work+0x144/0x217
   [ 1434.269618]  [] worker_thread+0x1e3/0x28c
   [ 1434.269678]  [] ? rescuer_thread+0x270/0x270
   [ 1434.269738]  [] ? rescuer_thread+0x270/0x270
   [ 1434.269800]  [] kthread+0xb2/0xba
   [ 1434.269864]  [] ? kthread_parkme+0x1f/0x1f
   [ 1434.269925]  [] ret_from_fork+0x3f/0x70


And it stops and the system locks hard, it won't respond to
magic-sysrq at all and I have to hit the reset button.  Is there
anything I can provide for more details, or config options I can add
to do better debugging?

So now I'm doing yet another re-compile, but I'm making deadline be my
default scheduler.  My system is pretty simple in setup, it's mostly
triple mirrored RAID1 devices:

quad:/sys/devices# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdg1[0] sdc1[3] sde1[1]
  976628736 blocks super 1.2 [3/3] [UUU]
bitmap: 0/8 pages [0KB], 65536KB chunk

md4 : active raid1 sdf1[3] sdd1[1] sda1[2]
  1953380736 blocks super 1.2 [3/3] [UUU]
bitmap: 0/15 pages [0KB], 65536KB chunk

md0 : active raid1 sdh2[0] sdj2[3] sdi2[4]
  185545656 blocks super 1.2 [3/3] [UUU]
bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: 


And once this new kernel is compiled and installed, I'll also change
my disks to deadline scheduler and fire up the VMs to see what
happens.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.4-rc3, KVM, br0 and instant hang

2015-12-05 Thread John Stoffel
On Fri, Dec 04, 2015 at 11:28:33PM -0500, John Stoffel wrote:
> 
> Hi all,
> 
> I've been trying to upgrade to something newer than 4.2.6 since I want
> to use LVM Cache on my home NFS fileserver, KVM server, test server,
> etc.  So when it goes down, I lose all my other systems which mount
> stuff from it.
> 
> Right now I'm trying to figure out how to use Netconsole to grab a
> dump of the oops, but it's not working well.  But let me describe the
> situation as I've found it so far.
> 
> When the system boots up, it first starts with eth0 on the network,
> then switches to br0 since I have a KVM bridge setup so my VMs can
> run on the same home network, 192.168.1.0/24 which is pretty
> standard.  The system is an AMD Phenom(tm) II X4 945 Processor,
> running at a max of 3Ghz, with 16gb of RAM, mpt2 LSI PCI-E 8 port sata
> controller, on an ASUS motherboard.  I can get details if you like.
> It's an older box, but still runs really well, so why change?
> 
> Anyway, if I try to boot up anything past the 4.2.6 kernel, the system
> locks up pretty quickly with an oops message that scrolls off the
> screen too far.  I've got some pictures which I'll attach in a bit,
> maybe they'll help.  So at first I thought it was something to do with
> bad kworker threads, or SCSI or SATA interactions, but as I tried to
> configure Netconsole to log to my beaglebone black SBC, I found out
> that if I compiled and installed 4.4-rc3, started the bridge up (br0),
> even started KVM, but did NOT start my VMs, the system was stable.
> 
> And if I didn't start br0, I could start a VM, but the system wouldn't
> crash.  The VM wasn't on the network... but the system didn't crash.
> So I think I've found a wierd interaction here.  My KVMs are both
> Debian images, with 1-2gb of RAM and 1 CPU each.  Nothing strange.  My
> network config is:
> 
>  > cat /etc/network/interfaces
>  # This file describes the network interfaces available on your system
>  # and how to activate them. For more information, see interfaces(5).
> 
>  # The loopback network interface
>  auto lo
>  iface lo inet loopback
> 
>  # Bridge for VMs
>  auto br0
> 
>  iface br0 inet static
>address 192.168.1.6
>netmask 255.255.255.0
>  network 192.168.1.0
>gateway 192.168.1.254
>  bridge_ports eth0
>bridge_stp on
>  bridge_maxwait 0
>bridge_fd 0
> 
>  # Old setup
>  # auto eth0
> 
>  # iface eth0 inet static
>  #address 192.168.1.6
>  #netmask 255.255.255.0
>  #gateway 192.168.1.254
> 
> The currently running system version is:
> 
>  > cat /proc/version
>  Linux version 4.4.0-rc3 (john@quad) (gcc version 4.9.2 (Debian 4.9.2-10) 
> ) #1 SMP Thu Dec 3 12:13:30 EST 2015
> 
> And more detailed CPU info
> 
>  > cat /proc/cpuinfo
>  .
> 
>  processor   : 3
>  vendor_id   : AuthenticAMD
>  cpu family  : 16
>  model   : 4
>  model name  : AMD Phenom(tm) II X4 945 Processor
>  stepping: 3
>  microcode   : 0x1b6
>  cpu MHz : 800.000
>  cache size  : 512 KB
>  physical id : 0
>  siblings: 4
>  core id : 3
>  cpu cores   : 4
>  apicid  : 3
>  initial apicid  : 3
>  fpu : yes
>  fpu_exception   : yes
>  cpuid level : 5
>  wp  : yes
>  flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>  mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
>  fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
>  nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm
>  extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
>  wdt hw_pstate npt lbrv svm_lock nrip_save vmmcall
>  bugs: tlb_mmatch apic_c1e fxsave_leak sysret_ss_attrs
>  bogomips: 6027.13
>  TLB size: 1024 4K pages
>  clflush size: 64
>  cache_alignment : 64
>  address sizes   : 48 bits physical, 48 bits virtual
>  power management: ts ttp tm stc 100mhzsteps hwpstate
> 
> 
> Here's my bootup messages, unfortunately I don't have any oops
> messages.  For whatever reason, it kicks in so quickly, that I can't
> get anything out over the network.  I'm going to see if I can stuff
> another network card in there and use that to send traffic, instead of
> over the brige.
> 
> My next step is going to be to try and disable some of the bridge
> se

Re: 4.4-rc3, KVM, br0 and instant hang

2015-12-05 Thread John Stoffel
On Fri, Dec 04, 2015 at 11:28:33PM -0500, John Stoffel wrote:
> 
> Hi all,
> 
> I've been trying to upgrade to something newer than 4.2.6 since I want
> to use LVM Cache on my home NFS fileserver, KVM server, test server,
> etc.  So when it goes down, I lose all my other systems which mount
> stuff from it.
> 
> Right now I'm trying to figure out how to use Netconsole to grab a
> dump of the oops, but it's not working well.  But let me describe the
> situation as I've found it so far.
> 
> When the system boots up, it first starts with eth0 on the network,
> then switches to br0 since I have a KVM bridge setup so my VMs can
> run on the same home network, 192.168.1.0/24 which is pretty
> standard.  The system is an AMD Phenom(tm) II X4 945 Processor,
> running at a max of 3Ghz, with 16gb of RAM, mpt2 LSI PCI-E 8 port sata
> controller, on an ASUS motherboard.  I can get details if you like.
> It's an older box, but still runs really well, so why change?
> 
> Anyway, if I try to boot up anything past the 4.2.6 kernel, the system
> locks up pretty quickly with an oops message that scrolls off the
> screen too far.  I've got some pictures which I'll attach in a bit,
> maybe they'll help.  So at first I thought it was something to do with
> bad kworker threads, or SCSI or SATA interactions, but as I tried to
> configure Netconsole to log to my beaglebone black SBC, I found out
> that if I compiled and installed 4.4-rc3, started the bridge up (br0),
> even started KVM, but did NOT start my VMs, the system was stable.
> 
> And if I didn't start br0, I could start a VM, but the system wouldn't
> crash.  The VM wasn't on the network... but the system didn't crash.
> So I think I've found a wierd interaction here.  My KVMs are both
> Debian images, with 1-2gb of RAM and 1 CPU each.  Nothing strange.  My
> network config is:
> 
>  > cat /etc/network/interfaces
>  # This file describes the network interfaces available on your system
>  # and how to activate them. For more information, see interfaces(5).
> 
>  # The loopback network interface
>  auto lo
>  iface lo inet loopback
> 
>  # Bridge for VMs
>  auto br0
> 
>  iface br0 inet static
>address 192.168.1.6
>netmask 255.255.255.0
>  network 192.168.1.0
>gateway 192.168.1.254
>  bridge_ports eth0
>bridge_stp on
>  bridge_maxwait 0
>bridge_fd 0
> 
>  # Old setup
>  # auto eth0
> 
>  # iface eth0 inet static
>  #address 192.168.1.6
>  #netmask 255.255.255.0
>  #gateway 192.168.1.254
> 
> The currently running system version is:
> 
>  > cat /proc/version
>  Linux version 4.4.0-rc3 (john@quad) (gcc version 4.9.2 (Debian 4.9.2-10) 
> ) #1 SMP Thu Dec 3 12:13:30 EST 2015
> 
> And more detailed CPU info
> 
>  > cat /proc/cpuinfo
>  .
> 
>  processor   : 3
>  vendor_id   : AuthenticAMD
>  cpu family  : 16
>  model   : 4
>  model name  : AMD Phenom(tm) II X4 945 Processor
>  stepping: 3
>  microcode   : 0x1b6
>  cpu MHz : 800.000
>  cache size  : 512 KB
>  physical id : 0
>  siblings: 4
>  core id : 3
>  cpu cores   : 4
>  apicid  : 3
>  initial apicid  : 3
>  fpu : yes
>  fpu_exception   : yes
>  cpuid level : 5
>  wp  : yes
>  flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>  mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
>  fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
>  nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm
>  extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
>  wdt hw_pstate npt lbrv svm_lock nrip_save vmmcall
>  bugs: tlb_mmatch apic_c1e fxsave_leak sysret_ss_attrs
>  bogomips: 6027.13
>  TLB size: 1024 4K pages
>  clflush size: 64
>  cache_alignment : 64
>  address sizes   : 48 bits physical, 48 bits virtual
>  power management: ts ttp tm stc 100mhzsteps hwpstate
> 
> 
> Here's my bootup messages, unfortunately I don't have any oops
> messages.  For whatever reason, it kicks in so quickly, that I can't
> get anything out over the network.  I'm going to see if I can stuff
> another network card in there and use that to send traffic, instead of
> over the brige.
> 
> My next step is going to be to try and disable some of the bridge
> se

Re: 4.4-rc3, KVM, br0 and instant hang

2015-12-05 Thread John Stoffel
>>>>> "John" == John Stoffel <j...@quad.stoffel.home> writes:

John> On Fri, Dec 04, 2015 at 11:28:33PM -0500, John Stoffel wrote:
>> 
>> Hi all,

>> Anyway, if I try to boot up anything past the 4.2.6 kernel, the system
>> locks up pretty quickly with an oops message that scrolls off the
>> screen too far.  I've got some pictures which I'll attach in a bit,
>> maybe they'll help.  So at first I thought it was something to do with
>> bad kworker threads, or SCSI or SATA interactions, but as I tried to
>> configure Netconsole to log to my beaglebone black SBC, I found out
>> that if I compiled and installed 4.4-rc3, started the bridge up (br0),
>> even started KVM, but did NOT start my VMs, the system was stable.

I've now figured out that I can disable all my VMs from autostart, and
the system will come up properly.  Then I can setup netconsole to use
the br0 interface, do an  "echo t > sysrq" to confirm it's working,
and start up the VMs.

On my most recent bootup, I thought it was ok, since the VMs worked
for a while (10 minutes) and I was starting to re-compile the kernel
again to make more modules compiled in.  No luck, I got the following
crash dump (partial) on my netconsole box.

[ 1434.266524] [ cut here ]
[ 1434.266643] WARNING: CPU: 2 PID: 179 at block/blk-merge.c:435 
blk_rq_map_sg+0x2d9/0x2eb()
[ 1434.266739] Modules linked in: vhost_net vhost macvtap macvlan tun 
binfmt_misc cpufreq_stats cpuf
req_powersave cpufreq_conservative cpufreq_userspace loop snd_pcm_oss 
snd_mixer_oss snd_pcm snd_time
r snd soundcore pcspkr serio_raw edac_mce_amd k10temp edac_core sp5100_tco 
i2c_piix4 asus_atk0110 wm
i shpchp evdev acpi_cpufreq netconsole configfs dm_mod raid1 usbhid md_mod
[ 1434.267691] CPU: 2 PID: 179 Comm: kworker/2:1H Not tainted 4.4.0-rc3 #3
[ 1434.267754] Hardware name: System manufacturer System Product Name/M4A88TD-V 
EVO/USB3, BIOS 1401
   06/11/2010
   [ 1434.267851] Workqueue: kblockd cfq_kick_queue
   [ 1434.267927]   88040ba57b78 812ded80 

   [ 1434.268103]  88040ba57bb0 81071184 812c4cba 
88034aecee60
   [ 1434.268270]   0002 88040bd4b7c8 
88040ba57bc0
   [ 1434.268440] Call Trace:
   [ 1434.268501]  [] dump_stack+0x44/0x55
   [ 1434.268565]  [] warn_slowpath_common+0x95/0xae
   [ 1434.268628]  [] ? blk_rq_map_sg+0x2d9/0x2eb
   [ 1434.268688]  [] warn_slowpath_null+0x15/0x17
   [ 1434.268749]  [] blk_rq_map_sg+0x2d9/0x2eb
   [ 1434.268814]  [] scsi_init_sgtable+0x3f/0x63
   [ 1434.268876]  [] scsi_init_io+0x47/0x1ab
   [ 1434.268937]  [] sd_init_command+0x3e5/0xba6
   [ 1434.268997]  [] ? scsi_host_alloc_command+0x48/0xb0
   [ 1434.269060]  [] scsi_setup_cmnd+0x86/0x109
   [ 1434.269123]  [] scsi_prep_fn+0xa7/0x139
   [ 1434.269185]  [] blk_peek_request+0x169/0x1de
   [ 1434.269246]  [] scsi_request_fn+0x26/0x2a2
   [ 1434.269308]  [] ? __switch_to+0x1e9/0x3f1
   [ 1434.269372]  [] __blk_run_queue_uncond+0x22/0x2b
   [ 1434.269433]  [] __blk_run_queue+0x14/0x16
   [ 1434.269494]  [] cfq_kick_queue+0x2a/0x3a
   [ 1434.269554]  [] process_one_work+0x144/0x217
   [ 1434.269618]  [] worker_thread+0x1e3/0x28c
   [ 1434.269678]  [] ? rescuer_thread+0x270/0x270
   [ 1434.269738]  [] ? rescuer_thread+0x270/0x270
   [ 1434.269800]  [] kthread+0xb2/0xba
   [ 1434.269864]  [] ? kthread_parkme+0x1f/0x1f
   [ 1434.269925]  [] ret_from_fork+0x3f/0x70


And it stops and the system locks hard, it won't respond to
magic-sysrq at all and I have to hit the reset button.  Is there
anything I can provide for more details, or config options I can add
to do better debugging?

So now I'm doing yet another re-compile, but I'm making deadline be my
default scheduler.  My system is pretty simple in setup, it's mostly
triple mirrored RAID1 devices:

quad:/sys/devices# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdg1[0] sdc1[3] sde1[1]
  976628736 blocks super 1.2 [3/3] [UUU]
bitmap: 0/8 pages [0KB], 65536KB chunk

md4 : active raid1 sdf1[3] sdd1[1] sda1[2]
  1953380736 blocks super 1.2 [3/3] [UUU]
bitmap: 0/15 pages [0KB], 65536KB chunk

md0 : active raid1 sdh2[0] sdj2[3] sdi2[4]
  185545656 blocks super 1.2 [3/3] [UUU]
bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: 


And once this new kernel is compiled and installed, I'll also change
my disks to deadline scheduler and fire up the VMs to see what
happens.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.4-rc3, KVM, br0 and instant hang

2015-12-05 Thread John Stoffel
>>>>> "Jens" == Jens Axboe <ax...@fb.com> writes:

Jens> On 12/05/2015 10:31 AM, John Stoffel wrote:
>>>>>>> "John" == John Stoffel <j...@quad.stoffel.home> writes:
>> 
John> On Fri, Dec 04, 2015 at 11:28:33PM -0500, John Stoffel wrote:
>>>> 
>> On my most recent bootup, I thought it was ok, since the VMs worked
>> for a while (10 minutes) and I was starting to re-compile the kernel
>> again to make more modules compiled in.  No luck, I got the following
>> crash dump (partial) on my netconsole box.
>> 
>> [ 1434.266524] [ cut here ]
>> [ 1434.266643] WARNING: CPU: 2 PID: 179 at block/blk-merge.c:435 
>> blk_rq_map_sg+0x2d9/0x2eb()

Jens> This is fixed in current -git, as of a few days ago.

Thanks!  I'll try that out and see how stable it is.  I assume it
doesn't matter if it's a deadline or cfq scheduler?

In any case, I see the pull in Linus' tree and I'm building a kernel
now to play with.

Thanks!
John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


4.4-rc3, KVM, br0 and instant hang

2015-12-04 Thread John Stoffel

Hi all,

I've been trying to upgrade to something newer than 4.2.6 since I want
to use LVM Cache on my home NFS fileserver, KVM server, test server,
etc.  So when it goes down, I lose all my other systems which mount
stuff from it.

Right now I'm trying to figure out how to use Netconsole to grab a
dump of the oops, but it's not working well.  But let me describe the
situation as I've found it so far.

When the system boots up, it first starts with eth0 on the network,
then switches to br0 since I have a KVM bridge setup so my VMs can
run on the same home network, 192.168.1.0/24 which is pretty
standard.  The system is an AMD Phenom(tm) II X4 945 Processor,
running at a max of 3Ghz, with 16gb of RAM, mpt2 LSI PCI-E 8 port sata
controller, on an ASUS motherboard.  I can get details if you like.
It's an older box, but still runs really well, so why change?

Anyway, if I try to boot up anything past the 4.2.6 kernel, the system
locks up pretty quickly with an oops message that scrolls off the
screen too far.  I've got some pictures which I'll attach in a bit,
maybe they'll help.  So at first I thought it was something to do with
bad kworker threads, or SCSI or SATA interactions, but as I tried to
configure Netconsole to log to my beaglebone black SBC, I found out
that if I compiled and installed 4.4-rc3, started the bridge up (br0),
even started KVM, but did NOT start my VMs, the system was stable.

And if I didn't start br0, I could start a VM, but the system wouldn't
crash.  The VM wasn't on the network... but the system didn't crash.
So I think I've found a wierd interaction here.  My KVMs are both
Debian images, with 1-2gb of RAM and 1 CPU each.  Nothing strange.  My
network config is:

 > cat /etc/network/interfaces
 # This file describes the network interfaces available on your system
 # and how to activate them. For more information, see interfaces(5).

 # The loopback network interface
 auto lo
 iface lo inet loopback

 # Bridge for VMs
 auto br0

 iface br0 inet static
   address 192.168.1.6
 netmask 255.255.255.0
   network 192.168.1.0
 gateway 192.168.1.254
   bridge_ports eth0
 bridge_stp on
   bridge_maxwait 0
 bridge_fd 0

 # Old setup
 # auto eth0

 # iface eth0 inet static
 #address 192.168.1.6
 #netmask 255.255.255.0
 #gateway 192.168.1.254

The currently running system version is:

 > cat /proc/version
 Linux version 4.4.0-rc3 (john@quad) (gcc version 4.9.2 (Debian 4.9.2-10) ) 
#1 SMP Thu Dec 3 12:13:30 EST 2015

And more detailed CPU info

 > cat /proc/cpuinfo
 .

 processor   : 3
 vendor_id   : AuthenticAMD
 cpu family  : 16
 model   : 4
 model name  : AMD Phenom(tm) II X4 945 Processor
 stepping: 3
 microcode   : 0x1b6
 cpu MHz : 800.000
 cache size  : 512 KB
 physical id : 0
 siblings: 4
 core id : 3
 cpu cores   : 4
 apicid  : 3
 initial apicid  : 3
 fpu : yes
 fpu_exception   : yes
 cpuid level : 5
 wp  : yes
 flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
 mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
 fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
 nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm
 extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
 wdt hw_pstate npt lbrv svm_lock nrip_save vmmcall
 bugs: tlb_mmatch apic_c1e fxsave_leak sysret_ss_attrs
 bogomips: 6027.13
 TLB size: 1024 4K pages
 clflush size: 64
 cache_alignment : 64
 address sizes   : 48 bits physical, 48 bits virtual
 power management: ts ttp tm stc 100mhzsteps hwpstate


Here's my bootup messages, unfortunately I don't have any oops
messages.  For whatever reason, it kicks in so quickly, that I can't
get anything out over the network.  I'm going to see if I can stuff
another network card in there and use that to send traffic, instead of
over the brige.

My next step is going to be to try and disable some of the bridge
settings, like bridge_stp, bridge_maxwait and bridge_fd to just accept
the defaults.  I set this up under Debian Wheezy a long time ago and
never touched it since.

My network config is:

quad:~> ifconfig -a
br0   Link encap:Ethernet  HWaddr 20:cf:30:95:5f:2f
  inet addr:192.168.1.6  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: 2002:42bd:1ac0:1:22cf:30ff:fe95:5f2f/64 Scope:Global
  inet6 addr: fe80::22cf:30ff:fe95:5f2f/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:24154 errors:0 dropped:0 overruns:0 frame:0
  TX 

4.4-rc3, KVM, br0 and instant hang

2015-12-04 Thread John Stoffel

Hi all,

I've been trying to upgrade to something newer than 4.2.6 since I want
to use LVM Cache on my home NFS fileserver, KVM server, test server,
etc.  So when it goes down, I lose all my other systems which mount
stuff from it.

Right now I'm trying to figure out how to use Netconsole to grab a
dump of the oops, but it's not working well.  But let me describe the
situation as I've found it so far.

When the system boots up, it first starts with eth0 on the network,
then switches to br0 since I have a KVM bridge setup so my VMs can
run on the same home network, 192.168.1.0/24 which is pretty
standard.  The system is an AMD Phenom(tm) II X4 945 Processor,
running at a max of 3Ghz, with 16gb of RAM, mpt2 LSI PCI-E 8 port sata
controller, on an ASUS motherboard.  I can get details if you like.
It's an older box, but still runs really well, so why change?

Anyway, if I try to boot up anything past the 4.2.6 kernel, the system
locks up pretty quickly with an oops message that scrolls off the
screen too far.  I've got some pictures which I'll attach in a bit,
maybe they'll help.  So at first I thought it was something to do with
bad kworker threads, or SCSI or SATA interactions, but as I tried to
configure Netconsole to log to my beaglebone black SBC, I found out
that if I compiled and installed 4.4-rc3, started the bridge up (br0),
even started KVM, but did NOT start my VMs, the system was stable.

And if I didn't start br0, I could start a VM, but the system wouldn't
crash.  The VM wasn't on the network... but the system didn't crash.
So I think I've found a wierd interaction here.  My KVMs are both
Debian images, with 1-2gb of RAM and 1 CPU each.  Nothing strange.  My
network config is:

 > cat /etc/network/interfaces
 # This file describes the network interfaces available on your system
 # and how to activate them. For more information, see interfaces(5).

 # The loopback network interface
 auto lo
 iface lo inet loopback

 # Bridge for VMs
 auto br0

 iface br0 inet static
   address 192.168.1.6
 netmask 255.255.255.0
   network 192.168.1.0
 gateway 192.168.1.254
   bridge_ports eth0
 bridge_stp on
   bridge_maxwait 0
 bridge_fd 0

 # Old setup
 # auto eth0

 # iface eth0 inet static
 #address 192.168.1.6
 #netmask 255.255.255.0
 #gateway 192.168.1.254

The currently running system version is:

 > cat /proc/version
 Linux version 4.4.0-rc3 (john@quad) (gcc version 4.9.2 (Debian 4.9.2-10) ) 
#1 SMP Thu Dec 3 12:13:30 EST 2015

And more detailed CPU info

 > cat /proc/cpuinfo
 .

 processor   : 3
 vendor_id   : AuthenticAMD
 cpu family  : 16
 model   : 4
 model name  : AMD Phenom(tm) II X4 945 Processor
 stepping: 3
 microcode   : 0x1b6
 cpu MHz : 800.000
 cache size  : 512 KB
 physical id : 0
 siblings: 4
 core id : 3
 cpu cores   : 4
 apicid  : 3
 initial apicid  : 3
 fpu : yes
 fpu_exception   : yes
 cpuid level : 5
 wp  : yes
 flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
 mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
 fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
 nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm
 extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
 wdt hw_pstate npt lbrv svm_lock nrip_save vmmcall
 bugs: tlb_mmatch apic_c1e fxsave_leak sysret_ss_attrs
 bogomips: 6027.13
 TLB size: 1024 4K pages
 clflush size: 64
 cache_alignment : 64
 address sizes   : 48 bits physical, 48 bits virtual
 power management: ts ttp tm stc 100mhzsteps hwpstate


Here's my bootup messages, unfortunately I don't have any oops
messages.  For whatever reason, it kicks in so quickly, that I can't
get anything out over the network.  I'm going to see if I can stuff
another network card in there and use that to send traffic, instead of
over the brige.

My next step is going to be to try and disable some of the bridge
settings, like bridge_stp, bridge_maxwait and bridge_fd to just accept
the defaults.  I set this up under Debian Wheezy a long time ago and
never touched it since.

My network config is:

quad:~> ifconfig -a
br0   Link encap:Ethernet  HWaddr 20:cf:30:95:5f:2f
  inet addr:192.168.1.6  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: 2002:42bd:1ac0:1:22cf:30ff:fe95:5f2f/64 Scope:Global
  inet6 addr: fe80::22cf:30ff:fe95:5f2f/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:24154 errors:0 dropped:0 overruns:0 frame:0
  TX 

Re: Hard lockup in ext4_finish_bio

2015-10-08 Thread John Stoffel
>>>>> "Nikolay" == Nikolay Borisov  writes:

Nikolay> On 10/08/2015 05:34 PM, John Stoffel wrote:
>> Great bug report, but you're missing the info on which kernel
>> you're

Nikolay> This is on 3.12.47 (self compiled). It was evident on my
Nikolay> initial post, but I did forget to mention that in the
Nikolay> reply. Also, I suspect even current kernel are susceptible to
Nikolay> this since the locking in question hasn't changed.

Hi Nikolay, must have missed it.  I looked quickly, but didn't find
it.  Since it's such an older kernel release, it might be best if you
upgrade to the latest version and try to re-create the lock if at all
possible.  

What kind of workload are you running on there?  And if you have more
details on the hardware, that might help as well.  

John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hard lockup in ext4_finish_bio

2015-10-08 Thread John Stoffel

Great bug report, but you're missing the info on which kernel you're
running here...  is this a vendor kernel or self-compiled?  


Nikolay> I've hit a rather strange hard lock up on one of my servers
Nikolay> from the page writeback path, the actual backtrace is:

Nikolay> [427149.717151] [ cut here ]
Nikolay> [427149.717553] WARNING: CPU: 23 PID: 4611 at
Nikolay> kernel/watchdog.c:245 watchdog_overflow_callback+0x98/0xc0()
Nikolay> [427149.718216] Watchdog detected hard LOCKUP on cpu 23
Nikolay> [427149.718292] Modules linked in: [427149.718723] tcp_diag
Nikolay> inet_diag netconsole act_police cls_basic sch_ingress
Nikolay> xt_pkttype xt_state veth openvswitch gre vxlan ip_tunnel
Nikolay> xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat
Nikolay> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_n at xt_CT
Nikolay> nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs
Nikolay> ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ext2
Nikolay> dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio
Nikolay> dm_mirror dm_region_hash dm_log i2c_i801 lpc_ich mfd_core
Nikolay> shpchp i oapic ioatdma igb i2c_algo_bit ses enclosure
Nikolay> ipmi_devintf ipmi_si ipmi_msghandler ib_qib dca ib_mad
Nikolay> ib_core [427149.725321] CPU: 23 PID: 4611 Comm: kworker/u98:7
Nikolay> Not tainted 3.12.47-clouder1 #1 [427149.725690] Hardware
Nikolay> name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
Nikolay> [427149.726062] Workqueue: writeback bdi_writeback_workfn
Nikolay> (flush-252:148) [427149.726564] 00f5
Nikolay> 883fff366b58 81651631 00f5
Nikolay> [427149.727212] 883fff366ba8 883fff366b98
Nikolay> 81089a6c  [427149.727860]
Nikolay> 883fd2f08000  883fff366ce8
Nikolay>  [427149.728490] Call Trace: [427149.728845]
Nikolay>  [] dump_stack+0x58/0x7f
Nikolay> [427149.729350] []
Nikolay> warn_slowpath_common+0x8c/0xc0 [427149.729712]
Nikolay> [] warn_slowpath_fmt+0x46/0x50
Nikolay> [427149.730076] []
Nikolay> watchdog_overflow_callback+0x98/0xc0 [427149.730443]
Nikolay> [] __perf_event_overflow+0x9c/0x250
Nikolay> [427149.730810] []
Nikolay> perf_event_overflow+0x14/0x20 [427149.731175]
Nikolay> [] intel_pmu_handle_irq+0x1d6/0x3e0
Nikolay> [427149.739656] []
Nikolay> perf_event_nmi_handler+0x34/0x60 [427149.740027]
Nikolay> [] nmi_handle+0xa2/0x1a0 [427149.740389]
Nikolay> [] do_nmi+0x164/0x430 [427149.740754]
Nikolay> [] end_repeat_nmi+0x1a/0x1e [427149.741122]
Nikolay> [] ? mempool_free_slab+0x17/0x20
Nikolay> [427149.741492] [] ?
Nikolay> ext4_finish_bio+0x275/0x2a0 [427149.741854]
Nikolay> [] ? ext4_finish_bio+0x275/0x2a0
Nikolay> [427149.742216] [] ?
Nikolay> ext4_finish_bio+0x275/0x2a0 [427149.742579] <> 
Nikolay> [] ext4_end_bio+0xc8/0x120 [427149.743150]
Nikolay> [] bio_endio+0x1d/0x40 [427149.743516]
Nikolay> [] dec_pending+0x1c1/0x360 [427149.743878]
Nikolay> [] clone_endio+0x76/0xa0 [427149.744239]
Nikolay> [] bio_endio+0x1d/0x40 [427149.744599]
Nikolay> [] dec_pending+0x1c1/0x360 [427149.744964]
Nikolay> [] clone_endio+0x76/0xa0 [427149.745326]
Nikolay> [] bio_endio+0x1d/0x40 [427149.745686]
Nikolay> [] dec_pending+0x1c1/0x360 [427149.746048]
Nikolay> [] clone_endio+0x76/0xa0 [427149.746407]
Nikolay> [] bio_endio+0x1d/0x40 [427149.746773]
Nikolay> [] blk_update_request+0x21b/0x450
Nikolay> [427149.747138] []
Nikolay> blk_update_bidi_request+0x27/0xb0 [427149.747513]
Nikolay> [] blk_end_bidi_request+0x2f/0x80
Nikolay> [427149.748101] []
Nikolay> blk_end_request+0x10/0x20 [427149.748705]
Nikolay> [] scsi_io_completion+0xbc/0x620
Nikolay> [427149.749297] []
Nikolay> scsi_finish_command+0xc9/0x130 [427149.749891]
Nikolay> [] scsi_softirq_done+0x147/0x170
Nikolay> [427149.750491] []
Nikolay> blk_done_softirq+0x7d/0x90 [427149.751089]
Nikolay> [] __do_softirq+0x137/0x2e0 [427149.751694]
Nikolay> [] call_softirq+0x1c/0x30 [427149.752284]
Nikolay> [] do_softirq+0x8d/0xc0 [427149.752892]
Nikolay> [] irq_exit+0x95/0xa0 [427149.753526]
Nikolay> []
Nikolay> smp_call_function_single_interrupt+0x35/0x40 [427149.754149]
Nikolay> [] call_function_single_interrupt+0x6f/0x80
Nikolay> [427149.754750]  [] ? memcpy+0x6/0x110
Nikolay> [427149.755572] [] ? __bio_clone+0x26/0x70
Nikolay> [427149.756179] []
Nikolay> __clone_and_map_data_bio+0x139/0x160 [427149.756814]
Nikolay> [] __split_and_process_bio+0x3ed/0x490
Nikolay> [427149.757444] [] dm_request+0x136/0x1e0
Nikolay> [427149.758041] []
Nikolay> generic_make_request+0xca/0x100 [427149.758641]
Nikolay> [] submit_bio+0x79/0x160 [427149.759035]
Nikolay> [] ? account_page_writeback+0x2d/0x40
Nikolay> [427149.759406] [] ?
Nikolay> __test_set_page_writeback+0x16d/0x1f0 [427149.759781]
Nikolay> [] ext4_io_submit+0x29/0x50 [427149.760151]
Nikolay> [] ext4_bio_write_page+0x12b/0x2f0
Nikolay> [427149.760519] []
Nikolay> mpage_submit_page+0x68/0x90 [427149.760887]
Nikolay> [] mpage_process_page_bufs+0xf0/0x110
Nikolay> [427149.761257] []
Nikolay> 

Re: Hard lockup in ext4_finish_bio

2015-10-08 Thread John Stoffel

Great bug report, but you're missing the info on which kernel you're
running here...  is this a vendor kernel or self-compiled?  


Nikolay> I've hit a rather strange hard lock up on one of my servers
Nikolay> from the page writeback path, the actual backtrace is:

Nikolay> [427149.717151] [ cut here ]
Nikolay> [427149.717553] WARNING: CPU: 23 PID: 4611 at
Nikolay> kernel/watchdog.c:245 watchdog_overflow_callback+0x98/0xc0()
Nikolay> [427149.718216] Watchdog detected hard LOCKUP on cpu 23
Nikolay> [427149.718292] Modules linked in: [427149.718723] tcp_diag
Nikolay> inet_diag netconsole act_police cls_basic sch_ingress
Nikolay> xt_pkttype xt_state veth openvswitch gre vxlan ip_tunnel
Nikolay> xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat
Nikolay> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_n at xt_CT
Nikolay> nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs
Nikolay> ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ext2
Nikolay> dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio
Nikolay> dm_mirror dm_region_hash dm_log i2c_i801 lpc_ich mfd_core
Nikolay> shpchp i oapic ioatdma igb i2c_algo_bit ses enclosure
Nikolay> ipmi_devintf ipmi_si ipmi_msghandler ib_qib dca ib_mad
Nikolay> ib_core [427149.725321] CPU: 23 PID: 4611 Comm: kworker/u98:7
Nikolay> Not tainted 3.12.47-clouder1 #1 [427149.725690] Hardware
Nikolay> name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
Nikolay> [427149.726062] Workqueue: writeback bdi_writeback_workfn
Nikolay> (flush-252:148) [427149.726564] 00f5
Nikolay> 883fff366b58 81651631 00f5
Nikolay> [427149.727212] 883fff366ba8 883fff366b98
Nikolay> 81089a6c  [427149.727860]
Nikolay> 883fd2f08000  883fff366ce8
Nikolay>  [427149.728490] Call Trace: [427149.728845]
Nikolay>  [] dump_stack+0x58/0x7f
Nikolay> [427149.729350] []
Nikolay> warn_slowpath_common+0x8c/0xc0 [427149.729712]
Nikolay> [] warn_slowpath_fmt+0x46/0x50
Nikolay> [427149.730076] []
Nikolay> watchdog_overflow_callback+0x98/0xc0 [427149.730443]
Nikolay> [] __perf_event_overflow+0x9c/0x250
Nikolay> [427149.730810] []
Nikolay> perf_event_overflow+0x14/0x20 [427149.731175]
Nikolay> [] intel_pmu_handle_irq+0x1d6/0x3e0
Nikolay> [427149.739656] []
Nikolay> perf_event_nmi_handler+0x34/0x60 [427149.740027]
Nikolay> [] nmi_handle+0xa2/0x1a0 [427149.740389]
Nikolay> [] do_nmi+0x164/0x430 [427149.740754]
Nikolay> [] end_repeat_nmi+0x1a/0x1e [427149.741122]
Nikolay> [] ? mempool_free_slab+0x17/0x20
Nikolay> [427149.741492] [] ?
Nikolay> ext4_finish_bio+0x275/0x2a0 [427149.741854]
Nikolay> [] ? ext4_finish_bio+0x275/0x2a0
Nikolay> [427149.742216] [] ?
Nikolay> ext4_finish_bio+0x275/0x2a0 [427149.742579] <> 
Nikolay> [] ext4_end_bio+0xc8/0x120 [427149.743150]
Nikolay> [] bio_endio+0x1d/0x40 [427149.743516]
Nikolay> [] dec_pending+0x1c1/0x360 [427149.743878]
Nikolay> [] clone_endio+0x76/0xa0 [427149.744239]
Nikolay> [] bio_endio+0x1d/0x40 [427149.744599]
Nikolay> [] dec_pending+0x1c1/0x360 [427149.744964]
Nikolay> [] clone_endio+0x76/0xa0 [427149.745326]
Nikolay> [] bio_endio+0x1d/0x40 [427149.745686]
Nikolay> [] dec_pending+0x1c1/0x360 [427149.746048]
Nikolay> [] clone_endio+0x76/0xa0 [427149.746407]
Nikolay> [] bio_endio+0x1d/0x40 [427149.746773]
Nikolay> [] blk_update_request+0x21b/0x450
Nikolay> [427149.747138] []
Nikolay> blk_update_bidi_request+0x27/0xb0 [427149.747513]
Nikolay> [] blk_end_bidi_request+0x2f/0x80
Nikolay> [427149.748101] []
Nikolay> blk_end_request+0x10/0x20 [427149.748705]
Nikolay> [] scsi_io_completion+0xbc/0x620
Nikolay> [427149.749297] []
Nikolay> scsi_finish_command+0xc9/0x130 [427149.749891]
Nikolay> [] scsi_softirq_done+0x147/0x170
Nikolay> [427149.750491] []
Nikolay> blk_done_softirq+0x7d/0x90 [427149.751089]
Nikolay> [] __do_softirq+0x137/0x2e0 [427149.751694]
Nikolay> [] call_softirq+0x1c/0x30 [427149.752284]
Nikolay> [] do_softirq+0x8d/0xc0 [427149.752892]
Nikolay> [] irq_exit+0x95/0xa0 [427149.753526]
Nikolay> []
Nikolay> smp_call_function_single_interrupt+0x35/0x40 [427149.754149]
Nikolay> [] call_function_single_interrupt+0x6f/0x80
Nikolay> [427149.754750]  [] ? memcpy+0x6/0x110
Nikolay> [427149.755572] [] ? __bio_clone+0x26/0x70
Nikolay> [427149.756179] []
Nikolay> __clone_and_map_data_bio+0x139/0x160 [427149.756814]
Nikolay> [] __split_and_process_bio+0x3ed/0x490
Nikolay> [427149.757444] [] dm_request+0x136/0x1e0
Nikolay> [427149.758041] []
Nikolay> generic_make_request+0xca/0x100 [427149.758641]
Nikolay> [] submit_bio+0x79/0x160 [427149.759035]
Nikolay> [] ? account_page_writeback+0x2d/0x40
Nikolay> [427149.759406] [] ?
Nikolay> __test_set_page_writeback+0x16d/0x1f0 [427149.759781]
Nikolay> [] ext4_io_submit+0x29/0x50 [427149.760151]
Nikolay> [] ext4_bio_write_page+0x12b/0x2f0
Nikolay> [427149.760519] []
Nikolay> mpage_submit_page+0x68/0x90 [427149.760887]
Nikolay> [] mpage_process_page_bufs+0xf0/0x110
Nikolay> [427149.761257] []
Nikolay> 

Re: Hard lockup in ext4_finish_bio

2015-10-08 Thread John Stoffel
>>>>> "Nikolay" == Nikolay Borisov <ker...@kyup.com> writes:

Nikolay> On 10/08/2015 05:34 PM, John Stoffel wrote:
>> Great bug report, but you're missing the info on which kernel
>> you're

Nikolay> This is on 3.12.47 (self compiled). It was evident on my
Nikolay> initial post, but I did forget to mention that in the
Nikolay> reply. Also, I suspect even current kernel are susceptible to
Nikolay> this since the locking in question hasn't changed.

Hi Nikolay, must have missed it.  I looked quickly, but didn't find
it.  Since it's such an older kernel release, it might be best if you
upgrade to the latest version and try to re-create the lock if at all
possible.  

What kind of workload are you running on there?  And if you have more
details on the hardware, that might help as well.  

John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 0/2] A simpler way to maintain custom defconfigs

2015-08-28 Thread John Stoffel

Felipe> For several years I've used a trick to be able to maintain a simple 
defconfig
Felipe> that works across many versions, and requires little maintenance from my
Felipe> part:

Felipe> % cat arch/x86/configs/x86_64_defconfig ~/my-config > .config && make 
olddefconfig

Felipe> I'm sending a proposal to integrate it on the build system so that many 
people
Felipe> can do the same in a simple manner.

Felipe> The interesting part is how to generate this simplified defconfig. In a
Felipe> nutshell; you want to take your .config, remove everything that is the 
default
Felipe> in the Kconfig files (what savedefconfig does), but also removes 
anything that
Felipe> is in the default defconfig (e.g. x86_64_defconfig)

Felipe> I've been doing this by hand, but today I gave it a shot to automate 
this. The
Felipe> result is a bit crude, but it works.

Felipe> Thoughts?

I like this idea, it makes alot of sense to me, and looks like it will
simplify things for people.  

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 0/2] A simpler way to maintain custom defconfigs

2015-08-28 Thread John Stoffel

Felipe For several years I've used a trick to be able to maintain a simple 
defconfig
Felipe that works across many versions, and requires little maintenance from my
Felipe part:

Felipe % cat arch/x86/configs/x86_64_defconfig ~/my-config  .config  make 
olddefconfig

Felipe I'm sending a proposal to integrate it on the build system so that many 
people
Felipe can do the same in a simple manner.

Felipe The interesting part is how to generate this simplified defconfig. In a
Felipe nutshell; you want to take your .config, remove everything that is the 
default
Felipe in the Kconfig files (what savedefconfig does), but also removes 
anything that
Felipe is in the default defconfig (e.g. x86_64_defconfig)

Felipe I've been doing this by hand, but today I gave it a shot to automate 
this. The
Felipe result is a bit crude, but it works.

Felipe Thoughts?

I like this idea, it makes alot of sense to me, and looks like it will
simplify things for people.  

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3 v4] mm/vmalloc: Cache the vmalloc memory info

2015-08-24 Thread John Stoffel

George> John Stoffel  wrote:
>>> vmap_info_gen should be initialized to 1 to force an initial
>>> cache update.

>> Blech, it should be initialized with a proper #define
>> VMAP_CACHE_NEEDS_UPDATE 1, instead of more magic numbers.

George> Er... this is a joke, right?

Not really.  The comment made before was that by setting this variable
to zero, it wasn't properly initialized.  Which implies that either
the API is wrong... or we should be documenting it better.   I just
went in the direction of the #define instead of a comment. 

George> First, this number is used exactly once, and it's not part of
George> a collection of similar numbers.  And the definition would be
George> adjacent to the use.

George> We have easier ways of accomplishing that, called "comments".

Sure, that would be the better solution in this case.  

George> Second, your proposed name is misleading.  "needs update" is defined
George> as vmap_info_gen != vmap_info_cache_gen.  There is no particular value
George> of either that has this meaning.

George> For example, initializing vmap_info_cache_gen to -1 would do just as 
well.
George> (I actually considered that before deciding that +1 was "simpler" than 
-1.)

See, I just threw out a dumb suggestion without reading the patch
properly.  My fault.

George> (John, my apologies if I went over the top and am contributing to LKML's
George> reputation for flaming.  I *did* actually laugh, and *do* think it's a
George> dumb idea, but my annoyance is really directed at unpleasant memories of
George> mindless application of coding style guidelines.  In this case, I 
suspect
George> you just posted before reading carefully enough to see the subtle 
logic.)

Nope, I'm in the wrong here.  And your comment here is wonderful, I
really do appreciate how you handled my ham fisted attempt to
contribute.  But I've got thick skin and I'll keep trying in my free
time to comment on patches when I can.

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3 v4] mm/vmalloc: Cache the vmalloc memory info

2015-08-24 Thread John Stoffel
> "Ingo" == Ingo Molnar  writes:

Ingo> * George Spelvin  wrote:

>> First, an actual, albeit minor, bug: initializing both vmap_info_gen
>> and vmap_info_cache_gen to 0 marks the cache as valid, which it's not.

Ingo> Ha! :-) Fixed.

>> vmap_info_gen should be initialized to 1 to force an initial
>> cache update.

Blech, it should be initialized with a proper #define
VMAP_CACHE_NEEDS_UPDATE 1, instead of more magic numbers.


Ingo> + */
Ingo> +static DEFINE_SPINLOCK(vmap_info_lock);
Ingo> +static int vmap_info_gen = 1;

   static int vmap_info_gen = VMAP_CACHE_NEEDS_UPDATE;

Ingo> +static int vmap_info_cache_gen;
Ingo> +static struct vmalloc_info vmap_info_cache;
Ingo> +#endif


This will help keep bugs like this out in the future... I hope!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3 v4] mm/vmalloc: Cache the vmalloc memory info

2015-08-24 Thread John Stoffel
 Ingo == Ingo Molnar mi...@kernel.org writes:

Ingo * George Spelvin li...@horizon.com wrote:

 First, an actual, albeit minor, bug: initializing both vmap_info_gen
 and vmap_info_cache_gen to 0 marks the cache as valid, which it's not.

Ingo Ha! :-) Fixed.

 vmap_info_gen should be initialized to 1 to force an initial
 cache update.

Blech, it should be initialized with a proper #define
VMAP_CACHE_NEEDS_UPDATE 1, instead of more magic numbers.


Ingo + */
Ingo +static DEFINE_SPINLOCK(vmap_info_lock);
Ingo +static int vmap_info_gen = 1;

   static int vmap_info_gen = VMAP_CACHE_NEEDS_UPDATE;

Ingo +static int vmap_info_cache_gen;
Ingo +static struct vmalloc_info vmap_info_cache;
Ingo +#endif


This will help keep bugs like this out in the future... I hope!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3 v4] mm/vmalloc: Cache the vmalloc memory info

2015-08-24 Thread John Stoffel

George John Stoffel j...@stoffel.org wrote:
 vmap_info_gen should be initialized to 1 to force an initial
 cache update.

 Blech, it should be initialized with a proper #define
 VMAP_CACHE_NEEDS_UPDATE 1, instead of more magic numbers.

George Er... this is a joke, right?

Not really.  The comment made before was that by setting this variable
to zero, it wasn't properly initialized.  Which implies that either
the API is wrong... or we should be documenting it better.   I just
went in the direction of the #define instead of a comment. 

George First, this number is used exactly once, and it's not part of
George a collection of similar numbers.  And the definition would be
George adjacent to the use.

George We have easier ways of accomplishing that, called comments.

Sure, that would be the better solution in this case.  

George Second, your proposed name is misleading.  needs update is defined
George as vmap_info_gen != vmap_info_cache_gen.  There is no particular value
George of either that has this meaning.

George For example, initializing vmap_info_cache_gen to -1 would do just as 
well.
George (I actually considered that before deciding that +1 was simpler than 
-1.)

See, I just threw out a dumb suggestion without reading the patch
properly.  My fault.

George (John, my apologies if I went over the top and am contributing to LKML's
George reputation for flaming.  I *did* actually laugh, and *do* think it's a
George dumb idea, but my annoyance is really directed at unpleasant memories of
George mindless application of coding style guidelines.  In this case, I 
suspect
George you just posted before reading carefully enough to see the subtle 
logic.)

Nope, I'm in the wrong here.  And your comment here is wonderful, I
really do appreciate how you handled my ham fisted attempt to
contribute.  But I've got thick skin and I'll keep trying in my free
time to comment on patches when I can.

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86/kconfig/32: Make CONFIG_VM86 default to n and remove EXPERT

2015-07-09 Thread John Stoffel
> "Linus" == Linus Torvalds  writes:

Linus> On Thu, Jul 9, 2015 at 11:51 AM, Arjan van de Ven 
 wrote:
>> 
>> I would rather do BOTH the default n AND the EXPERT

Linus> That basically makes it impossible for "normal people" to test it. You
Linus> have to mark yourself as expert, and then get the rest of the
Linus> configuration right. Not a good idea.

Linus> The kernel config is probably our biggest problem for getting
Linus> people to test. Building the kernel? Easy. Installing it? "make
Linus> install; make modules_install". Not that hard, unless your
Linus> distro has screwed it up (which has happened, I'm looking at
Linus> you, Ubuntu).

The big problem with the kernel config is the piles and piles of crap
which makes finding the common cases really really hard, and I've been
following this list and building kernels off and on now for 12+
years.  

It would be nice if we could come up with a plan to organize the
configuration tree, and make it easier to use, with a little bit more
thought in how it's laid out.  

For example, under Device Drivers -> PPS Support ->  ??? 

What the hell is this?  Expand and use your acronyms the first time
you use them like "Parallel Port Support (PPS)"  so people have a
clue of figuring what you're talking about.

Maybe we could add a 'quick system' menu, where you select common
configurations at the top level, such as x86_64 home PC, which would
turn on all the options you'd pretty much expect for a home PC:

 - x86_64 cpu
  - max CPUs of 16
  - 
 - Device drivers:
  - SATA, PATA, AHCI, SCSI, USB, RAID, LVM
  - AMD/NVidia/Intel video drivers.

We could have the same for:

   ARM boards, PPC, Sparc, etc 

I know this is a hard problem space, esp since I'm sure people will
scream if you move their baby down/up/sideways in the config
hierarchy.  But cleaning it up and maybe even just sorting
alphabetically would be a big help!


 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86/kconfig/32: Make CONFIG_VM86 default to n and remove EXPERT

2015-07-09 Thread John Stoffel
 Linus == Linus Torvalds torva...@linux-foundation.org writes:

Linus On Thu, Jul 9, 2015 at 11:51 AM, Arjan van de Ven 
ar...@linux.intel.com wrote:
 
 I would rather do BOTH the default n AND the EXPERT

Linus That basically makes it impossible for normal people to test it. You
Linus have to mark yourself as expert, and then get the rest of the
Linus configuration right. Not a good idea.

Linus The kernel config is probably our biggest problem for getting
Linus people to test. Building the kernel? Easy. Installing it? make
Linus install; make modules_install. Not that hard, unless your
Linus distro has screwed it up (which has happened, I'm looking at
Linus you, Ubuntu).

The big problem with the kernel config is the piles and piles of crap
which makes finding the common cases really really hard, and I've been
following this list and building kernels off and on now for 12+
years.  

It would be nice if we could come up with a plan to organize the
configuration tree, and make it easier to use, with a little bit more
thought in how it's laid out.  

For example, under Device Drivers - PPS Support -  ??? 

What the hell is this?  Expand and use your acronyms the first time
you use them like Parallel Port Support (PPS)  so people have a
clue of figuring what you're talking about.

Maybe we could add a 'quick system' menu, where you select common
configurations at the top level, such as x86_64 home PC, which would
turn on all the options you'd pretty much expect for a home PC:

 - x86_64 cpu
  - max CPUs of 16
  - 
 - Device drivers:
  - SATA, PATA, AHCI, SCSI, USB, RAID, LVM
  - AMD/NVidia/Intel video drivers.

We could have the same for:

   ARM boards, PPC, Sparc, etc 

I know this is a hard problem space, esp since I'm sure people will
scream if you move their baby down/up/sideways in the config
hierarchy.  But cleaning it up and maybe even just sorting
alphabetically would be a big help!


 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread John Stoffel
> "Austin" == Austin S Hemmelgarn  writes:

Austin> On 2015-05-12 01:08, Kevin Easton wrote:
>> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
>>> On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> Let me re-ask the question that I asked last week (and was apparently
> ignored).  Why not trying to use the lazytime feature instead of
> pointing a head straight at the application's --- and system
> administrators' --- heads?
 
 Sorry Ted, I thought I responded already.
 
 The goal is to avoid inode writeout entirely when we can, and
 as I understand it lazytime will still force writeout before the inode
 is dropped from the cache.  In systems like Ceph in particular, the
 IOs can be spread across lots of files, so simply deferring writeout
 doesn't always help.
>>> 
>>> Sure, but it would reduce the writeout by orders of magnitude.  I can
>>> understand if you want to reduce it further, but it might be good
>>> enough for your purposes.
>>> 
>>> I considered doing the equivalent of O_NOMTIME for our purposes at
>>> $WORK, and our use case is actually not that different from Ceph's
>>> (i.e., using a local disk file system to support a cluster file
>>> system), and lazytime was (a) something I figured was something I
>>> could upstream in good conscience, and (b) was more than good enough
>>> for us.
>> 
>> A safer alternative might be a chattr file attribute that if set, the
>> mtime is not updated on writes, and stat() on the file always shows the
>> mtime as "right now".  At least that way, the file won't accidentally
>> get left out of backups that rely on the mtime.
>> 
>> (If the file attribute is unset, you immediately update the mtime then
>> too, and from then on the file is back to normal).
>> 

Austin> I like this even better than the flag suggestion, it provides
Austin> better control, means that you don't need to update
Austin> applications to get the benefits, and prevents backup software
Austin> from breaking (although backups would be bigger).

Me too, it fails in a safer mode, where you do more work on backups
than strictly needed.  I'm still against this as a mount option
though, way way way too many bullets in the foot gun.  And as someone
else said, once you mount with O_NOMTIME, then unmount, then mount
again without O_NOMTIME, you've lost information.  Not good.  

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread John Stoffel
> "Sage" == Sage Weil  writes:

Sage> On Mon, 11 May 2015, Trond Myklebust wrote:
>> On Mon, May 11, 2015 at 12:39 PM, Sage Weil  wrote:
>> > On Mon, 11 May 2015, Dave Chinner wrote:
>> >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
>> >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil  wrote:
>> >> > > I'm sure you realize what we're try to achieve is the same "invisible 
>> >> > > IO"
>> >> > > that the XFS open by handle ioctls do by default.  Would you be more
>> >> > > comfortable if this option where only available to the generic
>> >> > > open_by_handle syscall, and not to open(2)?
>> >> >
>> >> > It should be an ioctl(). It has no business being part of
>> >> > open_by_handle either, since that is another generic interface.
>> >
>> > Our use-case doesn't make sense on network file systems, but it does on
>> > any reasonably featureful local filesystem, and the goal is to be generic
>> > there.  If mtime is critical to a network file system's consistency it
>> > seems pretty reasonable to disallow/ignore it for just that file system
>> > (e.g., by masking off the flag at open time), as others won't have that
>> > same problem (cephfs doesn't, for example).
>> >
>> > Perhaps making each fs opt-in instead of handling it in a generic path
>> > would alleviate this concern?
>> 
>> The issue isn't whether or not you have a network file system, it's
>> whether or not you want users to be able to manage data. mtime isn't
>> useful for the application (which knows whether or not it has changed
>> the file) or for the filesystem (ditto). It exists, rather, in order
>> to enable data management by users and other applications, letting
>> them know whether or not the data contents of the file have changed,
>> and when that change occurred.

Sage> Agreed.
 
>> If you are able to guarantee that your users don't care about that,
>> then fine, but that would be a very special case that doesn't fit the
>> way that most data centres are run. Backups are one case where mtime
>> matters, tiering and archiving is another.

Sage> This is true, although I argue it is becoming increasingly
Sage> common for the data management (including backups and so forth)
Sage> to be layered not on top of the POSIX file system but on
Sage> something higher up in the stack. This is true of pretty much
Sage> any distributed system (ceph, cassandra, mongo, etc., and I
Sage> assume commercial databases like Oracle, too) where backups,
Sage> replication, and any other DR strategies need to be orchestrated
Sage> across nodes to be consistent--simply copying files out from
Sage> underneath them is already insufficient and a recipe for
Sage> disaster.

you're smoking crack here.  Backups are not layered at higher layers
unless absolutely necessary, such as for databases.  Now Mongo, Hadoop
and others might also fit this model, but for day to day backup of
data, it's mtime all the way.  

I don't see why you insist that this is a good idea to implement for a
very special corner case.  

Sage> There is a growing category of applications that can benefit
Sage> from this capability...

There is a perceived growing category of super special niche
applications which might think they want this capability.  

Why are you even using a filesystem in the first place if you're so
worried about writing out inodes being a performance problem?  Just
use raw partitions and do all the work yourself.  Oracle and other DBs
can do this when they want.  

>> Neither of these examples
>> cases are under the control of the application that calls
>> open(O_NOMTIME).

Sage> Wouldn't a mount option (e.g., allow_nomtime) address this
Sage> concern?  Only nodes provisioned explicitly to run these systems
Sage> would be enable this option.

Why do you keep coming back to a mount option?  What's wrong with a
per-file ioctl option?  Making this a mount option means that you
default to a fail hard setup.  If someone screws up and mounts user
home directories with this option thinking that it's like the noatime
option, then suddenly all their backups will silently break unless
they're aware of disk space churn numbers and notice that they are
only backing up tiny bits.

With an ioctl, it's upto the damn application to *request* this
change, and then the VFS/filesystem and *maybe* support this, but the
application shouldn't actually know or care what the result is, it's
just a performance hint/request.  

We should default to sane semantics and not give out such a big
foot-gun if at all possible.  

I'm a sysadm by day (and night, evening, early morning... :-) and I
know my user's don't think about thinks like this. They don't even
think about backups until they want to restore something.  User's only
care about restores, not backups.

John


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ 

Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread John Stoffel
 Sage == Sage Weil s...@newdream.net writes:

Sage On Mon, 11 May 2015, Trond Myklebust wrote:
 On Mon, May 11, 2015 at 12:39 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 11 May 2015, Dave Chinner wrote:
  On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
   On Fri, May 8, 2015 at 6:24 PM, Sage Weil s...@newdream.net wrote:
I'm sure you realize what we're try to achieve is the same invisible 
IO
that the XFS open by handle ioctls do by default.  Would you be more
comfortable if this option where only available to the generic
open_by_handle syscall, and not to open(2)?
  
   It should be an ioctl(). It has no business being part of
   open_by_handle either, since that is another generic interface.
 
  Our use-case doesn't make sense on network file systems, but it does on
  any reasonably featureful local filesystem, and the goal is to be generic
  there.  If mtime is critical to a network file system's consistency it
  seems pretty reasonable to disallow/ignore it for just that file system
  (e.g., by masking off the flag at open time), as others won't have that
  same problem (cephfs doesn't, for example).
 
  Perhaps making each fs opt-in instead of handling it in a generic path
  would alleviate this concern?
 
 The issue isn't whether or not you have a network file system, it's
 whether or not you want users to be able to manage data. mtime isn't
 useful for the application (which knows whether or not it has changed
 the file) or for the filesystem (ditto). It exists, rather, in order
 to enable data management by users and other applications, letting
 them know whether or not the data contents of the file have changed,
 and when that change occurred.

Sage Agreed.
 
 If you are able to guarantee that your users don't care about that,
 then fine, but that would be a very special case that doesn't fit the
 way that most data centres are run. Backups are one case where mtime
 matters, tiering and archiving is another.

Sage This is true, although I argue it is becoming increasingly
Sage common for the data management (including backups and so forth)
Sage to be layered not on top of the POSIX file system but on
Sage something higher up in the stack. This is true of pretty much
Sage any distributed system (ceph, cassandra, mongo, etc., and I
Sage assume commercial databases like Oracle, too) where backups,
Sage replication, and any other DR strategies need to be orchestrated
Sage across nodes to be consistent--simply copying files out from
Sage underneath them is already insufficient and a recipe for
Sage disaster.

you're smoking crack here.  Backups are not layered at higher layers
unless absolutely necessary, such as for databases.  Now Mongo, Hadoop
and others might also fit this model, but for day to day backup of
data, it's mtime all the way.  

I don't see why you insist that this is a good idea to implement for a
very special corner case.  

Sage There is a growing category of applications that can benefit
Sage from this capability...

There is a perceived growing category of super special niche
applications which might think they want this capability.  

Why are you even using a filesystem in the first place if you're so
worried about writing out inodes being a performance problem?  Just
use raw partitions and do all the work yourself.  Oracle and other DBs
can do this when they want.  

 Neither of these examples
 cases are under the control of the application that calls
 open(O_NOMTIME).

Sage Wouldn't a mount option (e.g., allow_nomtime) address this
Sage concern?  Only nodes provisioned explicitly to run these systems
Sage would be enable this option.

Why do you keep coming back to a mount option?  What's wrong with a
per-file ioctl option?  Making this a mount option means that you
default to a fail hard setup.  If someone screws up and mounts user
home directories with this option thinking that it's like the noatime
option, then suddenly all their backups will silently break unless
they're aware of disk space churn numbers and notice that they are
only backing up tiny bits.

With an ioctl, it's upto the damn application to *request* this
change, and then the VFS/filesystem and *maybe* support this, but the
application shouldn't actually know or care what the result is, it's
just a performance hint/request.  

We should default to sane semantics and not give out such a big
foot-gun if at all possible.  

I'm a sysadm by day (and night, evening, early morning... :-) and I
know my user's don't think about thinks like this. They don't even
think about backups until they want to restore something.  User's only
care about restores, not backups.

John


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread John Stoffel
 Austin == Austin S Hemmelgarn ahferro...@gmail.com writes:

Austin On 2015-05-12 01:08, Kevin Easton wrote:
 On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
 On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
 Let me re-ask the question that I asked last week (and was apparently
 ignored).  Why not trying to use the lazytime feature instead of
 pointing a head straight at the application's --- and system
 administrators' --- heads?
 
 Sorry Ted, I thought I responded already.
 
 The goal is to avoid inode writeout entirely when we can, and
 as I understand it lazytime will still force writeout before the inode
 is dropped from the cache.  In systems like Ceph in particular, the
 IOs can be spread across lots of files, so simply deferring writeout
 doesn't always help.
 
 Sure, but it would reduce the writeout by orders of magnitude.  I can
 understand if you want to reduce it further, but it might be good
 enough for your purposes.
 
 I considered doing the equivalent of O_NOMTIME for our purposes at
 $WORK, and our use case is actually not that different from Ceph's
 (i.e., using a local disk file system to support a cluster file
 system), and lazytime was (a) something I figured was something I
 could upstream in good conscience, and (b) was more than good enough
 for us.
 
 A safer alternative might be a chattr file attribute that if set, the
 mtime is not updated on writes, and stat() on the file always shows the
 mtime as right now.  At least that way, the file won't accidentally
 get left out of backups that rely on the mtime.
 
 (If the file attribute is unset, you immediately update the mtime then
 too, and from then on the file is back to normal).
 

Austin I like this even better than the flag suggestion, it provides
Austin better control, means that you don't need to update
Austin applications to get the benefits, and prevents backup software
Austin from breaking (although backups would be bigger).

Me too, it fails in a safer mode, where you do more work on backups
than strictly needed.  I'm still against this as a mount option
though, way way way too many bullets in the foot gun.  And as someone
else said, once you mount with O_NOMTIME, then unmount, then mount
again without O_NOMTIME, you've lost information.  Not good.  

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
>>>>> "Linus" == Linus Torvalds  writes:

Linus> On Fri, May 8, 2015 at 7:40 AM, John Stoffel  wrote:
>> 
>> Now go and look at your /home or /data/ or /work areas, where the
>> endusers are actually keeping their day to day work.  Photos, mp3,
>> design files, source code, object code littered around, etc.

Linus> However, the big files in that list are almost immaterial from a
Linus> caching standpoint.

Linus> Caching source code is a big deal - just try not doing it and
Linus> you'll figure it out. And the kernel C source files used to
Linus> have a median size around 4k.

Caching any files is a big deal, and if I'm doing batch edits of large
jpegs, won't they get cached as well?   

Linus> The big files in your home directory? Let me make an educated
Linus> guess.  Very few to *none* of them are actually in your page
Linus> cache right now.  And you'd never even care if they ever made
Linus> it into your page cache *at*all*. Much less whether you could
Linus> ever cache them using large pages using some very fancy cache.

Hmm... probably not honestly, since I'm not a home and not using the
system actively right now.  But I can see situations where being able
to mix different page sizes efficiently might be a good thing.  

Linus> There are big files that care about caches, but they tend to be
Linus> binaries, and for other reasons (things like randomization) you
Linus> would never want to use largepages for those anyway.

Or large design files, like my users at $WORK use, which can be 4Gb in
size for a large design, which is ASIC chip layout work.  So I'm a
little bit in the minority there.  

And yes I do have other users will millions of itty bitty files as
well.  

Linus> So from a page cache standpoint, I think the 4kB size still
Linus> matters. A *lot*. largepages are a complete red herring, and
Linus> will continue to be so pretty much forever (anonymous
Linus> largepages perhaps less so).

I think in the future, being able to efficiently mix page sizes will
become useful, if only to lower the memory overhead of keeping track
of large numbers of pages. 

John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
> "Ingo" == Ingo Molnar  writes:

Ingo> * Rik van Riel  wrote:

>> The disadvantage is pretty obvious too: 4kB pages would no longer be 
>> the fast case, with an indirection. I do not know how much of an 
>> issue that would be, or whether it even makes sense for 4kB pages to 
>> continue being the fast case going forward.

Ingo> I strongly disagree that 4kB does not matter as much: it is _the_ 
Ingo> bread and butter of 99% of Linux usecases. 4kB isn't going away 
Ingo> anytime soon - THP might look nice in benchmarks, but it does not 
Ingo> matter nearly as much in practice and for filesystems and IO it's 
Ingo> absolutely crazy to think about 2MB granularity.

Ingo> Having said that, I don't think a single jump of indirection is a big 
Ingo> issue - except for the present case where all the pmem IO space is 
Ingo> mapped non-cacheable. Write-through caching patches are in the works 
Ingo> though, and that should make it plenty fast.

>> Memory trends point in one direction, file size trends in another.
>> 
>> For persistent memory, we would not need 4kB page struct pages 
>> unless memory from a particular area was in small files AND those 
>> files were being actively accessed. [...]

Ingo> Average file size on my system's /usr is 12.5K:

Ingo> triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") |
Ingo> sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n"
Ingo> | wc -l; ) | bc 12502

Now go and look at your /home or /data/ or /work areas, where the
endusers are actually keeping their day to day work.  Photos, mp3,
design files, source code, object code littered around, etc.

Now I also have 12Tb filesystems with 30+ million files in them, which
just *suck* for backup, esp incrementals.  I have one monster with 85+
million files (time to get beat on users again ...) which needs to be
pruned.

So I'm not arguing against you, I'm just saying you need better more
representative numbers across more day to day work.  Running this
exact same command against my home directory gets:

528989

So I'm not arguing one way or another... just providing numbers.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-08 Thread John Stoffel
> "Sage" == Sage Weil  writes:

Sage> On Thu, 7 May 2015, Zach Brown wrote:
>> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
>> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
>> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
>> > > owning the file or having the CAP_FOWNER capability.  If we're not
>> > > comfortable allowing owners to prevent mtime/ctime updates then we
>> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
>> > 
>> > I dislike "turn off safety for performance" options because Joe
>> > SpeedRacer will always select performance over safety.
>> 
>> Well, for ceph there's no safety concern.  They never use cmtime in
>> these files.
>> 
>> So are you suggesting not implementing this and making them rework their
>> IO paths to avoid the fs maintaining mtime so that we don't give Joe
>> Speedracer more rope?  Or are we talking about adding some speed bumps
>> that ceph can flip on that might give Joe Speedracer pause?

Sage> I think this is the fundamental question: who do we give the
Sage> ammunition to, the user or app writer, or the sysadmin?

Sage> One might argue that we gave the user a similar power with
Sage> O_NOATIME (the power to break applications that assume atime is
Sage> accurate).  Here we give developers/users the power to not
Sage> update mtime and suffer the consequences (like, obviously,
Sage> breaking mtime-based backups).  It should be pretty obvious to
Sage> anyone using the flag what the consequences are.

Not modifying atime doesn't really break anything except people who
think they can tell when a file was last accessed.  Which isn't
critical (unless your in a paranoid security conscious place...) but
MTIME is another beast entirely.   Turning that off is going to break
lots of hidden assumptions.  

Sage> Note that we can suffer similar lapses in mtime with fdatasync
Sage> followed by a system crash.  And as Andy points out it's
Sage> semi-broken for writable mmap.  The crash case is obviously a
Sage> slightly different thing, but the idea that mtime can't always
Sage> be trusted certainly isn't crazy talk.

True, but after a crash... people expect and understand there might be
corruption in a filesystem.  

Sage> Or, we can be conservative and require a mount option so that
Sage> the admin has to explicitly allow behavior that might break some
Sage> existing assumptions about mtime/ctime ('-o user_noatime' I
Sage> guess?).


Sage> I'm happy either way, so long as in the end an unprivileged ceph
Sage> daemon avoids the useless work.  In our case we always own the
Sage> entire mount/disk, so a mount option is just fine.

I agree with the mount option, makes it crystal clear.  And then it's
on the sysadmin/owner of the system to understand (ha!) the problems.

This is all me speaking with my Sysadmin hat firmly on my head.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-08 Thread John Stoffel
 Sage == Sage Weil s...@newdream.net writes:

Sage On Thu, 7 May 2015, Zach Brown wrote:
 On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
  On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
   The criteria for using O_NOMTIME is the same as for using O_NOATIME:
   owning the file or having the CAP_FOWNER capability.  If we're not
   comfortable allowing owners to prevent mtime/ctime updates then we
   should add a tunable to allow O_NOMTIME.  Maybe a mount option?
  
  I dislike turn off safety for performance options because Joe
  SpeedRacer will always select performance over safety.
 
 Well, for ceph there's no safety concern.  They never use cmtime in
 these files.
 
 So are you suggesting not implementing this and making them rework their
 IO paths to avoid the fs maintaining mtime so that we don't give Joe
 Speedracer more rope?  Or are we talking about adding some speed bumps
 that ceph can flip on that might give Joe Speedracer pause?

Sage I think this is the fundamental question: who do we give the
Sage ammunition to, the user or app writer, or the sysadmin?

Sage One might argue that we gave the user a similar power with
Sage O_NOATIME (the power to break applications that assume atime is
Sage accurate).  Here we give developers/users the power to not
Sage update mtime and suffer the consequences (like, obviously,
Sage breaking mtime-based backups).  It should be pretty obvious to
Sage anyone using the flag what the consequences are.

Not modifying atime doesn't really break anything except people who
think they can tell when a file was last accessed.  Which isn't
critical (unless your in a paranoid security conscious place...) but
MTIME is another beast entirely.   Turning that off is going to break
lots of hidden assumptions.  

Sage Note that we can suffer similar lapses in mtime with fdatasync
Sage followed by a system crash.  And as Andy points out it's
Sage semi-broken for writable mmap.  The crash case is obviously a
Sage slightly different thing, but the idea that mtime can't always
Sage be trusted certainly isn't crazy talk.

True, but after a crash... people expect and understand there might be
corruption in a filesystem.  

Sage Or, we can be conservative and require a mount option so that
Sage the admin has to explicitly allow behavior that might break some
Sage existing assumptions about mtime/ctime ('-o user_noatime' I
Sage guess?).


Sage I'm happy either way, so long as in the end an unprivileged ceph
Sage daemon avoids the useless work.  In our case we always own the
Sage entire mount/disk, so a mount option is just fine.

I agree with the mount option, makes it crystal clear.  And then it's
on the sysadmin/owner of the system to understand (ha!) the problems.

This is all me speaking with my Sysadmin hat firmly on my head.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
 Ingo == Ingo Molnar mi...@kernel.org writes:

Ingo * Rik van Riel r...@redhat.com wrote:

 The disadvantage is pretty obvious too: 4kB pages would no longer be 
 the fast case, with an indirection. I do not know how much of an 
 issue that would be, or whether it even makes sense for 4kB pages to 
 continue being the fast case going forward.

Ingo I strongly disagree that 4kB does not matter as much: it is _the_ 
Ingo bread and butter of 99% of Linux usecases. 4kB isn't going away 
Ingo anytime soon - THP might look nice in benchmarks, but it does not 
Ingo matter nearly as much in practice and for filesystems and IO it's 
Ingo absolutely crazy to think about 2MB granularity.

Ingo Having said that, I don't think a single jump of indirection is a big 
Ingo issue - except for the present case where all the pmem IO space is 
Ingo mapped non-cacheable. Write-through caching patches are in the works 
Ingo though, and that should make it plenty fast.

 Memory trends point in one direction, file size trends in another.
 
 For persistent memory, we would not need 4kB page struct pages 
 unless memory from a particular area was in small files AND those 
 files were being actively accessed. [...]

Ingo Average file size on my system's /usr is 12.5K:

Ingo triton:/usr ( echo -n $(echo $(find . -type f -printf %s\n) |
Ingo sed 's/ /+/g' | bc); echo -n /; find . -type f -printf %s\n
Ingo | wc -l; ) | bc 12502

Now go and look at your /home or /data/ or /work areas, where the
endusers are actually keeping their day to day work.  Photos, mp3,
design files, source code, object code littered around, etc.

Now I also have 12Tb filesystems with 30+ million files in them, which
just *suck* for backup, esp incrementals.  I have one monster with 85+
million files (time to get beat on users again ...) which needs to be
pruned.

So I'm not arguing against you, I'm just saying you need better more
representative numbers across more day to day work.  Running this
exact same command against my home directory gets:

528989

So I'm not arguing one way or another... just providing numbers.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
 Linus == Linus Torvalds torva...@linux-foundation.org writes:

Linus On Fri, May 8, 2015 at 7:40 AM, John Stoffel j...@stoffel.org wrote:
 
 Now go and look at your /home or /data/ or /work areas, where the
 endusers are actually keeping their day to day work.  Photos, mp3,
 design files, source code, object code littered around, etc.

Linus However, the big files in that list are almost immaterial from a
Linus caching standpoint.

Linus Caching source code is a big deal - just try not doing it and
Linus you'll figure it out. And the kernel C source files used to
Linus have a median size around 4k.

Caching any files is a big deal, and if I'm doing batch edits of large
jpegs, won't they get cached as well?   

Linus The big files in your home directory? Let me make an educated
Linus guess.  Very few to *none* of them are actually in your page
Linus cache right now.  And you'd never even care if they ever made
Linus it into your page cache *at*all*. Much less whether you could
Linus ever cache them using large pages using some very fancy cache.

Hmm... probably not honestly, since I'm not a home and not using the
system actively right now.  But I can see situations where being able
to mix different page sizes efficiently might be a good thing.  

Linus There are big files that care about caches, but they tend to be
Linus binaries, and for other reasons (things like randomization) you
Linus would never want to use largepages for those anyway.

Or large design files, like my users at $WORK use, which can be 4Gb in
size for a large design, which is ASIC chip layout work.  So I'm a
little bit in the minority there.  

And yes I do have other users will millions of itty bitty files as
well.  

Linus So from a page cache standpoint, I think the 4kB size still
Linus matters. A *lot*. largepages are a complete red herring, and
Linus will continue to be so pretty much forever (anonymous
Linus largepages perhaps less so).

I think in the future, being able to efficiently mix page sizes will
become useful, if only to lower the memory overhead of keeping track
of large numbers of pages. 

John

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
> "David" == David Herrmann  writes:

David> Hi
David> On Wed, Apr 29, 2015 at 10:43 PM, David Lang  wrote:
>> If the justification for why this needs to be in the kernel is that you
>> can't reliably prevent apps from exiting if there are pending messages, [...]

David> It's not.

>> the answer of "preventing apps from exiting if there are pending messages
>> isn't a sane thing to try and do" is a direct counter to that justification
>> for including it in the kernel.

David> It's optionally used for reliable exit-on-idle.

Then why is there a critical race that must be solved in the kernel if
it's optional?  And can you please describe in more detail what this
'exit-on-idle' thing is and how it works and why you would use it?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
> "Austin" == Austin S Hemmelgarn  writes:

Austin> On 2015-04-29 14:54, Andy Lutomirski wrote:
>> On Apr 29, 2015 5:48 AM, "Harald Hoyer"  wrote:
>>> 
>>> * Being in the kernel closes a lot of races which can't be fixed with
>>> the current userspace solutions.  For example, with kdbus, there is a
>>> way a client can disconnect from a bus, but do so only if no further
>>> messages present in its queue, which is crucial for implementing
>>> race-free "exit-on-idle" services
>> 
>> This can be implemented in userspace.
>> 
>> Client to dbus daemon: may I exit now?
>> Dbus daemon to client: yes (and no more messages) or no
>> 

Austin> Depending on how this is implemented, there would be a
Austin> potential issue if a message arrived for the client after the
Austin> daemon told it it could exit, but before it finished shutdown,
Austin> in which case the message might get lost.

What makes anyone think they can guarrantee that a message is even
received?  I could see the daemon sending the message and the client
getting a segfault and dumping core.  What then?  How would kdbus
solve this type of "race" anyway? 

Can anyone give a concrete example of one of the races that are closed
here?  That's been one of the missing examples.  And remember, there's
no perfection.  Even in the kernel we just had a discussion about
missed/missing IPIs and lost processor interrupts, etc.  Expecting
perfection is just asking for trouble.  

That's why there are timeouts, retries and just giving up and throwing
an exception.  

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
>>>>> "Steven" == Steven Rostedt  writes:

Steven> On Wed, Apr 29, 2015 at 12:26:59PM -0400, John Stoffel wrote:
>> 
>> If your customers wnat this feature, you're more than welcome to fork
>> the kernel and support it yourself.  Oh wait... Redhat does that
>> already.  So what's the problem?   Just put it into RHEL (which I use
>> I admit, along with Debian/Mint) and be done with it.

Steven> Red Hat tries very hard to push things upstream. It's policy
Steven> is to not keep things for themselves, but always work with the
Steven> community. That way, everyone benefits. Ideally, we should
Steven> come up with a solution that works for all.

Yeah, I agree they have been good.  I'm just reacting to the off the
cuff comment of "my customers need it" which isn't a justification for
this feature, esp when it hasn't been shown to be needed in the
kernel.  

We went through alot of this with tux the in-kernel httpd server, and
pushing other stuff out to user-space over the years.  Why this needs
to come in isn't clear.  Or why not just a small part needing to come
in with the rest in userspace.   

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
> "Harald" == Harald Hoyer  writes:

Harald> On 29.04.2015 15:33, Richard Weinberger wrote:
>> It depends how you define "beginning". To me an initramfs is a *very* minimal
>> tool to prepare the rootfs and nothing more (no udev, no systemd, no
>> "mini distro").
>> If the initramfs fails to do its job it can print to the console like
>> the kernel does if it fails
>> at a very early stage.
>> 

Harald> Your solution might work for your small personal needs, but
Harald> not for our customers.

Arguing that your needs outweight mine because you have customers
ain't gonna fly... I don't care about your customers and why should I?
I'm not getting any money from them.  Nor do I make any money from
Linux kernel though as an IT person, I support Linux all day long.  Do
my requirements get listened to as well?

If your customers wnat this feature, you're more than welcome to fork
the kernel and support it yourself.  Oh wait... Redhat does that
already.  So what's the problem?   Just put it into RHEL (which I use
I admit, along with Debian/Mint) and be done with it.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
 Harald == Harald Hoyer har...@redhat.com writes:

Harald On 29.04.2015 15:33, Richard Weinberger wrote:
 It depends how you define beginning. To me an initramfs is a *very* minimal
 tool to prepare the rootfs and nothing more (no udev, no systemd, no
 mini distro).
 If the initramfs fails to do its job it can print to the console like
 the kernel does if it fails
 at a very early stage.
 

Harald Your solution might work for your small personal needs, but
Harald not for our customers.

Arguing that your needs outweight mine because you have customers
ain't gonna fly... I don't care about your customers and why should I?
I'm not getting any money from them.  Nor do I make any money from
Linux kernel though as an IT person, I support Linux all day long.  Do
my requirements get listened to as well?

If your customers wnat this feature, you're more than welcome to fork
the kernel and support it yourself.  Oh wait... Redhat does that
already.  So what's the problem?   Just put it into RHEL (which I use
I admit, along with Debian/Mint) and be done with it.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
 Steven == Steven Rostedt rost...@goodmis.org writes:

Steven On Wed, Apr 29, 2015 at 12:26:59PM -0400, John Stoffel wrote:
 
 If your customers wnat this feature, you're more than welcome to fork
 the kernel and support it yourself.  Oh wait... Redhat does that
 already.  So what's the problem?   Just put it into RHEL (which I use
 I admit, along with Debian/Mint) and be done with it.

Steven Red Hat tries very hard to push things upstream. It's policy
Steven is to not keep things for themselves, but always work with the
Steven community. That way, everyone benefits. Ideally, we should
Steven come up with a solution that works for all.

Yeah, I agree they have been good.  I'm just reacting to the off the
cuff comment of my customers need it which isn't a justification for
this feature, esp when it hasn't been shown to be needed in the
kernel.  

We went through alot of this with tux the in-kernel httpd server, and
pushing other stuff out to user-space over the years.  Why this needs
to come in isn't clear.  Or why not just a small part needing to come
in with the rest in userspace.   

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
 Austin == Austin S Hemmelgarn ahferro...@gmail.com writes:

Austin On 2015-04-29 14:54, Andy Lutomirski wrote:
 On Apr 29, 2015 5:48 AM, Harald Hoyer har...@redhat.com wrote:
 
 * Being in the kernel closes a lot of races which can't be fixed with
 the current userspace solutions.  For example, with kdbus, there is a
 way a client can disconnect from a bus, but do so only if no further
 messages present in its queue, which is crucial for implementing
 race-free exit-on-idle services
 
 This can be implemented in userspace.
 
 Client to dbus daemon: may I exit now?
 Dbus daemon to client: yes (and no more messages) or no
 

Austin Depending on how this is implemented, there would be a
Austin potential issue if a message arrived for the client after the
Austin daemon told it it could exit, but before it finished shutdown,
Austin in which case the message might get lost.

What makes anyone think they can guarrantee that a message is even
received?  I could see the daemon sending the message and the client
getting a segfault and dumping core.  What then?  How would kdbus
solve this type of race anyway? 

Can anyone give a concrete example of one of the races that are closed
here?  That's been one of the missing examples.  And remember, there's
no perfection.  Even in the kernel we just had a discussion about
missed/missing IPIs and lost processor interrupts, etc.  Expecting
perfection is just asking for trouble.  

That's why there are timeouts, retries and just giving up and throwing
an exception.  

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-29 Thread John Stoffel
 David == David Herrmann dh.herrm...@gmail.com writes:

David Hi
David On Wed, Apr 29, 2015 at 10:43 PM, David Lang da...@lang.hm wrote:
 If the justification for why this needs to be in the kernel is that you
 can't reliably prevent apps from exiting if there are pending messages, [...]

David It's not.

 the answer of preventing apps from exiting if there are pending messages
 isn't a sane thing to try and do is a direct counter to that justification
 for including it in the kernel.

David It's optionally used for reliable exit-on-idle.

Then why is there a critical race that must be solved in the kernel if
it's optional?  And can you please describe in more detail what this
'exit-on-idle' thing is and how it works and why you would use it?


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-28 Thread John Stoffel
> "Havoc" == Havoc Pennington  writes:

Havoc> On Tue, Apr 28, 2015 at 1:18 PM, Theodore Ts'o  wrote:
>> So the question is if one of the justifications for moving the daemon
>> into kernel space is that it's performance is crap, then I think it is
>> useful to determine whether a fully optimized userspace daemon would
>> be good enough.
>> 

Havoc> Yeah. I don't know how you answer that, because the answer is
Havoc> probably "it would be good enough for some things and not for
Havoc> other things." It depends on whether an app is sending enough
Havoc> data to be too slow, and it depends on the hardware, right.

So what happens if we put kdbus into the kernel and it's still too
slow?  What then?  

Havoc> What I think we might know: the userspace:kernel time-to-send
Havoc> ratio should always be around 2:1, if both of them are
Havoc> similarly-implemented, because the userspace version has about
Havoc> 2x the work to do.

I'm not sure I agree with this statement, just putting something into
the kernel doesn't magically make the work go away, and the overhead
people are talking about won't change if applications and libraries
keep opening/closing the connection to the bus all the time.

Havoc> The actual wall-clock time of course depends on the hardware
Havoc> and what's being sent.

Havoc> If there was a deviation from 2:1 in a benchmark, it might be
Havoc> because of implementation issues - so for example
Havoc> libdbus+dbus-daemon might be 3:1 or 5:1 to sd-dbus+kdbus,
Havoc> because sd-dbus isn't as bloated as libdbus, say. That isn't
Havoc> telling you anything about kernel vs.  userspace architecture,
Havoc> the extra ratio above 2:1 is only telling you about userspace
Havoc> implementation quality.

Which is also telling you that maybe userspace could be improved more,
before it needs to even think about going into the kernel?  

Havoc> For purposes of deciding what to put in kernel - the
Havoc> differences between dbus client implementations (sd-dbus,
Havoc> libdbus, gdbus, etc.)  seem like irrelevant noise to me.

Havoc> Re: the slippery slope to LDAP in the kernel - my questions
Havoc> would be things like 1) what are non-performance reasons to
Havoc> have dbus in the kernel, such as early boot or security
Havoc> considerations; 2) does LDAP in kernel give these kind of 2:1
Havoc> gains; 3) is there a simpler way to get the 2:1 gain for
Havoc> dbus...

Havoc> Others can answer those better than I can.

Havoc> I _would_ say that dbus is more "generic" than something like
Havoc> LDAP; dbus is specific to the use-case of coordinating
Havoc> processes on a single machine, but it isn't specific to any
Havoc> particular application, and it's been used for lots of
Havoc> different applications. On my laptop, which is a pretty normal
Havoc> fedora 21 as far as I know:

LDAP is pretty damn generic, in that you can put pretty large objects
into it, and pretty large OUs, etc.  So why would it be a candidate
for going into the kernel?  And why is kdbus so important in the
kernel as well?  People have talked about it needing to be there for
bootup, but isn't that why we ripped out RAID detection and such from
the kernel and built initramfs, so that there's LESS in the kernel,
and more in an early userspace?  Same idea with dbus in my opinion.

Havoc> $ rpm -q --whatrequires 'libdbus-1.so.3()(64bit)' | wc -l
Havoc> 113

Havoc> this omits anyone using a different binding, it's only libdbus users.

>> I find dbus to be extremely hard to debug when my desktop starts doing
>> things I don't want it to do.  The fact that it might be flinging around 
>> hundreds
>> of thousands of messages, and that this is something we want to encourage,

Havoc> This particular argument doesn't resonate with me ... if dbus
Havoc> is hard to debug, it's not as if "ad hoc application-specific
Havoc> sidechannel somebody cooked up" is going to be easier.

When Ted is saying it's hard to debug... then maybe it's a bit crappy
in design or implementation?  

Havoc> People aren't usually making up data to send around just because they
Havoc> can. If they need to send an audio stream, and dbus is too slow,
Havoc> they'll send it another ad hoc way, but it ultimately has to get sent.
Havoc> Same for most data, it is the size it is and it needs to go where it
Havoc> needs to go, for some what-the-user-wants-to-do kind of reason.

Havoc> If apps have to, they say "I'm sorry Dave I can't do that - you
Havoc> can't software-decode 4K video on your 300mhz ARM" - of course.

So why DOES audio need to go via DBUS?  What about video?  Why
shouldn't that go via dbus as well?  

If one userspace implementation is so crappy, why can't that
implementation be tossed and a better one done?  Or why can't they
just optimize/tune it in userspace instead?  

John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-28 Thread John Stoffel
 Havoc == Havoc Pennington h...@pobox.com writes:

Havoc On Tue, Apr 28, 2015 at 1:18 PM, Theodore Ts'o ty...@mit.edu wrote:
 So the question is if one of the justifications for moving the daemon
 into kernel space is that it's performance is crap, then I think it is
 useful to determine whether a fully optimized userspace daemon would
 be good enough.
 

Havoc Yeah. I don't know how you answer that, because the answer is
Havoc probably it would be good enough for some things and not for
Havoc other things. It depends on whether an app is sending enough
Havoc data to be too slow, and it depends on the hardware, right.

So what happens if we put kdbus into the kernel and it's still too
slow?  What then?  

Havoc What I think we might know: the userspace:kernel time-to-send
Havoc ratio should always be around 2:1, if both of them are
Havoc similarly-implemented, because the userspace version has about
Havoc 2x the work to do.

I'm not sure I agree with this statement, just putting something into
the kernel doesn't magically make the work go away, and the overhead
people are talking about won't change if applications and libraries
keep opening/closing the connection to the bus all the time.

Havoc The actual wall-clock time of course depends on the hardware
Havoc and what's being sent.

Havoc If there was a deviation from 2:1 in a benchmark, it might be
Havoc because of implementation issues - so for example
Havoc libdbus+dbus-daemon might be 3:1 or 5:1 to sd-dbus+kdbus,
Havoc because sd-dbus isn't as bloated as libdbus, say. That isn't
Havoc telling you anything about kernel vs.  userspace architecture,
Havoc the extra ratio above 2:1 is only telling you about userspace
Havoc implementation quality.

Which is also telling you that maybe userspace could be improved more,
before it needs to even think about going into the kernel?  

Havoc For purposes of deciding what to put in kernel - the
Havoc differences between dbus client implementations (sd-dbus,
Havoc libdbus, gdbus, etc.)  seem like irrelevant noise to me.

Havoc Re: the slippery slope to LDAP in the kernel - my questions
Havoc would be things like 1) what are non-performance reasons to
Havoc have dbus in the kernel, such as early boot or security
Havoc considerations; 2) does LDAP in kernel give these kind of 2:1
Havoc gains; 3) is there a simpler way to get the 2:1 gain for
Havoc dbus...

Havoc Others can answer those better than I can.

Havoc I _would_ say that dbus is more generic than something like
Havoc LDAP; dbus is specific to the use-case of coordinating
Havoc processes on a single machine, but it isn't specific to any
Havoc particular application, and it's been used for lots of
Havoc different applications. On my laptop, which is a pretty normal
Havoc fedora 21 as far as I know:

LDAP is pretty damn generic, in that you can put pretty large objects
into it, and pretty large OUs, etc.  So why would it be a candidate
for going into the kernel?  And why is kdbus so important in the
kernel as well?  People have talked about it needing to be there for
bootup, but isn't that why we ripped out RAID detection and such from
the kernel and built initramfs, so that there's LESS in the kernel,
and more in an early userspace?  Same idea with dbus in my opinion.

Havoc $ rpm -q --whatrequires 'libdbus-1.so.3()(64bit)' | wc -l
Havoc 113

Havoc this omits anyone using a different binding, it's only libdbus users.

 I find dbus to be extremely hard to debug when my desktop starts doing
 things I don't want it to do.  The fact that it might be flinging around 
 hundreds
 of thousands of messages, and that this is something we want to encourage,

Havoc This particular argument doesn't resonate with me ... if dbus
Havoc is hard to debug, it's not as if ad hoc application-specific
Havoc sidechannel somebody cooked up is going to be easier.

When Ted is saying it's hard to debug... then maybe it's a bit crappy
in design or implementation?  

Havoc People aren't usually making up data to send around just because they
Havoc can. If they need to send an audio stream, and dbus is too slow,
Havoc they'll send it another ad hoc way, but it ultimately has to get sent.
Havoc Same for most data, it is the size it is and it needs to go where it
Havoc needs to go, for some what-the-user-wants-to-do kind of reason.

Havoc If apps have to, they say I'm sorry Dave I can't do that - you
Havoc can't software-decode 4K video on your 300mhz ARM - of course.

So why DOES audio need to go via DBUS?  What about video?  Why
shouldn't that go via dbus as well?  

If one userspace implementation is so crappy, why can't that
implementation be tossed and a better one done?  Or why can't they
just optimize/tune it in userspace instead?  

John

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-14 Thread John Stoffel
> "Greg" == Greg Kroah-Hartman  writes:

Greg> On Tue, Apr 14, 2015 at 11:57:22AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 14, 2015 at 10:50 AM, Greg Kroah-Hartman
>>  wrote:
>> > On Mon, Apr 13, 2015 at 02:01:21PM -0700, Andy Lutomirski wrote:
>> >> On Mon, Apr 13, 2015 at 1:45 PM, Greg Kroah-Hartman
>> >>  wrote:
>> >> > On Mon, Apr 13, 2015 at 01:13:26PM -0700, Andy Lutomirski wrote:
>> >> >> On Mon, Apr 13, 2015 at 12:03 PM, Greg Kroah-Hartman
>> >> >>  wrote:
>> >> >> > The following changes since commit 
>> >> >> > 9eccca0843205f87c00404b663188b88eb248051:
>> >> >> >
>> >> >> >   Linux 4.0-rc3 (2015-03-08 16:09:09 -0700)
>> >> >> >
>> >> >> > are available in the git repository at:
>> >> >> >
>> >> >> >   
>> >> >> > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git/ 
>> >> >> > tags/kdbus-4.1-rc1
>> >> >> >
>> >> >> > for you to fetch changes up to 
>> >> >> > 9fb9cd0f4434a23487b6ef3237e733afae90e336:
>> >> >> >
>> >> >> >   kdbus: avoid the use of struct timespec (2015-04-10 14:34:53 +0200)
>> >> >> >
>> >> >> > 
>> >> >> > kdbus for 4.1-rc1
>> >> >> >
>> >> >> > Here's the kdbus pull request for 4.1-rc1.
>> >> >> >
>> >> >> > It's been under development for many years now, and been in 
>> >> >> > linux-next
>> >> >> > for many months, and has undergone loads of testing a review and 
>> >> >> > even a few
>> >> >> > good arguments.  It comes with full documentation and tests.
>> >> >> >
>> >> >> > There has been a few complaints about the code, notably from people 
>> >> >> > who
>> >> >> > don't like the use of metadata in the bus messages.  That is actually
>> >> >> > one of the main features here, as we can get this data in a secure 
>> >> >> > and
>> >> >> > reliable way, and it's something that userspace requires today.  So
>> >> >> > while it does look "odd" to people who are not familiar with dbus, 
>> >> >> > this
>> >> >> > is something that finally fixes a number of almost unfixable races in
>> >> >> > the current dbus implementations.
>> >> >>
>> >> >> While I generally like the concept of having a better in-kernel IPC
>> >> >> mechanism, after some consideration I don't think this belongs in the
>> >> >> kernel in its current form.  Here's why.
>> >> >>
>> >> >> First, the naming is counterintuitive.  There are "endpoints", but you
>> >> >> don't send messages to endpoints.  In fact, an basic kdbus setup will
>> >> >> have exactly one endpoint AFAICT.  Wtf?  This makes talking about it
>> >> >> awkward.
>> >> >
>> >> > Did you read the documentation?  We've been over this before, and it
>> >> > should all be addressed in the documentation based on this coming up.
>> >> >
>> >> >> A lot of the design seems to be to violate the concept of "mechanism,
>> >> >> not policy".  Kdbus is very much a port of userspace dbus to the
>> >> >> kernel, and it appears to be a port designed to preserve some
>> >> >> questionable design decisions instead of learning from them.
>> >> >>
>> >> >> For example, kdbus sticks a whole policy database in the kernel, but
>> >> >> that policy database (AFAICT -- holy crap it's overcomplicated) is
>> >> >> *not* a simple set of rules like "if A then allow B".  Instead it has
>> >> >> really weird dependencies not on what name you're sending to but on
>> >> >> what *other* names the thing you're sending to has.  Sorry, but this
>> >> >> way lies (a) the inability for a large set of developers to understand
>> >> >> what's going on and (b) security bugs.  Also, the result probably
>> >> >> can't be reused as part of a non-legacy-filled sensible design
>> >> >
>> >> > What policy database?  Matching messages to subscribers?  That's the
>> >> > same type of "database" that other ipc subsystems need/want, there's
>> >> > nothing radical here.
>> >>
>> >> Let me quote from the latest version of the kdbus docs:
>> >>
>> >>   Note that TALK access is checked against all names of a connection. 
>> >> For
>> >>   example, if a connection owns both 
>> >> 'org.foo.bar' and
>> >>   'org.blah.baz', and the policy database allows
>> >>   'org.blah.baz' to be talked to by WORLD, then 
>> >> this
>> >>   permission is also granted to 'org.foo.bar'. 
>> >> That
>> >>   might sound illogical, but after all, we allow messages to be 
>> >> directed to
>> >>   either the ID or a well-known name, and policy is applied to the
>> >>   connection, not the name. In other words, the effective TALK policy 
>> >> for a
>> >>   connection is the most permissive of all names the connection owns.
>> >>
>> >> In my humble opinion, this paragraph speaks for itself.  The design is
>> >> bad, full stop.
>> >
>> > First off, thanks for reading the docs, I appreciate that.  But realize
>> > also, that this is straight from the D-Bus spec.  We aren't doing
>> > anything "radical" here, this is what your desktop uses that you are
>> > typing your email from.
>> >
>> > Yes, it's an 

Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-14 Thread John Stoffel
 Greg == Greg Kroah-Hartman gre...@linuxfoundation.org writes:

Greg On Tue, Apr 14, 2015 at 11:57:22AM -0700, Andy Lutomirski wrote:
 On Tue, Apr 14, 2015 at 10:50 AM, Greg Kroah-Hartman
 gre...@linuxfoundation.org wrote:
  On Mon, Apr 13, 2015 at 02:01:21PM -0700, Andy Lutomirski wrote:
  On Mon, Apr 13, 2015 at 1:45 PM, Greg Kroah-Hartman
  gre...@linuxfoundation.org wrote:
   On Mon, Apr 13, 2015 at 01:13:26PM -0700, Andy Lutomirski wrote:
   On Mon, Apr 13, 2015 at 12:03 PM, Greg Kroah-Hartman
   gre...@linuxfoundation.org wrote:
The following changes since commit 
9eccca0843205f87c00404b663188b88eb248051:
   
  Linux 4.0-rc3 (2015-03-08 16:09:09 -0700)
   
are available in the git repository at:
   
  
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git/ 
tags/kdbus-4.1-rc1
   
for you to fetch changes up to 
9fb9cd0f4434a23487b6ef3237e733afae90e336:
   
  kdbus: avoid the use of struct timespec (2015-04-10 14:34:53 +0200)
   

kdbus for 4.1-rc1
   
Here's the kdbus pull request for 4.1-rc1.
   
It's been under development for many years now, and been in 
linux-next
for many months, and has undergone loads of testing a review and 
even a few
good arguments.  It comes with full documentation and tests.
   
There has been a few complaints about the code, notably from people 
who
don't like the use of metadata in the bus messages.  That is actually
one of the main features here, as we can get this data in a secure 
and
reliable way, and it's something that userspace requires today.  So
while it does look odd to people who are not familiar with dbus, 
this
is something that finally fixes a number of almost unfixable races in
the current dbus implementations.
  
   While I generally like the concept of having a better in-kernel IPC
   mechanism, after some consideration I don't think this belongs in the
   kernel in its current form.  Here's why.
  
   First, the naming is counterintuitive.  There are endpoints, but you
   don't send messages to endpoints.  In fact, an basic kdbus setup will
   have exactly one endpoint AFAICT.  Wtf?  This makes talking about it
   awkward.
  
   Did you read the documentation?  We've been over this before, and it
   should all be addressed in the documentation based on this coming up.
  
   A lot of the design seems to be to violate the concept of mechanism,
   not policy.  Kdbus is very much a port of userspace dbus to the
   kernel, and it appears to be a port designed to preserve some
   questionable design decisions instead of learning from them.
  
   For example, kdbus sticks a whole policy database in the kernel, but
   that policy database (AFAICT -- holy crap it's overcomplicated) is
   *not* a simple set of rules like if A then allow B.  Instead it has
   really weird dependencies not on what name you're sending to but on
   what *other* names the thing you're sending to has.  Sorry, but this
   way lies (a) the inability for a large set of developers to understand
   what's going on and (b) security bugs.  Also, the result probably
   can't be reused as part of a non-legacy-filled sensible design
  
   What policy database?  Matching messages to subscribers?  That's the
   same type of database that other ipc subsystems need/want, there's
   nothing radical here.
 
  Let me quote from the latest version of the kdbus docs:
 
Note that TALK access is checked against all names of a connection. 
  For
example, if a connection owns both 
  constant'org.foo.bar'/constant and
constant'org.blah.baz'/constant, and the policy database allows
constant'org.blah.baz'/constant to be talked to by WORLD, then 
  this
permission is also granted to constant'org.foo.bar'/constant. 
  That
might sound illogical, but after all, we allow messages to be 
  directed to
either the ID or a well-known name, and policy is applied to the
connection, not the name. In other words, the effective TALK policy 
  for a
connection is the most permissive of all names the connection owns.
 
  In my humble opinion, this paragraph speaks for itself.  The design is
  bad, full stop.
 
  First off, thanks for reading the docs, I appreciate that.  But realize
  also, that this is straight from the D-Bus spec.  We aren't doing
  anything radical here, this is what your desktop uses that you are
  typing your email from.
 
  Yes, it's an unfortunate design, but one that we are all stuck with
  (think of it as having to implement code for horrid hardware that you
  have to get to work properly.)
 
 I agree.  You've sent a pull request for an unfortunate design.  I
 don't think that unfortunate design belongs in the kernel.  If it says
 in userspace, then user programmers could potentially fix it some day.

Greg You might not like the design, but 

Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel
>>>>> "David" == David Miller  writes:

David> From: "John Stoffel" 
David> Date: Mon, 23 Mar 2015 12:51:03 -0400

>> Would it make sense to have some memmove()/memcopy() tests on bootup
>> to catch problems like this?  I know this is a strange case, and
>> probably not too common, but how hard would it be to wire up tests
>> that go through 1 to 128 byte memmove() on bootup to make sure things
>> work properly?
>> 
>> This seems like one of those critical, but subtle things to be
>> checked.  And doing it only on bootup wouldn't slow anything down and
>> would (ideally) automatically get us coverage when people add new
>> archs or update the code.

David> One of two things is already happening.

David> There have been assembler memcpy/memset development test harnesses
David> around that most arch developers are using, and those test things
David> rather extensively.

David> Also, the memcpy/memset routines on sparc in particular are completely
David> shared with glibc, we use the same exact code in both trees.  So it's
David> getting tested there too.

Thats' good to know.   I wasn't sure.

David> memmove() is just not handled this way.

Bummers.  So why isn't this covered by the glibc tests too?  Not
accusing, not at all!  Just wondering.  

Thanks for all your work David, I've been amazed at your energy here!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel

David> 
David> [PATCH] sparc64: Fix several bugs in memmove().

David> Firstly, handle zero length calls properly.  Believe it or not there
David> are a few of these happening during early boot.

David> Next, we can't just drop to a memcpy() call in the forward copy case
David> where dst <= src.  The reason is that the cache initializing stores
David> used in the Niagara memcpy() implementations can end up clearing out
David> cache lines before we've sourced their original contents completely.

David> For example, considering NG4memcpy, the main unrolled loop begins like
David> this:

David>  load   src + 0x00
David>  load   src + 0x08
David>  load   src + 0x10
David>  load   src + 0x18
David>  load   src + 0x20
David>  store  dst + 0x00

David> Assume dst is 64 byte aligned and let's say that dst is src - 8 for
David> this memcpy() call.  That store at the end there is the one to the
David> first line in the cache line, thus clearing the whole line, which thus
David> clobbers "src + 0x28" before it even gets loaded.

David> To avoid this, just fall through to a simple copy only mildly
David> optimized for the case where src and dst are 8 byte aligned and the
David> length is a multiple of 8 as well.  We could get fancy and call
David> GENmemcpy() but this is good enough for how this thing is actually
David> used.

Would it make sense to have some memmove()/memcopy() tests on bootup
to catch problems like this?  I know this is a strange case, and
probably not too common, but how hard would it be to wire up tests
that go through 1 to 128 byte memmove() on bootup to make sure things
work properly?

This seems like one of those critical, but subtle things to be
checked.  And doing it only on bootup wouldn't slow anything down and
would (ideally) automatically get us coverage when people add new
archs or update the code.

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel
 David == David Miller da...@davemloft.net writes:

David From: John Stoffel j...@stoffel.org
David Date: Mon, 23 Mar 2015 12:51:03 -0400

 Would it make sense to have some memmove()/memcopy() tests on bootup
 to catch problems like this?  I know this is a strange case, and
 probably not too common, but how hard would it be to wire up tests
 that go through 1 to 128 byte memmove() on bootup to make sure things
 work properly?
 
 This seems like one of those critical, but subtle things to be
 checked.  And doing it only on bootup wouldn't slow anything down and
 would (ideally) automatically get us coverage when people add new
 archs or update the code.

David One of two things is already happening.

David There have been assembler memcpy/memset development test harnesses
David around that most arch developers are using, and those test things
David rather extensively.

David Also, the memcpy/memset routines on sparc in particular are completely
David shared with glibc, we use the same exact code in both trees.  So it's
David getting tested there too.

Thats' good to know.   I wasn't sure.

David memmove() is just not handled this way.

Bummers.  So why isn't this covered by the glibc tests too?  Not
accusing, not at all!  Just wondering.  

Thanks for all your work David, I've been amazed at your energy here!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.0.0-rc4: panic in free_block

2015-03-23 Thread John Stoffel

David 
David [PATCH] sparc64: Fix several bugs in memmove().

David Firstly, handle zero length calls properly.  Believe it or not there
David are a few of these happening during early boot.

David Next, we can't just drop to a memcpy() call in the forward copy case
David where dst = src.  The reason is that the cache initializing stores
David used in the Niagara memcpy() implementations can end up clearing out
David cache lines before we've sourced their original contents completely.

David For example, considering NG4memcpy, the main unrolled loop begins like
David this:

David  load   src + 0x00
David  load   src + 0x08
David  load   src + 0x10
David  load   src + 0x18
David  load   src + 0x20
David  store  dst + 0x00

David Assume dst is 64 byte aligned and let's say that dst is src - 8 for
David this memcpy() call.  That store at the end there is the one to the
David first line in the cache line, thus clearing the whole line, which thus
David clobbers src + 0x28 before it even gets loaded.

David To avoid this, just fall through to a simple copy only mildly
David optimized for the case where src and dst are 8 byte aligned and the
David length is a multiple of 8 as well.  We could get fancy and call
David GENmemcpy() but this is good enough for how this thing is actually
David used.

Would it make sense to have some memmove()/memcopy() tests on bootup
to catch problems like this?  I know this is a strange case, and
probably not too common, but how hard would it be to wire up tests
that go through 1 to 128 byte memmove() on bootup to make sure things
work properly?

This seems like one of those critical, but subtle things to be
checked.  And doing it only on bootup wouldn't slow anything down and
would (ideally) automatically get us coverage when people add new
archs or update the code.

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/2] x86_64,signal: Remove 'fs' and 'gs' from sigcontext

2015-03-10 Thread John Stoffel
> "Andy" == Andy Lutomirski  writes:

Andy> As far as I can tell, these fields have been set to zero on save and
Andy> ignored on restore since Linux was imported into git.  Rename them
Andy> '__pad1' and '__pad2' to avoid confusion and to allow them to be
Andy> recycled some day.

Andy> I'm intentionally avoiding calling either of them __pad0: the field
Andy> formerly known as __pad0 is now ss.

Andy> Signed-off-by: Andy Lutomirski 
Andy> ---
Andy>  arch/x86/include/asm/sigcontext.h  | 4 ++--
Andy>  arch/x86/include/uapi/asm/sigcontext.h | 4 ++--
Andy>  arch/x86/kernel/signal.c   | 4 ++--
Andy>  3 files changed, 6 insertions(+), 6 deletions(-)

Andy> diff --git a/arch/x86/include/asm/sigcontext.h 
b/arch/x86/include/asm/sigcontext.h
Andy> index f910cdcb71fd..5f0ef11719e1 100644
Andy> --- a/arch/x86/include/asm/sigcontext.h
Andy> +++ b/arch/x86/include/asm/sigcontext.h
Andy> @@ -57,8 +57,8 @@ struct sigcontext {
Andy>   unsigned long ip;
Andy>   unsigned long flags;
Andy>   unsigned short cs;
Andy> - unsigned short gs;
Andy> - unsigned short fs;
Andy> + unsigned short __pad2;  /* Was called gs, but was always zero. */
Andy> + unsigned short __pad1;  /* Was called gs, but was always zero. */

Shouldn't this comment read:

  /* Was called fs, but was always zero. */

for the second one?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/2] x86_64,signal: Remove 'fs' and 'gs' from sigcontext

2015-03-10 Thread John Stoffel
 Andy == Andy Lutomirski l...@amacapital.net writes:

Andy As far as I can tell, these fields have been set to zero on save and
Andy ignored on restore since Linux was imported into git.  Rename them
Andy '__pad1' and '__pad2' to avoid confusion and to allow them to be
Andy recycled some day.

Andy I'm intentionally avoiding calling either of them __pad0: the field
Andy formerly known as __pad0 is now ss.

Andy Signed-off-by: Andy Lutomirski l...@amacapital.net
Andy ---
Andy  arch/x86/include/asm/sigcontext.h  | 4 ++--
Andy  arch/x86/include/uapi/asm/sigcontext.h | 4 ++--
Andy  arch/x86/kernel/signal.c   | 4 ++--
Andy  3 files changed, 6 insertions(+), 6 deletions(-)

Andy diff --git a/arch/x86/include/asm/sigcontext.h 
b/arch/x86/include/asm/sigcontext.h
Andy index f910cdcb71fd..5f0ef11719e1 100644
Andy --- a/arch/x86/include/asm/sigcontext.h
Andy +++ b/arch/x86/include/asm/sigcontext.h
Andy @@ -57,8 +57,8 @@ struct sigcontext {
Andy   unsigned long ip;
Andy   unsigned long flags;
Andy   unsigned short cs;
Andy - unsigned short gs;
Andy - unsigned short fs;
Andy + unsigned short __pad2;  /* Was called gs, but was always zero. */
Andy + unsigned short __pad1;  /* Was called gs, but was always zero. */

Shouldn't this comment read:

  /* Was called fs, but was always zero. */

for the second one?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/9] Support follow_link in RCU-walk.

2015-03-05 Thread John Stoffel
>>>>> "Al" == Al Viro  writes:

Al> On Thu, Mar 05, 2015 at 08:52:20AM -0500, John Stoffel wrote:
>> So what happens if your filesystem is 10Tb in size, and you have 50
>> million files and lots of them are symlinks?  I've got developers who
>> do shit like this and wonder why performance sucks  and I just
>> worry that GPF_KERNEL is a limited resource.  But maybe 64bit systems
>> won't really have any problems?

Al> What would keep all those symlinks' contents pinned in page cache?

Dunno honestly... but your statement nudges my memory that they would
be evicted as memory pressure grows, so it's probably not a problem
afterall.  I just know that my users love tickling strange bugs like
this because they hate to actually cleanup after themselves unless
absolutely necessary.

Thanks for the nudge.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/9] Support follow_link in RCU-walk.

2015-03-05 Thread John Stoffel
> "Al" == Al Viro  writes:

Al> On Thu, Mar 05, 2015 at 04:21:21PM +1100, NeilBrown wrote:
>> Hi Al (and others),
>> 
>> I wonder if you could look over this patchset.
>> It allows RCU-walk to follow symlinks in many common cases,
>> thus removing a surprising performance hit caused by using symlinks.
>> 
>> The last could of patches make changes to XFS and NFS to support
>> this but I haven't forwarded to the relevant lists yet.
>> If/when the early code meets with approval I'll do that.
>> 
>> The first patch almost certainly needs to be changed.  I originally
>> wrote this code when filesystems could see inside nameidata.
>> It is now opaque so the simplest solution was to provide an
>> accessor function.
>> Maybe I should as a 'flags' arg to ->follow_link?? Or have
-> follow_link and ->follow_link_rcu ??
>> What do you suggest?

Al> Umm...  Some observations:
Al> * now ->follow_link() can be called in RCU mode, which means
Al> that it can race with fs shutdown; not a problem, except that now it
Al> joins ->lookup() et.al. in "if some data structure is needed in RCU
Al> case of that, make sure it's not destroyed without an RCU delay somewhere
Al> between the entry into ->kill_sb() and destruction.
Al> * highmem pages in symlinks: that BS shouldn't be allowed at
Al> all.  Just make sure that at least for those filesystems symlink inodes
Al> get mapping_set_gfp_mask(>i_data, GFP_KERNEL) and be done with that.


So what happens if your filesystem is 10Tb in size, and you have 50
million files and lots of them are symlinks?  I've got developers who
do shit like this and wonder why performance sucks  and I just
worry that GPF_KERNEL is a limited resource.  But maybe 64bit systems
won't really have any problems?

Rest of this is way outside my pay grade to commment on.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/9] Support follow_link in RCU-walk.

2015-03-05 Thread John Stoffel
 Al == Al Viro v...@zeniv.linux.org.uk writes:

Al On Thu, Mar 05, 2015 at 08:52:20AM -0500, John Stoffel wrote:
 So what happens if your filesystem is 10Tb in size, and you have 50
 million files and lots of them are symlinks?  I've got developers who
 do shit like this and wonder why performance sucks  and I just
 worry that GPF_KERNEL is a limited resource.  But maybe 64bit systems
 won't really have any problems?

Al What would keep all those symlinks' contents pinned in page cache?

Dunno honestly... but your statement nudges my memory that they would
be evicted as memory pressure grows, so it's probably not a problem
afterall.  I just know that my users love tickling strange bugs like
this because they hate to actually cleanup after themselves unless
absolutely necessary.

Thanks for the nudge.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/9] Support follow_link in RCU-walk.

2015-03-05 Thread John Stoffel
 Al == Al Viro v...@zeniv.linux.org.uk writes:

Al On Thu, Mar 05, 2015 at 04:21:21PM +1100, NeilBrown wrote:
 Hi Al (and others),
 
 I wonder if you could look over this patchset.
 It allows RCU-walk to follow symlinks in many common cases,
 thus removing a surprising performance hit caused by using symlinks.
 
 The last could of patches make changes to XFS and NFS to support
 this but I haven't forwarded to the relevant lists yet.
 If/when the early code meets with approval I'll do that.
 
 The first patch almost certainly needs to be changed.  I originally
 wrote this code when filesystems could see inside nameidata.
 It is now opaque so the simplest solution was to provide an
 accessor function.
 Maybe I should as a 'flags' arg to -follow_link?? Or have
- follow_link and -follow_link_rcu ??
 What do you suggest?

Al Umm...  Some observations:
Al * now -follow_link() can be called in RCU mode, which means
Al that it can race with fs shutdown; not a problem, except that now it
Al joins -lookup() et.al. in if some data structure is needed in RCU
Al case of that, make sure it's not destroyed without an RCU delay somewhere
Al between the entry into -kill_sb() and destruction.
Al * highmem pages in symlinks: that BS shouldn't be allowed at
Al all.  Just make sure that at least for those filesystems symlink inodes
Al get mapping_set_gfp_mask(inode-i_data, GFP_KERNEL) and be done with that.


So what happens if your filesystem is 10Tb in size, and you have 50
million files and lots of them are symlinks?  I've got developers who
do shit like this and wonder why performance sucks  and I just
worry that GPF_KERNEL is a limited resource.  But maybe 64bit systems
won't really have any problems?

Rest of this is way outside my pay grade to commment on.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] usb: serial: Perform verification for FTDI FT232R devices

2014-10-23 Thread John Stoffel


So what else is in those magic 2.12.00 official drivers besides this
eeprom magic?  And why don't you printer a much more informative
message to the logs when you do fail a chip?  

No matter what you say here, you're targetting end users with this
patch, even if you're just trying to put pressure on the vendors
making knock off copies of the chip.  Which isn't good, but I know who
I'd be mad at if this bricked my USB to serial cable in the name of
vendor chip purity.  

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] usb: serial: Perform verification for FTDI FT232R devices

2014-10-23 Thread John Stoffel


So what else is in those magic 2.12.00 official drivers besides this
eeprom magic?  And why don't you printer a much more informative
message to the logs when you do fail a chip?  

No matter what you say here, you're targetting end users with this
patch, even if you're just trying to put pressure on the vendors
making knock off copies of the chip.  Which isn't good, but I know who
I'd be mad at if this bricked my USB to serial cable in the name of
vendor chip purity.  

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.16-rc6

2014-07-24 Thread John Stoffel
> "Waiman" == Waiman Long  writes:

Waiman> On 07/24/2014 02:36 PM, Peter Zijlstra wrote:
>> On Thu, Jul 24, 2014 at 11:18:16AM -0700, Linus Torvalds wrote:
>>> On Thu, Jul 24, 2014 at 5:58 AM, Peter Zijlstra  
>>> wrote:
 So going by the nifty picture rostedt made:
 
 [   61.454336]CPU0CPU1
 [   61.454336]
 [   61.454336]   lock(&(>alloc_lock)->rlock);
 [   61.454336]local_irq_disable();
 [   61.454336]lock(tasklist_lock);
 [   61.454336]
 lock(&(>alloc_lock)->rlock);
 [   61.454336]
 [   61.454336] lock(tasklist_lock);
>>> So this *should* be fine. It always has been in the past, and it was
>>> certainly the *intention* that it should continue to work with
>>> qrwlock, even in the presense of pending writers on other cpu's.
>>> 
>>> The qrwlock rules are that a read-lock in an interrupt is still going
>>> to be unfair and succeed if there are other readers.
>> Ah, indeed. Should have checked :/
>> 
>>> So it sounds to me like the new lockdep rules in tip/master are too
>>> strict and are throwing a false positive.
>> Right. Waiman can you have a look?

Waiman> Yes, I think I may have a solution for that.

Waiman> Borislav, can you apply the following patch on top of the lockdep patch 
Waiman> to see if it can fix the problem?

Waiman> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
Waiman> index d24e433..507a8ce 100644
Waiman> --- a/kernel/locking/lockdep.c
Waiman> +++ b/kernel/locking/lockdep.c
Waiman> @@ -3595,6 +3595,12 @@ void lock_acquire(struct lockdep_map *lock, 
Waiman> unsigned int
Waiman>  raw_local_irq_save(flags);
Waiman>  check_flags(flags);

Waiman> +   /*
Waiman> +* An interrupt recursive read in interrupt context can be 
Waiman> considered
Waiman> +* to be the same as a recursive read from checking perspective.
Waiman> +*/
Waiman> +   if ((read == 3) && in_interrupt())
Waiman> +   read = 2;
current-> lockdep_recursion = 1;
Waiman>  trace_lock_acquire(lock, subclass, trylock, read, check, 
Waiman> nest_lock, ip);
Waiman>  __lock_acquire(lock, subclass, trylock, read, check,

Instead of the magic numbers 1,2,3, could you use some nicely named
constants here instead?  

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.16-rc6

2014-07-24 Thread John Stoffel
 Waiman == Waiman Long waiman.l...@hp.com writes:

Waiman On 07/24/2014 02:36 PM, Peter Zijlstra wrote:
 On Thu, Jul 24, 2014 at 11:18:16AM -0700, Linus Torvalds wrote:
 On Thu, Jul 24, 2014 at 5:58 AM, Peter Zijlstrapet...@infradead.org  
 wrote:
 So going by the nifty picture rostedt made:
 
 [   61.454336]CPU0CPU1
 [   61.454336]
 [   61.454336]   lock((p-alloc_lock)-rlock);
 [   61.454336]local_irq_disable();
 [   61.454336]lock(tasklist_lock);
 [   61.454336]
 lock((p-alloc_lock)-rlock);
 [   61.454336]Interrupt
 [   61.454336] lock(tasklist_lock);
 So this *should* be fine. It always has been in the past, and it was
 certainly the *intention* that it should continue to work with
 qrwlock, even in the presense of pending writers on other cpu's.
 
 The qrwlock rules are that a read-lock in an interrupt is still going
 to be unfair and succeed if there are other readers.
 Ah, indeed. Should have checked :/
 
 So it sounds to me like the new lockdep rules in tip/master are too
 strict and are throwing a false positive.
 Right. Waiman can you have a look?

Waiman Yes, I think I may have a solution for that.

Waiman Borislav, can you apply the following patch on top of the lockdep patch 
Waiman to see if it can fix the problem?

Waiman diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
Waiman index d24e433..507a8ce 100644
Waiman --- a/kernel/locking/lockdep.c
Waiman +++ b/kernel/locking/lockdep.c
Waiman @@ -3595,6 +3595,12 @@ void lock_acquire(struct lockdep_map *lock, 
Waiman unsigned int
Waiman  raw_local_irq_save(flags);
Waiman  check_flags(flags);

Waiman +   /*
Waiman +* An interrupt recursive read in interrupt context can be 
Waiman considered
Waiman +* to be the same as a recursive read from checking perspective.
Waiman +*/
Waiman +   if ((read == 3)  in_interrupt())
Waiman +   read = 2;
current- lockdep_recursion = 1;
Waiman  trace_lock_acquire(lock, subclass, trylock, read, check, 
Waiman nest_lock, ip);
Waiman  __lock_acquire(lock, subclass, trylock, read, check,

Instead of the magic numbers 1,2,3, could you use some nicely named
constants here instead?  

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: After unlinking a large file on ext4, the process stalls for a long time

2014-07-16 Thread John Stoffel

Mason> (I hope you'll forgive me for reformatting the quote characters
Mason> to my taste.)

No problem.

Mason> On 16/07/2014 17:16, John Stoffel wrote:

>> Mason wrote:
>> 
>>> I'm using Linux (3.1.10 at the moment) on a embedded system
>>> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
>>> 800-MHz CPU, USB).
>> 
>> Sounds like a Raspberry Pi...  And have you investigated using
>> something like XFS as your filesystem instead?

Mason> The system is a set-top box (DVB-S2 receiver). The system CPU is
Mason> MIPS 74K, not ARM (not that it matters, in this case).

So it's a slow slow box... and it's only going to handle writing data
at 3Mbs... so why do you insist that the filesystem work at magic
speeds?  

Mason> No, I have not investigated other file systems (yet).

>>> I need to be able to create large files (50-1000 GB) "as fast
>>> as possible".  These files are created on an external hard disk
>>> drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
>> 
>> Really... so you just need to create allocations of space as quickly
>> as possible,

Mason> I may not have been clear. The creation needs to be fast (in UX terms,
Mason> so less than 5-10 seconds), but it only occurs a few times during the
Mason> lifetime of the system.

If this only happens a few times, why do you care how quick the delete
is?  And if it's only happening a few times, why don't you just do the
space reservation OUTSIDE of the filesystem? 

Or do you need to do encryption of these containers and strictly
segrate them?  Basically, implement a daemon which knows how much free
space is on the device, how much is already pre-committed to other
users, and then how much free space there is.  

If the space isn't actually used, then you don't care, because you've
reserved it.  

>> which will then be filled in later with actual data?

Mason> Yes. In fact, I use the loopback device to format the file as an
Mason> ext4 partition. 

Why are you doing it like this?  What advantage does this buy you?  In
any case, you're now slowing things down because you have the overhead
of the base filesystem, which you then create a large file on top of,
which you then mount and format with a SECOND filesystem.  

Instead, you should probably just have a small boot/OS filesystem, and
then put the rest of the storage under LVM control.  At that point,
you can reserve space using 'lvcreate ...' which will succeed or
fail.  If good, create a filesystem in there and use it.  When you
need to delete it, just unmount the LV and just do 'lvdestroy' which
should be much faster, since you won't bother to zero out the blocks.

Now I don't know offhand if lvcreate ontop of a recently deleted LV
volume whill make sure to zero all the blocks, but I suspect so, and
probably only when they're used.

Does this make more sense?  It seems to fit your strange requirements
better...

John


>> basically someone will say "give me 600G of space reservation" and
>> then will eventually fill it up, otherwise you say "Nope, can't do
>> it!"

Mason> Right, take a 1000 GB disk,
Mason> Reserve(R1 = 300 GB) <- SUCCESS
Mason> Reserve(R2 = 300 GB) <- SUCCESS
Mason> Reserve(R3 = 300 GB) <- SUCCESS
Mason> Reserve(R4 = 300 GB) <- FAIL
Mason> Delete (R1)  <- SUCCESS
Mason> Reserve(R4 = 300 GB) <- SUCCESS

>>> So I create an ext4 partition with
>>> $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
>>> (Using e2fsprogs-1.42.10 if it matters)
>>> 
>>> And mount with "typical" mount options
>>> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
>>> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
>>> 
>>> I wrote a small test program to create a large file, then immediately
>>> unlink it.
>>> 
>>> My problem is that, while file creation is "fast enough" (4 seconds
>>> for a 300 GB file) and unlink is "immediate", the process hangs
>>> while it waits (I suppose) for the OS to actually complete the
>>> operation (almost two minutes for a 300 GB file).

Mason> [snip performance numbers]

>>> QUESTIONS:
>>> 
>>> 1) Did I provide enough information for someone to reproduce?
>> 
>> Sure, but you didn't give enough information to explain what you're
>> trying to accomplish here.  And what the use case is.  Also, since you
>> know you cannot fill 500Gb in any sort of reasonable time over USB2,
>> why are you concerned that the delete takes so long?

Mason> I don't understand your question. If the user asks to create a 300 GB
Mason> file, then immediately realizes than he won't n

Re: After unlinking a large file on ext4, the process stalls for a long time

2014-07-16 Thread John Stoffel

Mason> I'm using Linux (3.1.10 at the moment) on a embedded system
Mason> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
Mason> 800-MHz CPU, USB).

Sounds like a Raspberry Pi...  And have you investigated using
something like XFS as your filesystem instead?  

Mason> I need to be able to create large files (50-1000 GB) "as fast
Mason> as possible".  These files are created on an external hard disk
Mason> drive, connected over Hi-Speed USB (typical throughput 30
Mason> MB/s).

Really... so you just need to create allocations of space as quickly
as possible, which will then be filled in later with actuall data?  So
basically someone will say "give me 600G of space reservation" and
then will eventually fill it up, otherwise you say "Nope, can't do
it!"

Mason> Sparse files were not an acceptable solution (because the space
Mason> must be reserved, and the operation must fail if the space is
Mason> unavailable).  And filling the file with zeros was too slow
Mason> (typically 35 s/GB).

Mason> Someone mentioned fallocate on an ext4 partition.

Mason> So I create an ext4 partition with
Mason> $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
Mason> (Using e2fsprogs-1.42.10 if it matters)

Mason> And mount with "typical" mount options
Mason> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
Mason> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)

Mason> I wrote a small test program to create a large file, then immediately
Mason> unlink it.

Mason> My problem is that, while file creation is "fast enough" (4 seconds
Mason> for a 300 GB file) and unlink is "immediate", the process hangs
Mason> while it waits (I suppose) for the OS to actually complete the
Mason> operation (almost two minutes for a 300 GB file).

Mason> I also note that the (weak) CPU is pegged, so perhaps this problem
Mason> does not occur on a desktop workstation?

Mason> /tmp # time ./foo /mnt/hdd/xxx 5
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 
528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason> /tmp # time ./foo /mnt/hdd/xxx 10
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 
528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason> /tmp # time ./foo /mnt/hdd/xxx 100
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 
528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason> /tmp # time ./foo /mnt/hdd/xxx 300
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 
528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps


Mason> QUESTIONS:

Mason> 1) Did I provide enough information for someone to reproduce?

Sure, but you didn't give enough information to explain what you're
trying to accomplish here.  And what the use case is.  Also, since you
know you cannot fill 500Gb in any sorta of reasonable time over USB2,
why are you concerned that the delete takes so long?  

I think that maybe using the filesystem for the reservations is the
wrong approach.  You should use a simple daemon which listens for
requests, and then checks the filesystem space and decides if it can
honor them or not.

Then you just store the files as they get writen...

Mason> 2) Is this expected behavior?

Sure, unlinking a 1Gb file that's been written too means (on EXT4)
that you need to update all the filesystem structures.  Now it should
be quicker honestly, but maybe you're not mounting it with a journal?
And have you tried tuning the filesystem to use larger allocations and
blocks?  You're not going to make alot of files on there obviously,
but just a few large ones.  

Mason> 3) Are there knobs I can tweak (at FS creation, or at mount
Mason> time) to improve the performance of file unlinking?  (Maybe
Mason> there is a safety/performance trade-off?

Sure, there are all kinds of things you can do.  For example, how
many of these files are you expecting to store?  Will you have to be
able to handle writing of more than one file at a time?  Or are they
purely sequential?  

If you are creating a small embedded system to manage a bunch of USB2
hard drives and write data to them with a space reservation process,
then you need to make sure you can actually handle the data throughput
requirements.  And I'm not sure you can.

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: After unlinking a large file on ext4, the process stalls for a long time

2014-07-16 Thread John Stoffel

Mason I'm using Linux (3.1.10 at the moment) on a embedded system
Mason similar in spec to a desktop PC from 15 years ago (256 MB RAM,
Mason 800-MHz CPU, USB).

Sounds like a Raspberry Pi...  And have you investigated using
something like XFS as your filesystem instead?  

Mason I need to be able to create large files (50-1000 GB) as fast
Mason as possible.  These files are created on an external hard disk
Mason drive, connected over Hi-Speed USB (typical throughput 30
Mason MB/s).

Really... so you just need to create allocations of space as quickly
as possible, which will then be filled in later with actuall data?  So
basically someone will say give me 600G of space reservation and
then will eventually fill it up, otherwise you say Nope, can't do
it!

Mason Sparse files were not an acceptable solution (because the space
Mason must be reserved, and the operation must fail if the space is
Mason unavailable).  And filling the file with zeros was too slow
Mason (typically 35 s/GB).

Mason Someone mentioned fallocate on an ext4 partition.

Mason So I create an ext4 partition with
Mason $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
Mason (Using e2fsprogs-1.42.10 if it matters)

Mason And mount with typical mount options
Mason $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
Mason /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)

Mason I wrote a small test program to create a large file, then immediately
Mason unlink it.

Mason My problem is that, while file creation is fast enough (4 seconds
Mason for a 300 GB file) and unlink is immediate, the process hangs
Mason while it waits (I suppose) for the OS to actually complete the
Mason operation (almost two minutes for a 300 GB file).

Mason I also note that the (weak) CPU is pegged, so perhaps this problem
Mason does not occur on a desktop workstation?

Mason /tmp # time ./foo /mnt/hdd/xxx 5
Mason posix_fallocate(fd, 0, size_in_GiB  30): 0 [68 ms]
Mason unlink(filename): 0 [0 ms]
Mason 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 
528maxresident)k
Mason 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason /tmp # time ./foo /mnt/hdd/xxx 10
Mason posix_fallocate(fd, 0, size_in_GiB  30): 0 [141 ms]
Mason unlink(filename): 0 [0 ms]
Mason 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 
528maxresident)k
Mason 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason /tmp # time ./foo /mnt/hdd/xxx 100
Mason posix_fallocate(fd, 0, size_in_GiB  30): 0 [1882 ms]
Mason unlink(filename): 0 [0 ms]
Mason 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 
528maxresident)k
Mason 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason /tmp # time ./foo /mnt/hdd/xxx 300
Mason posix_fallocate(fd, 0, size_in_GiB  30): 0 [3883 ms]
Mason unlink(filename): 0 [0 ms]
Mason 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 
528maxresident)k
Mason 0inputs+0outputs (0major+168minor)pagefaults 0swaps


Mason QUESTIONS:

Mason 1) Did I provide enough information for someone to reproduce?

Sure, but you didn't give enough information to explain what you're
trying to accomplish here.  And what the use case is.  Also, since you
know you cannot fill 500Gb in any sorta of reasonable time over USB2,
why are you concerned that the delete takes so long?  

I think that maybe using the filesystem for the reservations is the
wrong approach.  You should use a simple daemon which listens for
requests, and then checks the filesystem space and decides if it can
honor them or not.

Then you just store the files as they get writen...

Mason 2) Is this expected behavior?

Sure, unlinking a 1Gb file that's been written too means (on EXT4)
that you need to update all the filesystem structures.  Now it should
be quicker honestly, but maybe you're not mounting it with a journal?
And have you tried tuning the filesystem to use larger allocations and
blocks?  You're not going to make alot of files on there obviously,
but just a few large ones.  

Mason 3) Are there knobs I can tweak (at FS creation, or at mount
Mason time) to improve the performance of file unlinking?  (Maybe
Mason there is a safety/performance trade-off?

Sure, there are all kinds of things you can do.  For example, how
many of these files are you expecting to store?  Will you have to be
able to handle writing of more than one file at a time?  Or are they
purely sequential?  

If you are creating a small embedded system to manage a bunch of USB2
hard drives and write data to them with a space reservation process,
then you need to make sure you can actually handle the data throughput
requirements.  And I'm not sure you can.

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: After unlinking a large file on ext4, the process stalls for a long time

2014-07-16 Thread John Stoffel

Mason (I hope you'll forgive me for reformatting the quote characters
Mason to my taste.)

No problem.

Mason On 16/07/2014 17:16, John Stoffel wrote:

 Mason wrote:
 
 I'm using Linux (3.1.10 at the moment) on a embedded system
 similar in spec to a desktop PC from 15 years ago (256 MB RAM,
 800-MHz CPU, USB).
 
 Sounds like a Raspberry Pi...  And have you investigated using
 something like XFS as your filesystem instead?

Mason The system is a set-top box (DVB-S2 receiver). The system CPU is
Mason MIPS 74K, not ARM (not that it matters, in this case).

So it's a slow slow box... and it's only going to handle writing data
at 3Mbs... so why do you insist that the filesystem work at magic
speeds?  

Mason No, I have not investigated other file systems (yet).

 I need to be able to create large files (50-1000 GB) as fast
 as possible.  These files are created on an external hard disk
 drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
 
 Really... so you just need to create allocations of space as quickly
 as possible,

Mason I may not have been clear. The creation needs to be fast (in UX terms,
Mason so less than 5-10 seconds), but it only occurs a few times during the
Mason lifetime of the system.

If this only happens a few times, why do you care how quick the delete
is?  And if it's only happening a few times, why don't you just do the
space reservation OUTSIDE of the filesystem? 

Or do you need to do encryption of these containers and strictly
segrate them?  Basically, implement a daemon which knows how much free
space is on the device, how much is already pre-committed to other
users, and then how much free space there is.  

If the space isn't actually used, then you don't care, because you've
reserved it.  

 which will then be filled in later with actual data?

Mason Yes. In fact, I use the loopback device to format the file as an
Mason ext4 partition. 

Why are you doing it like this?  What advantage does this buy you?  In
any case, you're now slowing things down because you have the overhead
of the base filesystem, which you then create a large file on top of,
which you then mount and format with a SECOND filesystem.  

Instead, you should probably just have a small boot/OS filesystem, and
then put the rest of the storage under LVM control.  At that point,
you can reserve space using 'lvcreate ...' which will succeed or
fail.  If good, create a filesystem in there and use it.  When you
need to delete it, just unmount the LV and just do 'lvdestroy' which
should be much faster, since you won't bother to zero out the blocks.

Now I don't know offhand if lvcreate ontop of a recently deleted LV
volume whill make sure to zero all the blocks, but I suspect so, and
probably only when they're used.

Does this make more sense?  It seems to fit your strange requirements
better...

John


 basically someone will say give me 600G of space reservation and
 then will eventually fill it up, otherwise you say Nope, can't do
 it!

Mason Right, take a 1000 GB disk,
Mason Reserve(R1 = 300 GB) - SUCCESS
Mason Reserve(R2 = 300 GB) - SUCCESS
Mason Reserve(R3 = 300 GB) - SUCCESS
Mason Reserve(R4 = 300 GB) - FAIL
Mason Delete (R1)  - SUCCESS
Mason Reserve(R4 = 300 GB) - SUCCESS

 So I create an ext4 partition with
 $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
 (Using e2fsprogs-1.42.10 if it matters)
 
 And mount with typical mount options
 $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
 /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
 
 I wrote a small test program to create a large file, then immediately
 unlink it.
 
 My problem is that, while file creation is fast enough (4 seconds
 for a 300 GB file) and unlink is immediate, the process hangs
 while it waits (I suppose) for the OS to actually complete the
 operation (almost two minutes for a 300 GB file).

Mason [snip performance numbers]

 QUESTIONS:
 
 1) Did I provide enough information for someone to reproduce?
 
 Sure, but you didn't give enough information to explain what you're
 trying to accomplish here.  And what the use case is.  Also, since you
 know you cannot fill 500Gb in any sort of reasonable time over USB2,
 why are you concerned that the delete takes so long?

Mason I don't understand your question. If the user asks to create a 300 GB
Mason file, then immediately realizes than he won't need it, and asks for it
Mason to be deleted, I don't see why the process should hang for 2 minutes.

Mason The use case is
Mason - allocate a large file
Mason - stick a file system on it
Mason - store stuff (typically video files) inside this private FS
Mason - when the user decides he doesn't need it anymore, unmount and unlink
Mason (I also have a resize operation in there, but I wanted to get the
Mason basics before taking the hard stuff head on.)

Mason So, in the limit, we don't store anything at all: just create and
Mason immediately delete. This was my test.

 I think that maybe using

Re: [PATCH] mm readahead: Fix sys_readahead breakage by reverting 2MB limit (bug 79111)

2014-07-03 Thread John Stoffel
> "Linus" == Linus Torvalds  writes:

Linus> On Thu, Jul 3, 2014 at 11:22 AM, Linus Torvalds
Linus>  wrote:
>> 
>> So the bugzilla entry worries me a bit - we definitely do not want to
>> regress in case somebody really relied on timing - but without more
>> specific information I still think the real bug is just in the
>> man-page.

Linus> Side note: the 2MB limit may be too small. 2M is peanuts on modern
Linus> machines, even for fairly slow IO, and there are lots of files (like
Linus> glibc etc) that people might want to read-ahead during boot. We
Linus> already do bigger read-ahead if people just do "read()" system calls.
Linus> So I could certainly imagine that we should increase it.

Linus> I do *not* think we should bow down to insane man-pages that have
Linus> always been wrong, though, and I don't think we should increase it to
Linus> "let's just read-ahead a whole ISO image" kind of sizes..

This is one of those perenial questions of how to tune this.  I agree
we should increase the number, but shouldn't it be based on both the
amount of memory in the machine, number of devices (or is it all just
one big pool?) and the speed of the actual device doing readahead?
Doesn't make sense to do 32mb of readahead on a USB 1.1 thumb drive or
even a CDROM.  But maybe it does for USB3 thumb drives?  

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm readahead: Fix sys_readahead breakage by reverting 2MB limit (bug 79111)

2014-07-03 Thread John Stoffel
 Linus == Linus Torvalds torva...@linux-foundation.org writes:

Linus On Thu, Jul 3, 2014 at 11:22 AM, Linus Torvalds
Linus torva...@linux-foundation.org wrote:
 
 So the bugzilla entry worries me a bit - we definitely do not want to
 regress in case somebody really relied on timing - but without more
 specific information I still think the real bug is just in the
 man-page.

Linus Side note: the 2MB limit may be too small. 2M is peanuts on modern
Linus machines, even for fairly slow IO, and there are lots of files (like
Linus glibc etc) that people might want to read-ahead during boot. We
Linus already do bigger read-ahead if people just do read() system calls.
Linus So I could certainly imagine that we should increase it.

Linus I do *not* think we should bow down to insane man-pages that have
Linus always been wrong, though, and I don't think we should increase it to
Linus let's just read-ahead a whole ISO image kind of sizes..

This is one of those perenial questions of how to tune this.  I agree
we should increase the number, but shouldn't it be based on both the
amount of memory in the machine, number of devices (or is it all just
one big pool?) and the speed of the actual device doing readahead?
Doesn't make sense to do 32mb of readahead on a USB 1.1 thumb drive or
even a CDROM.  But maybe it does for USB3 thumb drives?  

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Disable zone_reclaim_mode by default v2

2014-04-18 Thread John Stoffel
> "Andrew" == Andrew Morton  writes:

Andrew> On Tue,  8 Apr 2014 09:22:58 +0100 Mel Gorman  wrote:
>> Changelog since v1
>> o topology comment updates
>> 
>> When it was introduced, zone_reclaim_mode made sense as NUMA distances
>> punished and workloads were generally partitioned to fit into a NUMA
>> node. NUMA machines are now common but few of the workloads are NUMA-aware
>> and it's routine to see major performance due to zone_reclaim_mode being
>> enabled but relatively few can identify the problem.


This is unclear here.  "see major performance  due" doesn't make
sense to me.  

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Disable zone_reclaim_mode by default v2

2014-04-18 Thread John Stoffel
 Andrew == Andrew Morton a...@linux-foundation.org writes:

Andrew On Tue,  8 Apr 2014 09:22:58 +0100 Mel Gorman mgor...@suse.de wrote:
 Changelog since v1
 o topology comment updates
 
 When it was introduced, zone_reclaim_mode made sense as NUMA distances
 punished and workloads were generally partitioned to fit into a NUMA
 node. NUMA machines are now common but few of the workloads are NUMA-aware
 and it's routine to see major performance due to zone_reclaim_mode being
 enabled but relatively few can identify the problem.


This is unclear here.  see major performance what due doesn't make
sense to me.  

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] cmdline: Hide "debug" from /proc/cmdline

2014-04-05 Thread John Stoffel
>>>>> "Greg" == Greg Kroah-Hartman  writes:

Greg> On Fri, Apr 04, 2014 at 05:17:09PM -0400, John Stoffel wrote:
>> >>>>> "Linus" == Linus Torvalds  writes:
>> 
Linus> On Fri, Apr 4, 2014 at 11:21 AM, Andy Lutomirski  
wrote:
>> >> 
>> >> The other thing I've used /dev/kmsg for is to shove a "I'm starting
>> >> something now" message in.  This is only really necessary because the
>> >> current kernel log timestamps are unusable crap.  (We could fix that,
>> >> hint hint.)
>> 
Linus> I'd actually love to fix that, but I disagree with the "we could fix
Linus> it". There are tons of people who know how to parse them (admittedly
Linus> often only to ignore them), so changing the format is not likely to
Linus> work.
>> 
Linus> The good news is that "dmesg -H" does help if you're
Linus> human. While at the same time being an example of that very
Linus> "there are tools that know about the current horrid format"
Linus> issue.. D'oh.
>> 
>> I think you mean "dmesg -T", and unfortunately it seems Debian 6.0.9
>> (or older) doesn ship a new enough linux-util since I've only got
>> 2.17.2-9 install.  

Greg> No, 'dmesg -H' is the right thing, you just need a modern version of
Greg> util-linux :)

Probably.  But I'm working within my limitations, esp at work we can
upgrade too aggresively because of support needs for the tools we
use.  And at home... usually Debian is pretty good about updates like
this, but for my main NAS box, I'm not aggresive either.

>> And RHEL/Centos 5.6 and 6.5 don't seem to ship that by default either,
>> they have got util-linux-2.13-0.56.el5 and
>> util-linux-ng-2.17.2-12.14.el6.x86_64 respectively.  Blech!  It's in
>> Linux Mint 16 at least, haven't checked older versions.

Greg> util-linux is on version 2.24.1 at the moment:
Greg>   https://www.kernel.org/pub/linux/utils/util-linux/v2.24/
Greg>   2.17.2 was from back in 2012, I think it's time to switch to a
Greg>   modern distro...

2012 isn't ancient... :-)  2002 is ancient.  And I still run some
Solaris 5.8 hosts at work from that era.  

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] cmdline: Hide debug from /proc/cmdline

2014-04-05 Thread John Stoffel
 Greg == Greg Kroah-Hartman gre...@linuxfoundation.org writes:

Greg On Fri, Apr 04, 2014 at 05:17:09PM -0400, John Stoffel wrote:
  Linus == Linus Torvalds torva...@linux-foundation.org writes:
 
Linus On Fri, Apr 4, 2014 at 11:21 AM, Andy Lutomirski l...@amacapital.net 
wrote:
  
  The other thing I've used /dev/kmsg for is to shove a I'm starting
  something now message in.  This is only really necessary because the
  current kernel log timestamps are unusable crap.  (We could fix that,
  hint hint.)
 
Linus I'd actually love to fix that, but I disagree with the we could fix
Linus it. There are tons of people who know how to parse them (admittedly
Linus often only to ignore them), so changing the format is not likely to
Linus work.
 
Linus The good news is that dmesg -H does help if you're
Linus human. While at the same time being an example of that very
Linus there are tools that know about the current horrid format
Linus issue.. D'oh.
 
 I think you mean dmesg -T, and unfortunately it seems Debian 6.0.9
 (or older) doesn ship a new enough linux-util since I've only got
 2.17.2-9 install.  

Greg No, 'dmesg -H' is the right thing, you just need a modern version of
Greg util-linux :)

Probably.  But I'm working within my limitations, esp at work we can
upgrade too aggresively because of support needs for the tools we
use.  And at home... usually Debian is pretty good about updates like
this, but for my main NAS box, I'm not aggresive either.

 And RHEL/Centos 5.6 and 6.5 don't seem to ship that by default either,
 they have got util-linux-2.13-0.56.el5 and
 util-linux-ng-2.17.2-12.14.el6.x86_64 respectively.  Blech!  It's in
 Linux Mint 16 at least, haven't checked older versions.

Greg util-linux is on version 2.24.1 at the moment:
Greg   https://www.kernel.org/pub/linux/utils/util-linux/v2.24/
Greg   2.17.2 was from back in 2012, I think it's time to switch to a
Greg   modern distro...

2012 isn't ancient... :-)  2002 is ancient.  And I still run some
Solaris 5.8 hosts at work from that era.  

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] cmdline: Hide "debug" from /proc/cmdline

2014-04-04 Thread John Stoffel
> "Linus" == Linus Torvalds  writes:

Linus> On Fri, Apr 4, 2014 at 11:21 AM, Andy Lutomirski  
wrote:
>> 
>> The other thing I've used /dev/kmsg for is to shove a "I'm starting
>> something now" message in.  This is only really necessary because the
>> current kernel log timestamps are unusable crap.  (We could fix that,
>> hint hint.)

Linus> I'd actually love to fix that, but I disagree with the "we could fix
Linus> it". There are tons of people who know how to parse them (admittedly
Linus> often only to ignore them), so changing the format is not likely to
Linus> work.

Linus> The good news is that "dmesg -H" does help if you're
Linus> human. While at the same time being an example of that very
Linus> "there are tools that know about the current horrid format"
Linus> issue.. D'oh.

I think you mean "dmesg -T", and unfortunately it seems Debian 6.0.9
(or older) doesn ship a new enough linux-util since I've only got
2.17.2-9 install.  

And RHEL/Centos 5.6 and 6.5 don't seem to ship that by default either,
they have got util-linux-2.13-0.56.el5 and
util-linux-ng-2.17.2-12.14.el6.x86_64 respectively.  Blech!  It's in
Linux Mint 16 at least, haven't checked older versions.

John



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] cmdline: Hide debug from /proc/cmdline

2014-04-04 Thread John Stoffel
 Linus == Linus Torvalds torva...@linux-foundation.org writes:

Linus On Fri, Apr 4, 2014 at 11:21 AM, Andy Lutomirski l...@amacapital.net 
wrote:
 
 The other thing I've used /dev/kmsg for is to shove a I'm starting
 something now message in.  This is only really necessary because the
 current kernel log timestamps are unusable crap.  (We could fix that,
 hint hint.)

Linus I'd actually love to fix that, but I disagree with the we could fix
Linus it. There are tons of people who know how to parse them (admittedly
Linus often only to ignore them), so changing the format is not likely to
Linus work.

Linus The good news is that dmesg -H does help if you're
Linus human. While at the same time being an example of that very
Linus there are tools that know about the current horrid format
Linus issue.. D'oh.

I think you mean dmesg -T, and unfortunately it seems Debian 6.0.9
(or older) doesn ship a new enough linux-util since I've only got
2.17.2-9 install.  

And RHEL/Centos 5.6 and 6.5 don't seem to ship that by default either,
they have got util-linux-2.13-0.56.el5 and
util-linux-ng-2.17.2-12.14.el6.x86_64 respectively.  Blech!  It's in
Linux Mint 16 at least, haven't checked older versions.

John



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] speeding up the stat() family of system calls...

2013-12-21 Thread John Stoffel

Linus> Here's both x86 people and filesystem people involved, because this
Linus> hacky RFC patch touches both.

Linus> NOTE NOTE NOTE! I've modified "cp_new_stat()" in place, in a way that
Linus> is x86-64 specific. So the attached patch *only* works on x86-64, and
Linus> will very actively break on anything else. That's intentional, because
Linus> that way it's more obvious how the code changes, but a real patch
Linus> would create a *new* cp_new_stat() for x86-64, and conditionalize the
Linus> existing generic "cp_new_stat()" on not already having an
Linus> architecture-optimized one.

As a SysAdmin, I'm always interested in any speedups to filesystem
ops, since I tend to do lots of trawling through filesystems with find
looking for data on filesystem usage, largest files, etc.  So this is
good news.  Any numbers of how much better this is?  I'm travelling
tomorrow, so I won't have time to spin up a VM and play, though it's
tempting to do so.

Thanks,
John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] speeding up the stat() family of system calls...

2013-12-21 Thread John Stoffel

Linus Here's both x86 people and filesystem people involved, because this
Linus hacky RFC patch touches both.

Linus NOTE NOTE NOTE! I've modified cp_new_stat() in place, in a way that
Linus is x86-64 specific. So the attached patch *only* works on x86-64, and
Linus will very actively break on anything else. That's intentional, because
Linus that way it's more obvious how the code changes, but a real patch
Linus would create a *new* cp_new_stat() for x86-64, and conditionalize the
Linus existing generic cp_new_stat() on not already having an
Linus architecture-optimized one.

As a SysAdmin, I'm always interested in any speedups to filesystem
ops, since I tend to do lots of trawling through filesystems with find
looking for data on filesystem usage, largest files, etc.  So this is
good news.  Any numbers of how much better this is?  I'm travelling
tomorrow, so I won't have time to spin up a VM and play, though it's
tempting to do so.

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RIP - dead harddisk..

2013-09-11 Thread John Stoffel
>>>>> "H" == H Peter Anvin  writes:

H> On 09/10/2013 08:00 PM, Linus Torvalds wrote:
>> On Tue, Sep 10, 2013 at 7:46 PM, John Stoffel  wrote:
>>> 
Linus> The timing absolutely sucks, but it looks like the SSD in my
Linus> main workstation just died on me.
>>> 
>>> What model, if you care to share?  I figure you'r a perfect storm of
>>> SSD beating with all your compiles and git pulls, etc.
>> 
>> So I don't want to necessarily blame the harddisk, since it's just ten
>> days since I upgraded the rest of my machine, after it worked years in
>> the previous one. That just makes me go "hmm". As far as I know, all
>> the fans etc were working fine, but..
>> 
>>> And may I suggest that you get TWO of them next time and mirror them,
>>> for just this case?  The SysAdmin in my shouting out here...
>> 
>> I long ago gave up on doing backups. I have actively moved to a model
>> where I use replacable machines instead. I've got the stuff I care
>> about generally on a couple of different machines, and then keys etc
>> backed up on a separate encrypted USB key.
>> 
>> So it's inconvenient. Mainly from a timing standpoint. But nothing more.
>> 

H> I won't get any stationary machines without mirrored drives anymore.
H> Storage just isn't reliable enough.

And I won't trust a single USB thumb drive to hold my most important
stuff.  And how do you hold onto family pictures and such?  It's
amazing how much crap can accumulate, but also how important it can be
to have good backups that are remote.  If the house burns down, don't
matter how many machines the stuff is spread across if it's not local.

John


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RIP - dead harddisk..

2013-09-11 Thread John Stoffel
 H == H Peter Anvin h...@linux.intel.com writes:

H On 09/10/2013 08:00 PM, Linus Torvalds wrote:
 On Tue, Sep 10, 2013 at 7:46 PM, John Stoffel j...@stoffel.org wrote:
 
Linus The timing absolutely sucks, but it looks like the SSD in my
Linus main workstation just died on me.
 
 What model, if you care to share?  I figure you'r a perfect storm of
 SSD beating with all your compiles and git pulls, etc.
 
 So I don't want to necessarily blame the harddisk, since it's just ten
 days since I upgraded the rest of my machine, after it worked years in
 the previous one. That just makes me go hmm. As far as I know, all
 the fans etc were working fine, but..
 
 And may I suggest that you get TWO of them next time and mirror them,
 for just this case?  The SysAdmin in my shouting out here...
 
 I long ago gave up on doing backups. I have actively moved to a model
 where I use replacable machines instead. I've got the stuff I care
 about generally on a couple of different machines, and then keys etc
 backed up on a separate encrypted USB key.
 
 So it's inconvenient. Mainly from a timing standpoint. But nothing more.
 

H I won't get any stationary machines without mirrored drives anymore.
H Storage just isn't reliable enough.

And I won't trust a single USB thumb drive to hold my most important
stuff.  And how do you hold onto family pictures and such?  It's
amazing how much crap can accumulate, but also how important it can be
to have good backups that are remote.  If the house burns down, don't
matter how many machines the stuff is spread across if it's not local.

John


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RIP - dead harddisk..

2013-09-10 Thread John Stoffel

Linus> The timing absolutely sucks, but it looks like the SSD in my
Linus> main workstation just died on me.

What model, if you care to share?  I figure you'r a perfect storm of
SSD beating with all your compiles and git pulls, etc.  

And may I suggest that you get TWO of them next time and mirror them,
for just this case?  The SysAdmin in my shouting out here...  

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RIP - dead harddisk..

2013-09-10 Thread John Stoffel

Linus The timing absolutely sucks, but it looks like the SSD in my
Linus main workstation just died on me.

What model, if you care to share?  I figure you'r a perfect storm of
SSD beating with all your compiles and git pulls, etc.  

And may I suggest that you get TWO of them next time and mirror them,
for just this case?  The SysAdmin in my shouting out here...  

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread John Stoffel
>>>>> "Waiman" == Waiman Long  writes:

Waiman> On 09/04/2013 04:40 PM, John Stoffel wrote:
>>>>>>> "Waiman" == Waiman Long  writes:
Waiman> In term of AIM7 performance, this patch has a performance boost of
Waiman> about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.
>> 
Waiman> User Range  |   10-100  | 200-1 | 1100-2000 |
Waiman> Mean JPM w/o patch  | 4,365,114 | 7,211,115 | 6,964,439 |
Waiman> Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
Waiman> % Change|  -11.3%   |   +6.2%   |   +6.6%   |
>> 
>> This -11% impact is worisome to me, because at smaller numbers of
>> users, I would still expect the performance to go up.  So why the big
>> drop?
>> 
>> Also, how is the impact of these changes on smaller 1 socket, 4 core
>> systems?  Just because it helps a couple of big boxes, doesn't mean it
>> won't hurt the more common small case.
>> 
>> John

Waiman> I don't believe the patch will make it slower with less
Waiman> user. It is more a result of run-to-run variation. The short
Waiman> workload typically completed in a very short time. In the
Waiman> 10-100 user range, the completion times range from
Waiman> 0.02-0.11s. With a higher user count, it needs several seconds
Waiman> to run and hence the results are more reliable.

Can you then show the variation over multiple runs?  I think you have
a good justification for larger boxes to make this change, I just
worry about smaller systems getting hit and losing performance.

John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-05 Thread John Stoffel
 Waiman == Waiman Long waiman.l...@hp.com writes:

Waiman On 09/04/2013 04:40 PM, John Stoffel wrote:
 Waiman == Waiman Longwaiman.l...@hp.com  writes:
Waiman In term of AIM7 performance, this patch has a performance boost of
Waiman about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.
 
Waiman User Range  |   10-100  | 200-1 | 1100-2000 |
Waiman Mean JPM w/o patch  | 4,365,114 | 7,211,115 | 6,964,439 |
Waiman Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
Waiman % Change|  -11.3%   |   +6.2%   |   +6.6%   |
 
 This -11% impact is worisome to me, because at smaller numbers of
 users, I would still expect the performance to go up.  So why the big
 drop?
 
 Also, how is the impact of these changes on smaller 1 socket, 4 core
 systems?  Just because it helps a couple of big boxes, doesn't mean it
 won't hurt the more common small case.
 
 John

Waiman I don't believe the patch will make it slower with less
Waiman user. It is more a result of run-to-run variation. The short
Waiman workload typically completed in a very short time. In the
Waiman 10-100 user range, the completion times range from
Waiman 0.02-0.11s. With a higher user count, it needs several seconds
Waiman to run and hence the results are more reliable.

Can you then show the variation over multiple runs?  I think you have
a good justification for larger boxes to make this change, I just
worry about smaller systems getting hit and losing performance.

John

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread John Stoffel
> "Waiman" == Waiman Long  writes:

Waiman> In term of AIM7 performance, this patch has a performance boost of
Waiman> about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.

Waiman> User Range  |   10-100  | 200-1 | 1100-2000 |
Waiman> Mean JPM w/o patch  | 4,365,114 | 7,211,115 | 6,964,439 |
Waiman> Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
Waiman> % Change|  -11.3%   |   +6.2%   |   +6.6%   |

This -11% impact is worisome to me, because at smaller numbers of
users, I would still expect the performance to go up.  So why the big
drop?  

Also, how is the impact of these changes on smaller 1 socket, 4 core
systems?  Just because it helps a couple of big boxes, doesn't mean it
won't hurt the more common small case.

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] dcache: Translating dentry into pathname without taking rename_lock

2013-09-04 Thread John Stoffel
 Waiman == Waiman Long waiman.l...@hp.com writes:

Waiman In term of AIM7 performance, this patch has a performance boost of
Waiman about 6-7% on top of Linus' lockref patch on a 8-socket 80-core DL980.

Waiman User Range  |   10-100  | 200-1 | 1100-2000 |
Waiman Mean JPM w/o patch  | 4,365,114 | 7,211,115 | 6,964,439 |
Waiman Mean JPM with patch | 3,872,850 | 7,655,958 | 7,422,598 |
Waiman % Change|  -11.3%   |   +6.2%   |   +6.6%   |

This -11% impact is worisome to me, because at smaller numbers of
users, I would still expect the performance to go up.  So why the big
drop?  

Also, how is the impact of these changes on smaller 1 socket, 4 core
systems?  Just because it helps a couple of big boxes, doesn't mean it
won't hurt the more common small case.

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: negative left shift count when PAGE_SHIFT > 20

2013-07-18 Thread John Stoffel

Jerry> When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative. The
Jerry> calculating here will generate an unexpected result. In addition, if
Jerry> PAGE_SHIFT > 20, The memory size represented by numentries was already
Jerry> integral multiple of 1MB.

Why this magic number of 20?  Please explain it better and replace it
was a #define that means something here.  


Jerry> Signed-off-by: Jerry 
Jerry> ---
Jerry>  mm/page_alloc.c | 8 +---
Jerry>  1 file changed, 5 insertions(+), 3 deletions(-)

Jerry> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
Jerry> index b100255..cd41797 100644
Jerry> --- a/mm/page_alloc.c
Jerry> +++ b/mm/page_alloc.c
Jerry> @@ -5745,9 +5745,11 @@ void *__init alloc_large_system_hash(const char 
*tablename,
Jerry>  if (!numentries) {
Jerry>  /* round applicable memory size up to nearest megabyte 
*/
Jerry>  numentries = nr_kernel_pages;
Jerry> -numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
Jerry> -numentries >>= 20 - PAGE_SHIFT;
Jerry> -numentries <<= 20 - PAGE_SHIFT;
Jerry> +if (20 > PAGE_SHIFT) {
Jerry> +numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
Jerry> +numentries >>= 20 - PAGE_SHIFT;
Jerry> +numentries <<= 20 - PAGE_SHIFT;
Jerry> +}
 
Jerry>  /* limit to 1 bucket per 2^scale bytes of low memory */
Jerry>  if (scale > PAGE_SHIFT)
Jerry> -- 
Jerry> 1.8.1.5

Jerry> --
Jerry> To unsubscribe from this list: send the line "unsubscribe linux-kernel" 
in
Jerry> the body of a message to majord...@vger.kernel.org
Jerry> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jerry> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: negative left shift count when PAGE_SHIFT 20

2013-07-18 Thread John Stoffel

Jerry When PAGE_SHIFT  20, the result of 20 - PAGE_SHIFT is negative. The
Jerry calculating here will generate an unexpected result. In addition, if
Jerry PAGE_SHIFT  20, The memory size represented by numentries was already
Jerry integral multiple of 1MB.

Why this magic number of 20?  Please explain it better and replace it
was a #define that means something here.  


Jerry Signed-off-by: Jerry uuli...@gmail.com
Jerry ---
Jerry  mm/page_alloc.c | 8 +---
Jerry  1 file changed, 5 insertions(+), 3 deletions(-)

Jerry diff --git a/mm/page_alloc.c b/mm/page_alloc.c
Jerry index b100255..cd41797 100644
Jerry --- a/mm/page_alloc.c
Jerry +++ b/mm/page_alloc.c
Jerry @@ -5745,9 +5745,11 @@ void *__init alloc_large_system_hash(const char 
*tablename,
Jerry  if (!numentries) {
Jerry  /* round applicable memory size up to nearest megabyte 
*/
Jerry  numentries = nr_kernel_pages;
Jerry -numentries += (1UL  (20 - PAGE_SHIFT)) - 1;
Jerry -numentries = 20 - PAGE_SHIFT;
Jerry -numentries = 20 - PAGE_SHIFT;
Jerry +if (20  PAGE_SHIFT) {
Jerry +numentries += (1UL  (20 - PAGE_SHIFT)) - 1;
Jerry +numentries = 20 - PAGE_SHIFT;
Jerry +numentries = 20 - PAGE_SHIFT;
Jerry +}
 
Jerry  /* limit to 1 bucket per 2^scale bytes of low memory */
Jerry  if (scale  PAGE_SHIFT)
Jerry -- 
Jerry 1.8.1.5

Jerry --
Jerry To unsubscribe from this list: send the line unsubscribe linux-kernel 
in
Jerry the body of a message to majord...@vger.kernel.org
Jerry More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jerry Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT] Networking

2013-05-02 Thread John Stoffel
>>>>> "Ben" == Ben Hutchings  writes:

Ben> On Thu, 2013-05-02 at 14:53 -0400, John Stoffel wrote:
>> >>>>> "David" == David Miller  writes:
>> 
David> From: Bjørn Mork 
David> Date: Thu, 02 May 2013 11:06:42 +0200
>> 
>> >> From d957cf339bf625869c39d852ac6733ef597ecef9 Mon Sep 17 00:00:00 2001
>> >> From: Bjørn Mork 
>> >> Date: Thu, 2 May 2013 10:37:05 +0200
>> >> Subject: [PATCH] net: vlan,ethtool: netdev_features_t is more than 32 bit
>> >> MIME-Version: 1.0
>> >> Content-Type: text/plain; charset=UTF-8
>> >> Content-Transfer-Encoding: 8bit
>> >> 
>> >> Signed-off-by: Bjørn Mork 
>> 
David> Also applied and queued up for -stable.
>> 
David> These changes show me that this special type isn't providing type
David> safety in the way that we actually need it.
>> 
David> Something like how we do the MM page table types would work better:
>> 
David> typedef struct { u64 val; } netdev_features_t;
>> 
David> #define __netdev_feature(X)  ((netdev_features_t) { X } )
>> 
David> and also with the appropriate set of accessors.
>> 
David> Then you can't get it wrong without a compile error.
>> 
>> Isn't part of the problem that you're exporting it into /sys in a
>> binary format?
Ben> [...]

Ben> Features are exported through SIOCETHTOOL, not sysfs (though they *used*
Ben> to be there).

Ben> The 'flags' attribue in sysfs is something different.

THanks for the clarification.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT] Networking

2013-05-02 Thread John Stoffel
> "David" == David Miller  writes:

David> From: Bjørn Mork 
David> Date: Thu, 02 May 2013 11:06:42 +0200

>> From d957cf339bf625869c39d852ac6733ef597ecef9 Mon Sep 17 00:00:00 2001
>> From: Bjørn Mork 
>> Date: Thu, 2 May 2013 10:37:05 +0200
>> Subject: [PATCH] net: vlan,ethtool: netdev_features_t is more than 32 bit
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>> 
>> Signed-off-by: Bjørn Mork 

David> Also applied and queued up for -stable.

David> These changes show me that this special type isn't providing type
David> safety in the way that we actually need it.

David> Something like how we do the MM page table types would work better:

David> typedef struct { u64 val; } netdev_features_t;

David> #define __netdev_feature(X)  ((netdev_features_t) { X } )

David> and also with the appropriate set of accessors.

David> Then you can't get it wrong without a compile error.

Isn't part of the problem that you're exporting it into /sys in a
binary format?  Why not just have each flag as it's own file and
value?  Sure, it's a waste in some ways, but then it makes it simpler
to just do an 'opendir()' to see if the flag exists, much less what
it's set to.

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT] Networking

2013-05-02 Thread John Stoffel
 David == David Miller da...@davemloft.net writes:

David From: Bjørn Mork bj...@mork.no
David Date: Thu, 02 May 2013 11:06:42 +0200

 From d957cf339bf625869c39d852ac6733ef597ecef9 Mon Sep 17 00:00:00 2001
 From: Bjørn Mork bj...@mork.no
 Date: Thu, 2 May 2013 10:37:05 +0200
 Subject: [PATCH] net: vlan,ethtool: netdev_features_t is more than 32 bit
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 
 Signed-off-by: Bjørn Mork bj...@mork.no

David Also applied and queued up for -stable.

David These changes show me that this special type isn't providing type
David safety in the way that we actually need it.

David Something like how we do the MM page table types would work better:

David typedef struct { u64 val; } netdev_features_t;

David #define __netdev_feature(X)  ((netdev_features_t) { X } )

David and also with the appropriate set of accessors.

David Then you can't get it wrong without a compile error.

Isn't part of the problem that you're exporting it into /sys in a
binary format?  Why not just have each flag as it's own file and
value?  Sure, it's a waste in some ways, but then it makes it simpler
to just do an 'opendir()' to see if the flag exists, much less what
it's set to.

John
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT] Networking

2013-05-02 Thread John Stoffel
 Ben == Ben Hutchings bhutchi...@solarflare.com writes:

Ben On Thu, 2013-05-02 at 14:53 -0400, John Stoffel wrote:
  David == David Miller da...@davemloft.net writes:
 
David From: Bjørn Mork bj...@mork.no
David Date: Thu, 02 May 2013 11:06:42 +0200
 
  From d957cf339bf625869c39d852ac6733ef597ecef9 Mon Sep 17 00:00:00 2001
  From: Bjørn Mork bj...@mork.no
  Date: Thu, 2 May 2013 10:37:05 +0200
  Subject: [PATCH] net: vlan,ethtool: netdev_features_t is more than 32 bit
  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  
  Signed-off-by: Bjørn Mork bj...@mork.no
 
David Also applied and queued up for -stable.
 
David These changes show me that this special type isn't providing type
David safety in the way that we actually need it.
 
David Something like how we do the MM page table types would work better:
 
David typedef struct { u64 val; } netdev_features_t;
 
David #define __netdev_feature(X)  ((netdev_features_t) { X } )
 
David and also with the appropriate set of accessors.
 
David Then you can't get it wrong without a compile error.
 
 Isn't part of the problem that you're exporting it into /sys in a
 binary format?
Ben [...]

Ben Features are exported through SIOCETHTOOL, not sysfs (though they *used*
Ben to be there).

Ben The 'flags' attribue in sysfs is something different.

THanks for the clarification.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Major network performance regression in 3.7

2013-01-06 Thread John Stoffel
> "Willy" == Willy Tarreau  writes:

Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
>> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
>> > > 
>> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
>> > > huge in your case, so we always set the flag in fs/splice.c)
>> > 
>> > I am testing :
>> > 
>> >if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> > more |= MSG_SENDPAGE_NOTLAST;
>> > 
>> 
>> Yes, this should fix the problem :
>> 
>> If there is no following buffer in the pipe, we should not set NOTLAST.
>> 
>> diff --git a/fs/splice.c b/fs/splice.c
>> index 8890604..6909d89 100644
>> --- a/fs/splice.c
>> +++ b/fs/splice.c
>> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
>> *pipe,
>> return -EINVAL;
>> 
>> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
>> -if (sd->len < sd->total_len)
>> +
>> +if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> more |= MSG_SENDPAGE_NOTLAST;
>> +
>> return file->f_op->sendpage(file, buf->page, buf->offset,
sd-> len, , more);
>> }
 
Willy> OK it works like a charm here now ! I can't break it anymore, so it
Willy> looks like you finally got it !

It's still broken, there's no comments in the code to explain all this
magic to mere mortals!  *grin*

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Major network performance regression in 3.7

2013-01-06 Thread John Stoffel
>>>>> "Willy" == Willy Tarreau  writes:

Willy> On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
>> >>>>> "Willy" == Willy Tarreau  writes:
>> 
Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
>> >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
>> >> > > 
>> >> > > (sd->len is usually 4096, which is expected, but sd->total_len value 
>> >> > > is
>> >> > > huge in your case, so we always set the flag in fs/splice.c)
>> >> > 
>> >> > I am testing :
>> >> > 
>> >> >if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> >> > more |= MSG_SENDPAGE_NOTLAST;
>> >> > 
>> >> 
>> >> Yes, this should fix the problem :
>> >> 
>> >> If there is no following buffer in the pipe, we should not set NOTLAST.
>> >> 
>> >> diff --git a/fs/splice.c b/fs/splice.c
>> >> index 8890604..6909d89 100644
>> >> --- a/fs/splice.c
>> >> +++ b/fs/splice.c
>> >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
>> >> *pipe,
>> >> return -EINVAL;
>> >> 
>> >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
>> >> - if (sd->len < sd->total_len)
>> >> +
>> >> + if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> >> more |= MSG_SENDPAGE_NOTLAST;
>> >> +
>> >> return file->f_op->sendpage(file, buf->page, buf->offset,
sd-> len, , more);
>> >> }
>> 
Willy> OK it works like a charm here now ! I can't break it anymore, so it
Willy> looks like you finally got it !
>> 
>> It's still broken, there's no comments in the code to explain all this
>> magic to mere mortals!  *grin*

Willy> I would generally agree, but when Eric fixes such a thing, he
Willy> generally goes with lengthy details in the commit message.

I'm sure he will too, I just wanted to nudge him because while I sorta
followed this discussion, I see lots of pain down the road if the code
wasn't updated with some nice big fat comments.

Great job finding this code and testing, testing, testing.

John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Major network performance regression in 3.7

2013-01-06 Thread John Stoffel
 Willy == Willy Tarreau w...@1wt.eu writes:

Willy On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
  Willy == Willy Tarreau w...@1wt.eu writes:
 
Willy On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
  On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:

(sd-len is usually 4096, which is expected, but sd-total_len value 
is
huge in your case, so we always set the flag in fs/splice.c)
   
   I am testing :
   
  if (sd-len  sd-total_len  pipe-nrbufs  1)
   more |= MSG_SENDPAGE_NOTLAST;
   
  
  Yes, this should fix the problem :
  
  If there is no following buffer in the pipe, we should not set NOTLAST.
  
  diff --git a/fs/splice.c b/fs/splice.c
  index 8890604..6909d89 100644
  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
  *pipe,
  return -EINVAL;
  
  more = (sd-flags  SPLICE_F_MORE) ? MSG_MORE : 0;
  - if (sd-len  sd-total_len)
  +
  + if (sd-len  sd-total_len  pipe-nrbufs  1)
  more |= MSG_SENDPAGE_NOTLAST;
  +
  return file-f_op-sendpage(file, buf-page, buf-offset,
sd- len, pos, more);
  }
 
Willy OK it works like a charm here now ! I can't break it anymore, so it
Willy looks like you finally got it !
 
 It's still broken, there's no comments in the code to explain all this
 magic to mere mortals!  *grin*

Willy I would generally agree, but when Eric fixes such a thing, he
Willy generally goes with lengthy details in the commit message.

I'm sure he will too, I just wanted to nudge him because while I sorta
followed this discussion, I see lots of pain down the road if the code
wasn't updated with some nice big fat comments.

Great job finding this code and testing, testing, testing.

John

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   >