Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-12-15 Thread Dhaval Giani
On 12/14/20 3:25 PM, Joel Fernandes wrote:

>> No problem. That was there primarily for debugging.
> Ok. I squashed Josh's changes into this patch and several of my fixups. So
> there'll be 3 patches:
> 1. CGroup + prctl  (single patch as it is hell to split it)

Please don't do that. I am not sure we have thought the cgroup interface
through (looking at all the discussions). IMHO, it would be better for us to
get a simpler interface (prctl) right, and then, once we learn the lessons
from using it, apply them to the cgroup interface. I think we all agree we
don't want to maintain a messy interface forever.

Dhaval



Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Dhaval Giani
On Mon, Aug 24, 2020 at 4:32 AM Vineeth Pillai  wrote:
>
> > Let me know your thoughts and looking forward to a good LPC MC discussion!
> >
>
> Nice write up Joel, thanks for taking time to compile this with great detail!
>
> After going through the details of the interface proposal using cgroup v2
> controllers, and based on our discussion offline, I would like to note down
> this idea about a new pseudo filesystem interface for core scheduling. We
> could include this also for the API discussion during the core scheduler MC.
>
> coreschedfs: pseudo filesystem interface for Core Scheduling
> ------------------------------------------------------------
>
> The basic requirement of core scheduling is simple - we need to group a set
> of tasks into a trust group that can share a core. So we don't really need a
> nested hierarchy for the trust groups. Cgroups v2 follows a unified nested
> hierarchy model that causes considerable confusion if the trusted tasks are
> at different levels of the hierarchy and we need to allow them to share the
> core. Cgroup v2's single hierarchy model makes it difficult to regroup tasks
> in different levels of nesting for core scheduling. As noted in this mail,
> we could use a multi-file approach and other interfaces like prctl to
> overcome this limitation.
>
> The idea proposed here to overcome the above limitation is to come up with
> a new pseudo filesystem - “coreschedfs”. This filesystem is basically a flat
> filesystem with a maximum nesting level of 1. That means the root directory
> can have sub-directories for sub-groups, but those sub-directories cannot
> have more sub-directories representing trust groups. The root directory
> represents the system-wide trust group, and the sub-directories represent
> trusted groups. Each directory, including the root directory, has the
> following set of files/directories:
>
> - cookie_id: User-exposed id for a cookie. This can be compared to a file
>   descriptor. It could be used in a programmatic API to join/leave a group.
>
> - properties: This is an interface to specify how child tasks of this group
>   should behave. It can be used for specifying future flag requirements as
>   well. The current list of properties includes:
>     NEW_COOKIE_FOR_CHILD: all fork() for tasks in this group will result in
>       the creation of a new trust group
>     SAME_COOKIE_FOR_CHILD: all fork() for tasks in this group will end up
>       in this same group
>     ROOT_COOKIE_FOR_CHILD: all fork() for tasks in this group goes to the
>       root group
>
> - tasks: Lists the tasks in this group. Main interface for adding and
>   removing tasks in a group.
>
> - <pid>: A directory per task that is a member of this trust group.
>   - <pid>/properties: This file is the same as the parent properties file,
>     but it overrides the group setting for that task.
>
> This pseudo filesystem can be mounted anywhere in the root filesystem; I
> propose the default to be “/sys/kernel/coresched”.
>
> When coresched is enabled, the kernel internally creates the framework for
> this filesystem. The filesystem gets mounted to the default location, and
> the admin can change this if needed. All tasks are in the root group by
> default. The admin or programs can then create trusted groups on top of
> this filesystem.
>
> Hooks will be placed in fork() and exit() to make sure that the
> filesystem's view of tasks is up-to-date with the system. APIs manipulating
> core scheduling trusted groups should likewise make sure that the
> filesystem's view is updated.
>
> Note: The above idea is very similar to cgroups v1. Since there is no
> unified hierarchy in cgroup v1, most of the features of coreschedfs could
> be implemented as a cgroup v1 controller. But as no new v1 controllers are
> allowed, I feel the best alternative for a simple API is to come up with a
> new filesystem - coreschedfs.
>
> The advantages of this approach are:
>
> - Detached from the cgroup unified hierarchy, so the very simple
>   requirement of core scheduling can be easily materialized.
> - Admin can have fine-grained control of groups using shell and scripting.
> - Programmatic access using existing APIs like mkdir, rmdir, write and read
>   (see the sketch below); or new APIs could be built around the cookie_id
>   to wrap the above Linux APIs, or a new system call for core scheduling.
> - Fine-grained permission control using Linux filesystem permissions and
>   ACLs.
>
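A minimal sketch of what that programmatic access could look like, assuming
the proposed (never merged) interface above: the mount point, the per-group
"tasks" and "properties" files, and the property names are all taken from the
proposal, not from any shipped kernel.

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	char pid[16];

	/* One mkdir creates a trust group; the kernel assigns the cookie. */
	if (mkdir("/sys/kernel/coresched/vm1", 0755) && errno != EEXIST)
		return 1;

	/* Ask that children of tasks in this group stay in the group. */
	if (write_str("/sys/kernel/coresched/vm1/properties",
		      "SAME_COOKIE_FOR_CHILD"))
		return 1;

	/* Move the calling task into the group. */
	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	return write_str("/sys/kernel/coresched/vm1/tasks", pid) ? 1 : 0;
}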
> Disadvantages are:
> - Yet another pseudo filesystem.
> - Very similar to cgroup v1, and might be re-implementing features that
>   are already provided by cgroups v1.
>
> Use Cases
> -
>
> Usecase 1: Google cloud
> -
>
> Since we no longer depend on cgroup v2 hierarchies, there will not be any
> issue of nesting and sharing. The 

Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Dhaval Giani
On Fri, Aug 21, 2020 at 8:01 PM Joel Fernandes  wrote:
>
> Hello!
> Core-scheduling aims to allow making it safe for more than 1 task that trust
> each other to safely share hyperthreads within a CPU core [1]. This results
> in a performance improvement for workloads that can benefit from using
> hyperthreading safely while limiting core-sharing when it is not safe.
>
> Currently no universally agreed-upon interface exists, and companies have
> been hacking up their own interfaces to make use of the patches. This post
> aims to list usecases which I got after talking to various people at Google
> and Oracle, after which actual development of code to add interfaces can
> follow.
>
> The below text uses the terms cookie and tag interchangeably. Further, cookie
> of 0 is assumed to indicate a trusted process - such as kernel threads or
> system daemons. By default, if nothing is tagged then everything is
> considered trusted since the scheduler assumes all tasks are a match for each
> other.
>
> Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
> int32 is split into 2 parts, the color and the id. The color can only be set
> by privileged processes and the id can be set by anyone. The CGroup structure
> looks like:
>
>        A           B
>       / \        / | \
>      C   D      E  F  G
>
> Here A and B are container CGroups for 2 jobs and are assigned a color by a
> privileged daemon. The job itself has more sub-CGroups within (for ex, B has
> E, F and G). When these sub-CGroups are spawned, they inherit the color from
> the parent. An unprivileged user can then set an id for the sub-CGroup
> without the knowledge of the privileged daemon if it desires to add further
> isolation. This setting of id can be an unprivileged operation because the
> root daemon has already isolated A and B.
>
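An illustrative encoding of the 32-bit tag described in usecase 1. The mail
only says the int32 is split into a color and an id; the 8/24 split below is
an assumption made for the sake of the example.

#include <stdint.h>

#define TAG_ID_BITS	24			/* assumed width of the id */
#define TAG_ID_MASK	((1u << TAG_ID_BITS) - 1)

/* The privileged daemon picks the color; anyone may pick the id. */
static inline uint32_t make_tag(uint8_t color, uint32_t id)
{
	return ((uint32_t)color << TAG_ID_BITS) | (id & TAG_ID_MASK);
}

static inline uint8_t tag_color(uint32_t tag)
{
	return (uint8_t)(tag >> TAG_ID_BITS);
}

static inline uint32_t tag_id(uint32_t tag)
{
	return tag & TAG_ID_MASK;
}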
> Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
> spawns a renderer. A renderer is a sandboxed process and it is assumed it
> could run arbitrary code (Javascript etc). When a renderer is created, a
> prctl call is made to tag the renderer. Every thread that is spawned by the
> renderer is also tagged. Essentially this turns SMT off for the renderer, but
> still gives a performance boost due to privileged system threads being able
> to share a core. The tagging also forbids the renderer from sharing a core
> with privileged system processes. In the future, we plan to allow threads to
> share a core as well (especially once we get syscall-isolation upstreamed.
> Patches were posted recently for the same [2]).
>
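For reference, tagging a renderer could look like the sketch below. No
interface had been settled when this was written; this is modeled on the
PR_SCHED_CORE prctl() that was eventually merged in v5.14, with the constants
defined locally in case <sys/prctl.h> lacks them, so treat the details as
illustrative rather than as part of this proposal.

#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE		62
#define PR_SCHED_CORE_CREATE	1	/* create a new cookie */
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD_GROUP
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

int main(void)
{
	/* Give this thread group its own cookie: its threads may share a
	 * core with each other, but not with untagged system processes. */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, getpid(),
		  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0)) {
		perror("prctl(PR_SCHED_CORE)");
		return 1;
	}
	return 0;
}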
> Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
> tagged thus disallowing core sharing between the vCPU thread and any other
> thread on the system. This is because such VMs may run arbitrary user code
> and attack both the guest and the host systems sharing the core.
>
> Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
> mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be allowed
> to not have to share its parent's CGroup tag. In fact, it should be allowed to
> untag the child CGroup if needed thus allowing them to share a core with
> trusted tasks. Others have had similar requirements.
>

Just to augment this. This doesn't necessarily need to be cgroup
based. We do have a need where certain processes want to be tagged
separately from others, which are in the same cgroup hierarchy. The
standard mechanism for this is nested cgroups. With a unified
hierarchy, and with cgroup tagging, I am unsure what this really
means. Consider

root
 |- A
    |- A1
    |- A2

If A is tagged, can processes in A1 and A2 share a core? Should they
share a core? In some cases we might be OK with them sharing cores
just to get some of the performance back. Core scheduling really needs
to be limited to just the processes that we want to protect.

> Proposal for tagging
> 
> We have to support both CGroup and non-CGroup users. CGroup may be overkill
> for some, and the CGroup v2 unified hierarchy may be too inflexible.
> Regardless, we must support CGroup due to its ease of use and existing users.
>
> For Usecase #1
> --
> Usecase #1 requires a 2-level tagging mechanism. I propose 2 new files
> to the CPU controller:
> - tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
>   tagged.  (In the kernel, the cookie will be derived from the pointer value
>   of a ref-counted cookie object.). If reset, then the CGroup will inherit
>   the parent CGroup's cookie if there is one.
>
> - color : The ref-counted object will be aligned say to a 256-byte boundary
>   (for example), then the lower 8 bits of the pointer can be used to specify
>   color. Together, the pointer with the color will form a cookie used by the
>   scheduler.
>
> Note that if 2 CGroups belong to 2 different tagged hierarchies, then setting
> their color to be the same does not imply that the 2 groups will share a
> core. This is key.
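A sketch of that cookie encoding (an assumption, not merged code): the names
are made up, but the arithmetic is what the text describes - a 256-byte
aligned object pointer leaves the low 8 bits free for the color. Because two
tagged hierarchies use two different cookie objects, equal colors still yield
different cookies, which is exactly the property called out above.

#define COOKIE_COLOR_BITS	8
#define COOKIE_COLOR_MASK	((1UL << COOKIE_COLOR_BITS) - 1)

struct core_tag_cookie;	/* ref-counted, allocated with 256-byte alignment */

static inline unsigned long make_cookie(struct core_tag_cookie *obj,
					unsigned char color)
{
	/* The alignment guarantees the low 8 bits of the pointer are zero. */
	return (unsigned long)obj | (color & COOKIE_COLOR_MASK);
}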


[CFP LPC 2020] Scheduler Microconference

2020-07-29 Thread Dhaval Giani
Hi all,

We are pleased to announce the Scheduler Microconference has been
accepted at LPC this year.

Please submit your proposals on the LPC website at:

https://www.linuxplumbersconf.org/event/7/abstracts/#submit-abstract

And be sure to select "Scheduler MC" in the Track pulldown menu.


Topics we are interested in this year (but are certainly not limited to) include:

- Load Balancer Rework
- Idle Balance optimizations
- Flattening the group scheduling hierarchy
- Core scheduling
- Proxy Execution for CFS
- What was formerly known as latency nice

Please get your submissions in by Aug 7th!

Thanks!
The organizers of the Scheduler Microconference



Re: CFP: LPC Testing and Fuzzing microconference.

2019-07-24 Thread Dhaval Giani
On Tue, Jul 2, 2019 at 1:12 PM Dhaval Giani  wrote:
>
> Hi folks,
>
> I am pleased to announce the Testing Microconference has been accepted
> at LPC this year.
>
> The CfP process is now open, and please submit your talks on the LPC
> website. It can be found at
> https://linuxplumbersconf.org/event/4/abstracts/
>
> Potential topics include, but are not limited to
> - Defragmentation of testing infrastructure: how can we combine
> testing infrastructure to avoid duplication.
> - Better sanitizers: Tag-based KASAN, making KTSAN usable, etc
> - Better hardware testing, hardware sanitizers.
> - Are fuzzers "solved"?
> - Improving RT testing.
> - Using clang for better testing coverage.
> - Unit test framework.
> - The future of kernelCI
>

Hi all,

Just a reminder, the CfP is open for the microconference and proposals
are being accepted. We plan to start selecting topics starting Aug 11,
and I can let you know that if you don't get your topic in, it will not
be selected!

https://linuxplumbersconf.org/event/4/abstracts/

Thanks!
Dhaval and Sasha


CFP: LPC Testing and Fuzzing microconference.

2019-07-02 Thread Dhaval Giani
Hi folks,

I am pleased to announce the Testing Microconference has been accepted
at LPC this year.

The CfP process is now open, and please submit your talks on the LPC
website. It can be found at
https://linuxplumbersconf.org/event/4/abstracts/

Potential topics include, but are not limited to
- Defragmentation of testing infrastructure: how can we combine
testing infrastructure to avoid duplication.
- Better sanitizers: Tag-based KASAN, making KTSAN usable, etc
- Better hardware testing, hardware sanitizers.
- Are fuzzers "solved"?
- Improving RT testing.
- Using clang for better testing coverage.
- Unit test framework.
- The future of kernelCI

Thanks!
Dhaval and Sasha


Re: Linux Testing Microconference at LPC

2019-05-22 Thread Dhaval Giani
> Please let us know what topics you believe should be a part of the
> micro conference this year.

At OSPM right now, Douglas and Ionela were talking about their
scheduler behavioral testing framework using LISA and rt-app. This is
an interesting topic, and I think it has a lot of scope for making
scheduler testing/behaviour more predictable, as well as for
analyzing/validating scheduler behavior. I am hoping they are able to
make it to LPC this year.

Dhaval


Re: Linux Testing Microconference at LPC

2019-05-22 Thread Dhaval Giani
On Wed, May 22, 2019 at 6:04 PM Dmitry Vyukov  wrote:
>
> On Thu, May 16, 2019 at 2:51 AM  wrote:
> > > -Original Message-
> > > From: Sasha Levin
> > >
> > > On Fri, Apr 26, 2019 at 02:02:53PM -0700, Tim Bird wrote:
> > ...
> > > >
> > > >With regards to the Testing microconference at Plumbers, I would like
> > > >to do a presentation on the current status of test standards and test
> > > >framework interoperability.  We recently had some good meetings
> > > >between the LAVA and Fuego people at Linaro Connect
> > > >on this topic.
> > >
> > > Hi Tim,
> > >
> > > Sorry for the delayed response, this mail got marked as read as a result
> > > of fat fingers :(
> > >
> > > I'd want to avoid having an 'overview' talk as part of the MC. We have
> > > quite a few discussion topics this year and in the spirit of LPC I'd
> > > prefer to avoid presentations.
> >
> > OK.  Sounds good.
> >
> > > Maybe it's more appropriate for the refereed track?
> > I'll consider submitting it there, but there's a certain "fun" aspect
> > to attending a conference that I don't have to prepare a talk for. :-)
> >
> > Thanks for getting back to me.  I'm already registered for Plumbers,
> > so I'll see you there.
> >  -- Tim
>
>
> I would like to give an update on syzkaller/syzbot and discuss:
>  - testability of kernel components in this context
>  - test coverage and what's still not tested
>  - discussion of the process (again): what works, what doesn't work, feedback
>

This sounds good to me.

> I also submitted a refereed track talk called "Reflections on kernel
> quality, development process and testing". If it's not accepted, I
> would like to do it on Testing MC.

I don't think refereed talks fit in the MC.


Linux Testing Microconference at LPC

2019-04-11 Thread Dhaval Giani
Hi Folks,

This is a call for participation for the Linux Testing microconference
at LPC this year.

For those who were at LPC last year, as the closing panel mentioned,
testing is probably the next big push needed to improve quality. From
getting more selftests in, to regression testing to ensure we don't
break realtime as more of PREEMPT_RT comes in, to more stable distros,
we need more testing around the kernel.

We have talked about different efforts around testing in the past, such
as fuzzing (using syzkaller and trinity), automating fuzzing with
syzbot, 0day testing, test frameworks such as ktest, and using smatch
to find bugs. We want to push this discussion further this year and are
interested in hearing from you what you want to talk about, and where
kernel testing needs to go next.

Please let us know what topics you believe should be a part of the
micro conference this year.

Thanks!
Sasha and Dhaval


Re: [PATCH v4 00/10] steal tasks to improve CPU utilization

2019-01-31 Thread Dhaval Giani


> 
> On 12/6/2018 4:28 PM, Steve Sistare wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates.  To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed.  For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>>
>> This simple stealing yields a higher CPU utilization than idle_balance()
>> alone, because the search is cheap, so it may be called every time the CPU
>> is about to go idle.  idle_balance() does more work because it searches
>> widely for the busiest queue, so to limit its CPU consumption, it declines
>> to search if the system is too busy.  Simple stealing does not offload the
>> globally busiest queue, but it is much better than running nothing at all.
>>
>> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
>> reduce cache contention vs the usual bitmap when many threads concurrently
>> set, clear, and visit elements.
>>
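Condensed sketch of the stealing idea described above (not the actual patch:
overloaded_mask(), first_migratable_task() and steal_task() are invented
helper names standing in for the real code; for_each_cpu and struct rq are
the usual kernel primitives):

static int try_steal(struct rq *dst_rq)
{
	int cpu;

	/* Cheap scan of the sparse bitmap of overloaded CPUs in this LLC... */
	for_each_cpu(cpu, overloaded_mask(dst_rq)) {
		/* ...taking the first migratable task, in order of next to run. */
		struct task_struct *p = first_migratable_task(cpu, dst_rq);

		if (p) {
			steal_task(dst_rq, p);
			return 1;	/* run it instead of going idle */
		}
	}
	return 0;
}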
>> Patch 1 defines the sparsemask type and its operations.
>>
>> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>>
>> Patches 5 and 6 refactor existing code for a cleaner merge of later
>>   patches.
>>
>> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>>
>> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
>> time being because of performance regressions that are not due to stealing
>> per-se.  See the patch description for details.
>>
>> Patch 10 adds schedstats for comparing the new behavior to the old, and
>>   provided as a convenience for developers only, not for integration.
>>
>> The patch series is based on kernel 4.20.0-rc1.  It compiles, boots, and
>> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
>> and CONFIG_PREEMPT.  It runs without error with CONFIG_DEBUG_PREEMPT +
>> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
>> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP.  CPU hot plug and CPU
>> bandwidth control were tested.
>>
>> Stealing improves utilization with only a modest CPU overhead in scheduler
>> code.  In the following experiment, hackbench is run with varying numbers
>> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
>> for each run, averaged per CPU, augmented with these non-standard stats:
>>
>>   %find - percent of time spent in old and new functions that search for
>> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>>
>>   steal - number of times a task is stolen from another CPU.
>>
>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> hackbench  process 10
>> sched_wakeup_granularity_ns=15000000
>>
>>   baseline
>>   grps   time  %busy  slice   sched    idle     wake  %find  steal
>>   1     8.084  75.02   0.10  105476   46291    59183   0.31      0
>>   2    13.892  85.33   0.10  190225   70958   119264   0.45      0
>>   3    19.668  89.04   0.10  263896   87047   176850   0.49      0
>>   4    25.279  91.28   0.10  322171   94691   227474   0.51      0
>>   8    47.832  94.86   0.09  630636  144141   486322   0.56      0
>>
>>   new
>>   grps   time  %busy  slice   sched    idle     wake  %find  steal  %speedup
>>   1     5.938  96.80   0.24   31255    7190    24061   0.63   7433      36.1
>>   2    11.491  99.23   0.16   74097    4578    69512   0.84  19463      20.9
>>   3    16.987  99.66   0.15  115824    1985   113826   0.77  24707      15.8
>>   4    22.504  99.80   0.14  167188    2385   164786   0.75  29353      12.3
>>   8    44.441  99.86   0.11  389153    1616   387401   0.67  38190       7.6
>>
>> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
>> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
>> The cost is at most 0.4% more find time.
>>
>> Additional performance results follow.  A negative "speedup" is a
>> regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
>> is set to 15 msec.  Otherwise, preemptions increase at higher loads and
>> distort the comparison between baseline and new.
>>
>> -- 1 Socket Results --
>>
>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench  process 10
>>
>>             --- base --      --- new ---
>>   groups    time  %stdev     time  %stdev  %speedup
>>        1   8.008     0.1    5.905     0.2      35.6
>>        2  13.814     0.2   11.438     0.1      20.7
>>        3  19.488     0.2   16.919     0.1      15.1
>>        4  25.059     0.1   22.409     0.1      11.8
>>        8  47.478     0.1   44.221     0.1       7.3
>>
>> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
>> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench  process 10
>>
>> 

Re: [Announce] LPC 2018: Testing and Fuzzing Microconference

2018-11-08 Thread Dhaval Giani
On Mon, Nov 5, 2018 at 10:05 AM Gustavo Padovan wrote:
>
> Hi Dhaval,
>
> On 9/19/18 7:13 PM, Dhaval Giani wrote:
> > Hi folks,
> >
> > Sasha and I are pleased to announce the Testing and Fuzzing track at
> > LPC [ 1 ]. We are planning to continue the discussions from last
> > year's microconference [2]. Many discussions from the Automated
> > Testing Summit [3] will also continue, and a final agenda will come up
> > only soon after that.
> >
> > Suggested Topics
> >
> > - Syzbot/syzkaller
> > - ATS
> > - Distro/stable testing
> > - kernelci
> > - kernelci auto bisection
>
> Having 2 kernelci talks doesn't make too much sense. I discussed with
> Kevin and we think it would be a good idea to merge them together. Could
> you do that?
>

OK, we can make that happen. 45 minutes for the 2 combined topics?

> Thanks,
>
> Gustavo
>
>
> Gustavo Padovan
> Collabora Ltd
>


Re: [Announce] LPC 2018: Testing and Fuzzing Microconference

2018-10-10 Thread Dhaval Giani
On Mon, Oct 8, 2018 at 11:23 AM Steven Rostedt  wrote:
>
> On Mon, 8 Oct 2018 19:02:51 +0200
> Dmitry Vyukov  wrote:
>
> > On Wed, Sep 19, 2018 at 7:13 PM, Dhaval Giani  
> > wrote:
> > > Hi folks,
> > >
> > > Sasha and I are pleased to announce the Testing and Fuzzing track at
> > > LPC [ 1 ]. We are planning to continue the discussions from last
> > > year's microconference [2]. Many discussions from the Automated
> > > Testing Summit [3] will also continue, and a final agenda will come up
> > > only soon after that.
> > >
> > > Suggested Topics
> > >
> > > - Syzbot/syzkaller
> > > - ATS
> > > - Distro/stable testing
> > > - kernelci
> > > - kernelci auto bisection
> > > - Unit testing framework
> > >
> > > We look forward to other interesting topics for this microconference
> > > as a reply to this email.
> >
> > Hi Dhaval and Sasha,
> >
> > My syzbot talk wasn't accepted to main track, so I would like to do
> > more or less full-fledged talk on the microconf. Is it possible?
>
> Hi Dmitry,
>
> Note, microconfs are not for full-fledged talks. They are to be
> discussion focused. You can have a 5-10 minute presentation that leads
> up to discussion of future work, but we like to refrain from any talks
> about what was done if there's nothing to go forward with.

Dmitry,

Can you clarify the scope of what you want to discuss during the
microconference? Further to what Steven said, we don't want
presentations (So 3, maybe 4 slides). We want discussions about future
work.

Thanks!
Dhaval


Re: [Announce] LPC 2018: Testing and Fuzzing Microconference

2018-10-03 Thread Dhaval Giani
On Tue, Oct 2, 2018 at 2:03 PM Sasha Levin  wrote:
>
> On Tue, Oct 2, 2018 at 4:44 PM Liam R. Howlett  
> wrote:
> >
> > * Dhaval Giani  [180919 13:15]:
> > > Hi folks,
> > >
> > > Sasha and I are pleased to announce the Testing and Fuzzing track at
> > > LPC [ 1 ]. We are planning to continue the discussions from last
> > > year's microconference [2]. Many discussions from the Automated
> > > Testing Summit [3] will also continue, and a final agenda will come up
> > > only soon after that.
> > >
> > > Suggested Topics
> > >
> > > - Syzbot/syzkaller
> > > - ATS
> > > - Distro/stable testing
> > > - kernelci
> > > - kernelci auto bisection
> > > - Unit testing framework
> > >
> > > We look forward to other interesting topics for this microconference
> > > as a reply to this email.
> > >
> > > Thanks!
> > > Dhaval and Sasha
> > >
> > > [1] https://blog.linuxplumbersconf.org/2018/testing-and-fuzzing-mc/
> > > [2] https://lwn.net/Articles/735034/
> > > [3] https://elinux.org/Automated_Testing_Summit
> >
> >
> > Hello,
> >
> > I have a new way to analyze binaries to detect specific calls without
> > the need for source.  I would like to discuss Machine Code Trace
> > (MCTrace) at the Testing and Fuzzing LPC track.  MCTrace intercepts the
> > application prior to execution and does not rely on a specific user
> > input. It then decodes the machine instructions to follow all control
> > flows to their natural conclusions.  This includes control flows that go
> > beyond the boundaries of the static executable code into shared
> > libraries. This new technique avoids false positives which could be
> > produced by static analysis and includes paths that could be missed by
> > dynamic tracing.  This type of analysis could be useful in both testing
> > and fuzzing by providing a call graph to a given function.
> >
> > MCTrace was initially designed to help generate the seccomp() filter
> > list, which is a whitelist/blacklist of system calls for a specific
> > application. Seccomp filters easily become outdated when the application
> > or shared library is updated. This can cause failures or security
> > issues [ 1 ].  Other potential uses include examining binary blobs,
> > vulnerability analysis, and debugging.
>
> Hi Liam,
>
> Is MCTrace available anywhere?
>

Sasha,

MCTrace is an early prototype, really needing a lot of feedback. I
will let Liam send more details (somehow he got dropped from the cc).

Dhaval

>
> --
> Thanks,
> Sasha


[Announce] LPC 2018: Testing and Fuzzing Microconference

2018-09-19 Thread Dhaval Giani
Hi folks,

Sasha and I are pleased to announce the Testing and Fuzzing track at
LPC [ 1 ]. We are planning to continue the discussions from last
year's microconference [2]. Many discussions from the Automated
Testing Summit [3] will also continue; a final agenda will come together
soon after that.

Suggested Topics

- Syzbot/syzkaller
- ATS
- Distro/stable testing
- kernelci
- kernelci auto bisection
- Unit testing framework

We look forward to other interesting topics for this microconference
as a reply to this email.

Thanks!
Dhaval and Sasha

[1] https://blog.linuxplumbersconf.org/2018/testing-and-fuzzing-mc/
[2] https://lwn.net/Articles/735034/
[3] https://elinux.org/Automated_Testing_Summit


Re: [PATCH v3 0/4] Ktest: add email support

2018-04-03 Thread Dhaval Giani
On 2018-03-26 04:08 PM, Tim Tianyang Chen wrote:
> This patch set will let users define a mailer, an email address and when to 
> receive
> notifications during automated testings. Users need to setup the specified 
> mailer
> prior to using this feature.
> 
> Tim Tianyang Chen (4):
>   Ktest: add email support
>   Ktest: add SigInt handling
>   Ktest: use dodie for critical falures
>   Ktest: add email options to sample.config
> 
>  ktest.pl| 125 
> +---
>  sample.conf |  22 +++
>  2 files changed, 117 insertions(+), 30 deletions(-)
> 

Steve,

Any thoughts?

Thanks!
Dhaval


Re: [PATCH] lockdep: Show up to three levels for a deadlock scenario

2017-12-19 Thread Dhaval Giani
On 2017-12-19 11:52 AM, Steven Rostedt wrote:
> On Tue, 19 Dec 2017 17:46:19 +0100
> Peter Zijlstra  wrote:
> 
> 
>> It really isn't that hard, it's mostly a question of TL;DR.
>>
>> #0 is useless and should be thrown out
>> #1 shows where we take #1 while holding #0
>> ..
>> #n shows where we take #n while holding #n-1
>>
>> And the bottom callstack shows where we take #0 while holding #n. Which
>> gets you a nice circle in your graph, which spells deadlock.
>>
>> Plenty people have shown they get this stuff.
> 
> 
> Then I suggest that you can either take my patch to improve the
> visual or remove the visual completely, as nobody cares about it.
> 

I prefer the former. As Steven has mentioned elsewhere, people find
lockdep output hard to follow (enough that he has given talks :) )

Dhaval


Re: [PATCH] lockdep: Show up to three levels for a deadlock scenario

2017-12-19 Thread Dhaval Giani
On 2017-12-14 12:59 PM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:38:52PM -0500, Steven Rostedt wrote:
>>
>> Currently, when lockdep detects a possible deadlock scenario that involves 3
>> or more levels, it just shows the chain, and a CPU sequence order of the
>> first and last part of the scenario, leaving out the middle level and this
>> can take a bit of effort to understand. By adding a third level, it becomes
>> easier to see where the deadlock is.
> 
> So is anybody actually using this? This (together with the callchain for
> #0) is always the first thing of the lockdep output I throw away.
> 

Yes :-). The other stuff is unreadable to people not you.

Dhaval


Re: [PATCH 0/2] [RFC] Ktest: add email support

2017-12-14 Thread Dhaval Giani
On 2017-12-06 04:40 PM, Steven Rostedt wrote:
> Hi,
> 
> Currently traveling and now I have very poor connectivity. I won't be able to 
> do anything this week.
> 

ping! :)

Dhaval


Re: [PATCH 0/2] [RFC] Ktest: add email support

2017-12-06 Thread Dhaval Giani
On 2017-12-01 06:55 PM, Steven Rostedt wrote:
> On Tue, 21 Nov 2017 10:53:27 -0800
> Tim Tianyang Chen  wrote:
> 
>> This patch series will let users define mailer and email address for 
>> receiving
>> notifications during automated testings. Users need to setup the specified 
>> mailer
>> prior to using this feature.
>>
>> Emails will be sent when the script completes, is aborted due to errors or 
>> interrupted
>> by Sig-Int.
>>
> 
> Hi Tim,
> 
> I was hoping to get to these this week, but unfortunately I wasn't able
> to finish my current workload. I leave tomorrow for Germany, and
> hopefully I can spend some time looking at these on that trip.
> 
> Feel free to send me a ping if you don't hear from me next week.
> 

Ping!

Dhaval


Re: cgroups and nice

2016-11-28 Thread Dhaval Giani
[Resending because gmail doesn't understand when to go plaintext :-) ]
[Added a few other folks who might have something to say about it]

On Fri, Nov 25, 2016 at 9:34 AM, Marat Khalili  wrote:
> I have a question as a cgroup cpu limits user: how does it interact with
> nice? Documentation creates the impression that, as long as number of
> processes demanding the cpu time exceeds number of available cores, time
> allocated will be proportional to configured cpu.shares. However, in
> practice I observe that group with niced processes significantly under
> perform.
>
> For example, suppose on a 6-core box /cgroup/cpu/group1/cpu.shares is 400,
> and /cgroup/cpu/group2/cpu.shares is 200.
> 1) If I run `stress -c 6` in both groups, I should see approximately 400% of
> cpu time in group1 and 200% in group2 in top output, regardless of their
> relative nice value.
> 2) If I run `nice -n 19 stress -c 1` in cgroup1 and `stress -c 24` in
> group2, I should see at least 100% of cpu time in group1.
>
> What I see is significantly less cpu time in group1 if group1 processes
> happen to have greater nice value, and especially if group2 have greater
> number of processes involved: cpu load of group1 in example 2 can be as low
> as 20%. It may create tensions among users in my case; how can this be
> avoided except by renicing all processes to the same value?
>
>> $ uname -a
>> Linux redacted 2.6.32-642.11.1.el6.x86_64 #1 SMP Fri Nov 18 19:25:05 UTC
>> 2016 x86_64 x86_64 x86_64 GNU/Linux
>

This is an old version of the kernel. Do you see the same behavior on
a newer version of the kernel? (4.8 is the latest stable kernel)

>
>> $ lsb_release -a
>> LSB Version:
>> :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>> Distributor ID: CentOS
>> Description:CentOS release 6.8 (Final)
>> Release:6.8
>> Codename:   Final
>
>
> (My apologies if I'm posting to incorrect list.)
>
> --
>
> With Best Regards,
> Marat Khalili
> --

Thanks,
Dhaval


Re: [PATCH v2 tip/core/rcu 05/13] decnet: Apply rcu_access_pointer() to avoid sparse false positive

2013-10-09 Thread Dhaval Giani
On Wed, Oct 9, 2013 at 5:29 PM, Paul E. McKenney
 wrote:
>
> From: "Paul E. McKenney" 
>
> The sparse checking for rcu_assign_pointer() was recently upgraded
> to reject non-__kernel address spaces.  This also rejects __rcu,
> which is almost always the right thing to do.  However, the use in
> dn_insert_route() is legitimate: It is assigning a pointer to an element
> from an RCU-protected list, and all elements of this list are already
> visible to caller.
>
> This commit therefore silences this false positive by laundering the
> pointer using rcu_access_pointer() as suggested by Josh Triplett.
>
> Reported-by: kbuild test robot 


I did not realize that we were allowed to rename people :-)

Thanks!
Dhaval


[PATCH] ftrace: Fixup !CONFIG_TRACING trace_dump_stack

2013-08-02 Thread Dhaval Giani
Hi Steve,

And since gmail will mangle this up, I have attached it as well.

Thanks!
Dhaval

commit 6379b752b4c9e5f9edf9894723be7520a987d2b5
Author: Dhaval Giani 
Date:   Fri Aug 2 14:42:53 2013 -0400

ftrace: Fixup !CONFIG_TRACING trace_dump_stack

!TRACING does not take an argument for trace_dump_stack. Fix it.

Signed-off-by: Dhaval Giani 

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e9ef6d6..4b7cc46 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -631,7 +631,7 @@ extern void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
 static inline void tracing_start(void) { }
 static inline void tracing_stop(void) { }
 static inline void ftrace_off_permanent(void) { }
-static inline void trace_dump_stack(void) { }
+static inline void trace_dump_stack(int skip) { }

 static inline void tracing_on(void) { }
 static inline void tracing_off(void) { }




Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 2013-07-25 1:53 PM, Jörn Engel wrote:
> On Thu, 25 July 2013 09:42:18 -0700, Taras Glek wrote:
>> Footprint wins are useful on android, but it's the
>> increased IO throughput on crappy storage devices that makes this
>> most attractive.
>
> All the world used to be a PC.  Seems to be Android these days.
>
> The biggest problem with compression support in the past was the
> physical properties of hard drives (the spinning type, if you can
> still remember those).  A random seek is surprisingly expensive, of a
> similar cost to 1MB or more of linear read.  So anything that
> introduces more random seeks will kill the preciously little
> performance you had to begin with.
>
> As long as files are write-once and read-only from that point on, you
> can just append a bunch of compressed chunks on the disk and nothing
> bad happens.  But if you have a read-write file with random overwrites
> somewhere in the middle, those overwrites will change the size of the
> compressed data.  You have to free the old physical blocks on disk and
> allocate new ones.  In effect, you have auto-fragmentation.
>
> So if you want any kind of support for your approach, I suspect you
> should either limit it to write-once files or prepare for a mob of
> gray-haired oldtimers with rainbow suspenders complaining about
> performance on their antiquated hardware.  And the mob may be larger
> than you think.

Yes, we plan to limit it to write-once. In order to write, you have to
replace the file.

Dhaval


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 2013-07-25 2:15 PM, Vyacheslav Dubeyko wrote:
> On Jul 25, 2013, at 8:42 PM, Taras Glek wrote:
>
> [snip]
>
>> To introduce transparent decompression. Let someone else do the
>> compression for us, and supply decompressed data on demand (in this case
>> a read call). Reduces the complexity which would otherwise have to be
>> brought into the filesystem.
>>
>> The main use for file compression for Firefox (it's useful on Linux
>> desktop too) is to improve IO-throughput and reduce startup latency. In
>> order for compression to be a net win an application should be aware of
>> what is being compressed and what isn't. For example patterns for IO on
>> large libraries (eg 30mb libxul.so) are well suited to compression, but
>> SQLite databases are not. Similarly for our disk cache: images should not
>> be compressed, but javascript should be. Footprint wins are useful on
>> android, but it's the increased IO throughput on crappy storage devices
>> that makes this most attractive.
>>
>> In addition to being aware of which files should be compressed, Firefox
>> is aware of patterns of usage of various files, so it could schedule
>> compression at the most optimal time.
>>
>> The above needs tie in nicely with the simplification of not implementing
>> compression at fs-level.
>
> There are many filesystems that use compression as an internal technique.
> And, as I understand, the implementation of compression in different Linux
> kernel filesystem drivers has similar code patterns. So, from my point of
> view, it makes sense to generalize the compression/decompression code in
> the form of a library. The API of such a generalized compression kernel
> library can be used in drivers of different file systems. Also, such a
> generalized compression library will simplify support of compression in
> file system drivers that don't currently support the compression feature.
>
> Moreover, I think that it is possible to implement compression support at
> the VFS level. Such a feature gives the opportunity to have compression
> support for filesystems that don't support compression as an internal
> technique.

I am not sure it is a very good idea at this stage.

[snip]

>> This transparent decompression idea is based on our experience with HFS+.
>> Apple uses the fs-attribute approach. OSX is able to compress application
>> libraries at installation-time, apps remain blissfully unaware but get an
>> extra boost in startup perf.
>
> HFS+ supports compression as an internal filesystem technique. It means
> that the HFS+ volume layout has metadata structures for compression
> support (compressed xattrs or compressed resource forks). So, compression
> is supported on the FS level. As I know, Mac OS X has native decompression
> support for compressed files, but you need to use a special tool for
> compression of files on HFS+. Maybe Mac OS X has an internal library that
> gives the opportunity to compress application libraries at installation
> time. But I suppose that it is simply a user-space tool or library that
> uses HFS+ compression support on the kernel-space and volume layout levels.

In addition to what Taras mentioned, there is a similar approach being
followed here. There is a compression tool to compress files at
https://github.com/glandium/faulty.lib/blob/master/linker/szip.cpp .




Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 07/24/2013 07:36 PM, Jörn Engel wrote:
> On Wed, 24 July 2013 17:03:53 -0400, Dhaval Giani wrote:
>> I am posting this series early in its development phase to solicit some
>> feedback.
>
> At this state, a good description of the format would be nice.

Sure. The format is quite simple. There is a 20 byte header followed by
an offset table giving us the offsets of 16k compressed zlib chunks (16k
is the default chunk size; it can be changed with the szip tool, and the
kernel should still decompress the file as that data is in the header).
I am not tied to the format; I used it as that is what is being used
here. My final goal is to have the filesystem agnostic of the
compression format as long as it is seekable.

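A hypothetical C rendering of that layout - the 20-byte header followed by an
offset table - with field names and order invented for illustration; the
actual szip format from faulty.lib may differ:

#include <stdint.h>

struct szip_header {		/* 5 * 4 = 20 bytes */
	uint32_t magic;		/* format identifier */
	uint32_t total_size;	/* uncompressed size of the file */
	uint32_t chunk_size;	/* uncompressed bytes per chunk (16k default) */
	uint32_t nr_chunks;	/* number of zlib-compressed chunks */
	uint32_t last_size;	/* uncompressed size of the final chunk */
	/* followed by nr_chunks uint32_t offsets, one per compressed chunk */
};

/* Seekability is the point: a read at pos only has to inflate chunk
 * pos / chunk_size, found through the offset table. */
static inline uint32_t chunk_for(uint64_t pos, uint32_t chunk_size)
{
	return (uint32_t)(pos / chunk_size);
}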




>> We are implementing transparent decompression with a focus on ext4. One
>> of the main usecases is that of Firefox on Android. Currently libxul.so
>> is compressed and it is loaded into memory by a custom linker on
>> demand. With the use of transparent decompression, we can make do
>> without the custom linker. More details (i.e. code) about the linker can
>> be found at https://github.com/glandium/faulty.lib
>
> It is not quite clear what you want to achieve here.

To introduce transparent decompression. Let someone else do the
compression for us, and supply decompressed data on demand (in this
case a read call). Reduces the complexity which would otherwise have to
be brought into the filesystem.

> One approach is
> to create an empty file, chattr it to enable compression, then write
> uncompressed data to it.  Nothing in userspace will ever know the file
> is compressed, unless you explicitly call lsattr.
>
> If you want to follow some other approach where userspace has one
> interface to write the compressed data to a file and some other
> interface to read the file uncompressed, you are likely in a world of
> pain.

Why? If it is going to be only a few applications who know the file is
compressed, and read it to get decompressed data, why would it be
painful? What about introducing a new flag, O_COMPR, which tells the
kernel: by the way, we want this file to be decompressed if it can be.
It could fall back to O_RDONLY or something like that? That gets rid of
the chattr ugliness.

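A sketch of that hypothetical O_COMPR idea (never merged; the flag value
below is invented purely so the example compiles):

#include <fcntl.h>

#ifndef O_COMPR
#define O_COMPR	040000000	/* invented value, illustration only */
#endif

static int open_maybe_decompressed(const char *path)
{
	/* Ask the kernel to hand back decompressed data on read()... */
	int fd = open(path, O_RDONLY | O_COMPR);

	/* ...falling back to a plain read-only open, as suggested above. */
	if (fd < 0)
		fd = open(path, O_RDONLY);
	return fd;
}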


> Assuming you use the chattr approach, that pretty much comes down to
> adding compression support to ext4.  There have been old patches for
> ext2 around that never got merged.  Reading up on the problems
> encountered by those patches might be instructive.

Do you have subjects for these? When I googled for ext4 compression, I
found http://code.google.com/p/e4z/ which doesn't seem to exist, and
checking in my LKML archives gives too many false positives.

Thanks!
Dhaval


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 07/24/2013 07:36 PM, Jörn Engel wrote:

On Wed, 24 July 2013 17:03:53 -0400, Dhaval Giani wrote:

I am posting this series early in its development phase to solicit some
feedback.

At this state, a good description of the format would be nice.


Sure. The format is quite simple. There is a 20 byte header followed by 
an offset table giving us the offsets of 16k compressed zlib chunks (The 
16k is the default number, it can be changed with the use of szip tool, 
the kernel should still decompress it as that data is in the header). I 
am not tied to the format. I used it as that is what being used here. My 
final goal is the have the filesystem agnostic of the compression format 
as long as it is seekable.





We are implementing transparent decompression with a focus on ext4. One
of the main usecases is that of Firefox on Android. Currently libxul.so
is compressed and it is loaded into memory by a custom linker on
demand. With the use of transparent decompression, we can make do
without the custom linker. More details (i.e. code) about the linker can
be found at https://github.com/glandium/faulty.lib

It is not quite clear what you want to achieve here.


To introduce transparent decompression. Let someone else do the 
compression for us, and supply decompressed data on demand  (in this 
case a read call). Reduces the complexity which would otherwise have to 
be brought into the filesystem.



   One approach is
to create an empty file, chattr it to enable compression, then write
uncompressed data to it.  Nothing in userspace will ever know the file
is compressed, unless you explicitly call lsattr.

If you want to follow some other approach where userspace has one
interface to write the compressed data to a file and some other
interface to read the file uncompressed, you are likely in a world of
pain.
Why? If it is going to only be a few applications who know the file is 
compressed, and read it to get decompressed data, why would it be 
painful? What about introducing a new flag, O_COMPR which tells the 
kernel, btw, we want this file to be decompressed if it can be. It can 
fallback to O_RDONLY or something like that? That gets rid of the chattr 
ugliness.



Assuming you use the chattr approach, that pretty much comes down to
adding compression support to ext4.  There have been old patches for
ext2 around that never got merged.  Reading up on the problems
encountered by those patches might be instructive.


Do you have subjects for these? When I googled for ext4 compression, I 
found http://code.google.com/p/e4z/ which doesn't seem to exist, and 
checking in my LKML archives gives too many false positives.


Thanks!
Dhaval
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 2013-07-25 2:15 PM, Vyacheslav Dubeyko wrote:

On Jul 25, 2013, at 8:42 PM, Taras Glek wrote:

[snip]

To introduce transparent decompression. Let someone else do the compression for 
us, and supply decompressed data on demand  (in this case a read call). Reduces 
the complexity which would otherwise have to be brought into the filesystem.

The main use for file compression for Firefox(it's useful on Linux desktop too) 
is to improve IO-throughput and reduce startup latency. In order for 
compression to be a net win an application should be aware of what is being 
compressed and what isn't. For example patterns for IO on large libraries (eg 
30mb libxul.so) are well suited to compression, but SQLite databases are not.  
Similarly for our disk cache: images should not be compressed, but javascript 
should be. Footprint wins are useful on android, but it's the increased IO 
throughput on crappy storage devices that makes this most attractive.

In addition of being aware of which files should be compressed, Firefox is 
aware of patterns of usage of various files it could schedule compression at 
the most optimal time.

Above needs tie in nicely with the simplification of not implementing 
compression at fs-level.

There are many filesystems that uses compression as internal technique. And, as 
I understand, implementation
of compression in different Linux kernel filesystem drivers has similar code 
patterns. So, from my point of view,
it makes sense to generalize compression/decompression code in the form of 
library. The API of such generalized
compression kernel library can be used in drivers of different file systems. 
Also such generalized compression
library will simplify support of compression in file system drivers that don't 
support compression feature currently.

Moreover, I think that it is possible to implement compression support on VFS 
level. Such feature gives
opportunity to have compression support for filesystems that don't support 
compression feature as
internal technique.


I am not sure it is a very good idea at this stage.

[snip]

This transparent decompression idea is based on our experience with HFS+. Apple 
uses the fs-attribute approach. OSX is able to compress application libraries 
at installation-time, apps remain blissfully unaware but get an extra boost in 
startup perf.


HFS+ supports compression as internal filesystem technique. It means that HFS+ 
volume layout has
metadata structures for compression support (compressed xattrs or compressed 
resource forks).
So, compression is supported on FS level. As I know, Mac OS X has native 
decompression support
for compressed files but you need to use special tool for compression of files 
on HFS+. Maybe
Mac OS X has internal library that give opportunity to compress application 
libraries at installation
time. But I suppose that it is simply user-space tool or library that uses HFS+ 
compression support
on kernel-space and volume layout levels.
In addition to what Taras mentioned, there is a similar approach being 
followed here. There is a compression tool to compress files at 
https://github.com/glandium/faulty.lib/blob/master/linker/szip.cpp .


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 2013-07-25 1:53 PM, Jörn Engel wrote:

On Thu, 25 July 2013 09:42:18 -0700, Taras Glek wrote:

Footprint wins are useful on android, but it's the
increased IO throughput on crappy storage devices that makes this
most attractive.

All the world used to be a PC.  Seems to be Android these days.

The biggest problem with compression support in the past was the
physical properties of hard drives (the spinning type, if you can
still remember those).  A random seek is surprisingly expensive, of a
similar cost to 1MB or more of linear read.  So anything that
introduces more random seeks will kill the preciously little
performance you had to begin with.

As long as files are write-once and read-only from that point on, you
can just append a bunch of compressed chunks on the disk and nothing
bad happens.  But if you have a read-write file with random overwrites
somewhere in the middle, those overwrites will change the size of the
compressed data.  You have to free the old physical blocks on disk and
allocate new ones.  In effect, you have auto-fragmentation.

So if you want any kind of support for your approach, I suspect you
should either limit it to write-once files or prepare for a mob of
gray-haired oldtimers with rainbow suspenders complaining about
performance on their antiquated hardware.  And the mob may be larger
than you think.


Yes, we plan to limit it to write-once. In order to write, you have to
replace the file.
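
(A sketch of the replace-to-write model under that constraint: regenerate the
compressed image offline and swap it in atomically. The szip command line
below is an assumption; see faulty.lib for the real tool.)

#include <stdio.h>
#include <stdlib.h>

/* Update a write-once compressed file: compress a new plain image with
 * the userspace szip tool, then rename() it into place. rename(2) is
 * atomic within a filesystem, so readers see either the old file or the
 * new one, never a partial write. */
static int replace_compressed(const char *path, const char *new_plain)
{
	char tmp[512], cmd[1600];

	snprintf(tmp, sizeof(tmp), "%s.tmp", path);
	/* assumed szip invocation */
	snprintf(cmd, sizeof(cmd), "szip -o %s %s", tmp, new_plain);
	if (system(cmd) != 0)
		return -1;
	return rename(tmp, path);
}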

Dhaval


[RFC/PATCH 1/2] szip: Add seekable zip format

2013-07-24 Thread Dhaval Giani

Add support for inflating seekable zip format. This uses zlib
underneath. In order to create a seekable zip file, use the
szip utility which can be obtained from

https://github.com/glandium/faulty.lib

We shall use this to implement transparent decompression on
ext4. The use would be very similar to that used by the faulty.lib
linker.

Cc: Theodore Ts'o 
Cc: Taras Glek 
Cc: Vladan Djeric 
Cc: linux-ext4 
Cc: LKML 
Cc: linux-fsdevel 
Cc: Mike Hommey 
Signed-off-by: Dhaval Giani 
---
 include/linux/szip.h |  32 
 lib/Kconfig  |   8 ++
 lib/Makefile |   1 +
 lib/szip.c   | 217 +++
 4 files changed, 258 insertions(+)
 create mode 100644 include/linux/szip.h
 create mode 100644 lib/szip.c

diff --git a/include/linux/szip.h b/include/linux/szip.h
new file mode 100644
index 000..1d4421e
--- /dev/null
+++ b/include/linux/szip.h
@@ -0,0 +1,32 @@
+#ifndef __SZIP_H
+#define __SZIP_H
+
+#include <linux/zlib.h>
+#include <linux/types.h>
+
+#define SZIP_HEADER_SIZE (20)
+
+struct szip_struct {
+   u32 magic;
+   u32 total_size;
+   u16 chunk_size;
+   u16 dict_size;
+   u32 nr_chunks;
+   u16 last_chunk_size;
+   signed char window_bits;
+   signed char filter;
+   unsigned *offset_table;
+   unsigned *dictionary;
+   char *buffer;
+   void *workspace;
+};
+
+extern int szip_decompress(struct szip_struct *, char *, size_t);
+extern int szip_seekable_decompress(struct szip_struct *, size_t,
+   size_t, char *, size_t);
+extern size_t szip_uncompressed_size(struct szip_struct *);
+extern int szip_init(struct szip_struct *, char *);
+extern void szip_init_offset_table(struct szip_struct *szip, char *buf);
+extern size_t szip_offset_table_size(struct szip_struct *szip);
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index fe01d41..0903693 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -213,6 +213,14 @@ config DECOMPRESS_LZO
select LZO_DECOMPRESS
tristate
 
+config SZIP
+   select ZLIB_INFLATE
+   tristate
+   help
+ Use this to provide szip decompression support. szip is a seekable
+ zlib format. Check https://github.com/glandium/faulty.lib for the
+ szip tool. This is required for transparent ext4 decompression.
+
 #
 # Generic allocator support is selected if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index c55a037..86a5d4b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -77,6 +77,7 @@ obj-$(CONFIG_LZO_COMPRESS) += lzo/
 obj-$(CONFIG_LZO_DECOMPRESS) += lzo/
 obj-$(CONFIG_XZ_DEC) += xz/
 obj-$(CONFIG_RAID6_PQ) += raid6/
+obj-${CONFIG_SZIP} += szip.o
 
 lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
 lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
diff --git a/lib/szip.c b/lib/szip.c
new file mode 100644
index 000..d610e62
--- /dev/null
+++ b/lib/szip.c
@@ -0,0 +1,217 @@
+/*
+ * lib/szip.c
+ *
+ * This is a seekable zip file, the format of which is based on
+ * code available at https://github.com/glandium/faulty.lib
+ *
+ * Copyright: Mozilla
+ * Author: Dhaval Giani 
+ *
+ * Based on code written by Mike Hommey  as
+ * part of faulty.lib .
+ *
+ * This code is available under the MPL v2.0 which is explicitly
+ * compatible with GPL v2.
+ */
+
+#include <linux/zlib.h>
+#include <linux/szip.h>
+#include <linux/vmalloc.h>
+
+#include <linux/string.h>
+
+#define SZIP_MAGIC 0x7a5a6553
+
+static int szip_decompress_seekable_chunk(struct szip_struct *szip,
+   char *output, size_t offset, size_t chunk, size_t length)
+{
+   int is_last_chunk = (chunk == szip->nr_chunks - 1);
+   size_t chunk_len = is_last_chunk ? szip->last_chunk_size
+   : szip->chunk_size;
+   z_stream zstream;
+   int ret = 0;
+   int flush;
+   int success;
+
+   memset(&zstream, 0, sizeof(zstream));
+
+   if (length == 0 || length > chunk_len)
+   length = chunk_len;
+
+   if (is_last_chunk)
+   zstream.avail_in = szip->total_size;
+   else
+   zstream.avail_in = szip->offset_table[chunk + 1]
+   - szip->offset_table[chunk];
+
+   zstream.next_in = szip->buffer + offset;
+   zstream.avail_out = length;
+   zstream.next_out = output;
+   if (!szip->workspace)
+   szip->workspace = vzalloc(zlib_inflate_workspacesize());
+   zstream.workspace = szip->workspace;
+   if (!zstream.workspace) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   /* Decompress Chunk */
+   /* **TODO: Correct return value for bad zlib format** */
+   if (zlib_inflateInit2(&zstream, (int) szip->window_bits) != Z_OK) {
+   ret = -EMEDIUMTYPE;
+   goto out;
+   }
+
+   /* We don't have dictionary logic yet */
+   if (length == chunk_len) {
+   flush = Z_FINISH;
+   success = Z_STREAM_END;
+   } else {

[RFC/PATCH 2/2] Add rudimentary transparent decompression support to ext4

2013-07-24 Thread Dhaval Giani

Adds basic support for transparently reading compressed
files in ext4.

Lots of issues in this patch
1. It requires a fully read file from disk, no seeking allowed
2. Compressed files give their compressed sizes and not uncompressed
sizes. Therefore cat will return truncated data (since the buffer
isn't big enough)
3. It adds a new file operation. That will be *removed*.
4. Doesn't mmap decompressed data

Cc: Theodore Ts'o 
Cc: Taras Glek 
Cc: Vladan Djeric 
Cc: linux-ext4 
Cc: LKML 
Cc: linux-fsdevel 
Cc: Mike Hommey 
Signed-off-by: Dhaval Giani 
---
 fs/ext4/file.c | 66 ++
 fs/read_write.c|  3 +++
 include/linux/fs.h |  1 +
 3 files changed, 70 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b1b4d51..5c9db04 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -31,6 +31,9 @@
 #include "xattr.h"
 #include "acl.h"
 
+#include <linux/zlib.h>
+#include <linux/szip.h>
+
 /*
  * Called when an inode is released. Note that this is different
  * from ext4_file_open: open gets called at every open, but release
@@ -623,6 +626,68 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int 
whence)
return -EINVAL;
 }
 
+static int ext4_is_file_compressed(struct file *file)
+{
+   struct inode *inode = file->f_mapping->host;
+   return ext4_test_inode_flag(inode, EXT4_INODE_COMPR);
+}
+
+static int _ext4_decompress(char __user *buf, int sz)
+{
+   /*
+* We can really cheat here since we have the full buffer already read
+* and made available
+*/
+   struct szip_struct szip;
+   char *temp;
+   size_t uncom_size;
+
+   int ret = szip_init(&szip, buf);
+   if (ret) {
+   ret = -1;
+   goto out;
+   }
+
+   uncom_size = szip_uncompressed_size(&szip);
+   temp = kmalloc(uncom_size, GFP_NOFS);
+   if (!temp) {
+   ret = -2;
+   goto out;
+   }
+
+   ret = szip_decompress(&szip, temp, 0);
+   if (ret) {
+   ret = -3;
+   goto out_free;
+   }
+
+   sz = min_t(int, sz, uncom_size);
+
+   memset(buf, 0, sz);
+   memcpy(buf, temp, sz);
+out_free:
+   kfree(temp);
+
+out:
+   return ret;
+
+}
+
+int ext4_decompress(struct file *file, char __user *buf, size_t len)
+{
+   int ret = 0;
+
+   if (!ext4_is_file_compressed(file))
+   return 0;
+
+   ret = _ext4_decompress(buf, len);
+   if (ret) {
+   goto out;
+   }
+out:
+   return ret;
+}
+
 const struct file_operations ext4_file_operations = {
.llseek = ext4_llseek,
.read   = do_sync_read,
@@ -640,6 +705,7 @@ const struct file_operations ext4_file_operations = {
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
.fallocate  = ext4_fallocate,
+   .decompress = ext4_decompress,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index 2cefa41..44d2523 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -330,6 +330,7 @@ int rw_verify_area(int read_write, struct file *file, 
loff_t *ppos, size_t count
return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
 }
 
+
 ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t 
*ppos)
 {
struct iovec iov = { .iov_base = buf, .iov_len = len };
@@ -345,6 +346,8 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, 
size_t len, loff_t *pp
if (-EIOCBQUEUED == ret)
	ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
+   if (filp->f_op->decompress)
+   filp->f_op->decompress(filp, buf, len);
return ret;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65c2be2..ce43e82 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1543,6 +1543,7 @@ struct file_operations {
long (*fallocate)(struct file *file, int mode, loff_t offset,
  loff_t len);
int (*show_fdinfo)(struct seq_file *m, struct file *f);
+   int (*decompress)(struct file *, char *, size_t);
 };
 
 struct inode_operations {
-- 
1.8.1.4




[RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-24 Thread Dhaval Giani

Hi there!

I am posting this series early in its development phase to solicit some
feedback.

We are implementing transparent decompression with a focus on ext4. One
of the main usecases is that of Firefox on Android. Currently libxul.so
is compressed and it is loaded into memory by a custom linker on
demand. With the use of transparent decompression, we can make do
without the custom linker. More details (i.e. code) about the linker can
be found at https://github.com/glandium/faulty.lib

Patch 1 introduces the seekable zip format to the kernel. The tool to
create the szip file can be found in the git repository mentioned
earlier. Patch 2 introduces transparent decompression to ext4. This
patch is really ugly, but it gives an idea of what I am up to right now.

Now let's move on to the interesting bits.

There are a few flaws with the current approach (though most are easily
fixable)
1. The decompression takes place very late. We probably want to be
decompressing soon after we get the data off disk.
2. No seek support. This is for simplicity as I was experimenting with
filesystems for the first time. I have a patch that does it, but it is
too ugly to see the world. I will fix it up in time for the next set.
3. No mmap support. For a similar reason as 1. There is no reason it
cannot be done, it just has not been done correctly.
4. stat still returns the compressed size. We need to modify
compressed files to return the uncompressed size.
5. Implementation is tied to the szip format. However it is quite easy
to decouple the compression scheme from the filesystem. I will probably
get to that in another 2 rounds (first goal is to get seek support
working fine, and mmap in place)
6. Introduction of an additional file_operation to decompress the
buffer. This will be *removed* in the next posting once I have seek
support implemented properly.
7. The compressed file is read only. In order to write to the file, it
shall have to be replaced.
8. The kernel learns that the file is compressed with the use of the
chattr tool. For now I am abusing the +c flag. Please let me know if
that should not be used.

In order to try this patch out, please create an szip file using the
szip tool. Then, read the file. Just ensure that the buffer you provide
to the kernel is big enough to fit the uncompressed file (and that you
read the whole file in one go.)
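
(A sketch of such a test read under the constraints above. Because stat still
reports the compressed size (flaw 4), the buffer bound is an assumed constant
that must exceed the uncompressed size.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define UNCOMP_MAX (64 << 20)	/* assumed upper bound on uncompressed size */

int main(int argc, char **argv)
{
	char *buf = malloc(UNCOMP_MAX);
	ssize_t n;
	int fd;

	if (argc < 2 || !buf)
		return 1;
	fd = open(argv[1], O_RDONLY);	/* an szip file flagged via chattr +c */
	if (fd < 0)
		return 1;
	/* One read() of the whole file: there is no seek support yet. */
	n = read(fd, buf, UNCOMP_MAX);
	if (n > 0)
		fwrite(buf, 1, n, stdout);
	close(fd);
	free(buf);
	return 0;
}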

Thanks!
Dhaval

--
Dhaval Giani (2):
  szip: Add seekable zip format
  Add rudimentary transparent decompression support to ext4

 fs/ext4/file.c   |  66 
 fs/read_write.c  |   3 +
 include/linux/fs.h   |   1 +
 include/linux/szip.h |  32 
 lib/Kconfig  |   8 ++
 lib/Makefile |   1 +
 lib/szip.c   | 217 +++
 7 files changed, 328 insertions(+)
 create mode 100644 include/linux/szip.h
 create mode 100644 lib/szip.c

-- 
1.8.1.4



Re: [PATCH 5/8] vrange: Add new vrange(2) system call

2013-06-20 Thread Dhaval Giani

On 2013-06-12 12:22 AM, John Stultz wrote:

From: Minchan Kim 

This patch adds new system call sys_vrange.

NAME
vrange - Mark or unmark range of memory as volatile

SYNOPSIS
int vrange(unsigned_long start, size_t length, int mode,
 int *purged);

DESCRIPTION
Applications can use vrange(2) to advise the kernel how it should
handle paging I/O in this VM area.  The idea is to help the kernel
discard pages of vrange instead of reclaiming when memory pressure
happens. It means kernel doesn't discard any pages of vrange if
there is no memory pressure.

mode:
VRANGE_VOLATILE
hint to kernel so VM can discard in vrange pages when
memory pressure happens.
VRANGE_NONVOLATILE
hint to kernel so VM doesn't discard vrange pages
any more.

If a user tries to access purged memory without a VRANGE_NONVOLATILE
call, they can encounter SIGBUS if the page was discarded by the kernel.


I wonder if it would be possible to provide additional information here, 
for example "purge range at a time" as opposed to "purge page at a 
time". There are some valid use cases for both approaches and it doesn't 
make sense to deny one use case.
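
(For reference, a minimal usage sketch of the API quoted above. vrange(2) was
never merged, so the syscall number and flag values below are assumptions
matching this series, not real kernel ABI.)

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange		313	/* hypothetical syscall number */
#define VRANGE_VOLATILE		0	/* assumed flag values */
#define VRANGE_NONVOLATILE	1

int main(void)
{
	size_t len = 16 * 4096;
	int purged = 0;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0xaa, len);		/* fill a regenerable cache */
	/* Volatile: the kernel may now discard these pages under pressure. */
	syscall(__NR_vrange, p, len, VRANGE_VOLATILE, &purged);
	/* ... later, before touching the data again ... */
	syscall(__NR_vrange, p, len, VRANGE_NONVOLATILE, &purged);
	if (purged)
		memset(p, 0xaa, len);	/* contents were dropped; rebuild */
	return 0;
}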


Thanks!
Dhaval


Re: [PATCH 0/8] Volatile Ranges (v8?)

2013-06-19 Thread Dhaval Giani

On 2013-06-19 12:41 AM, Minchan Kim wrote:

Hello Dhaval,

On Tue, Jun 18, 2013 at 12:59:02PM -0400, Dhaval Giani wrote:

On 2013-06-18 12:11 AM, Minchan Kim wrote:

Hello Dhaval,

On Mon, Jun 17, 2013 at 12:24:07PM -0400, Dhaval Giani wrote:

Hi John,

I have been giving your git tree a whirl, and in order to simulate a
limited memory environment, I was using memory cgroups.

The program I was using to test is attached here. It is your test
code, with some changes (changing the syscall interface, reducing
the memory pressure to be generated).

I trapped it in a memory cgroup with 1MB memory.limit_in_bytes and hit this,

[  406.207612] [ cut here ]
[  406.207621] kernel BUG at mm/vrange.c:523!
[  406.207626] invalid opcode:  [#1] SMP
[  406.207631] Modules linked in:
[  406.207637] CPU: 0 PID: 1579 Comm: volatile-test Not tainted

Thanks for the testing!
Does below patch fix your problem?

Yes it does! Thank you very much for the patch.

Thanks for confirming.
While I tested it, I found several problems, so I just sent fixes as
replies to [7/8] and [8/8].
Could you test them?


Great! These patches (seem to) fix another issue I noticed yesterday 
with signal handling. I have pushed out my code for testing this stuff 
at https://github.com/volatile-ranges-test/vranges-test . The code and 
the scripts are still unpolished (as in you don't get a pass or fail) 
but they seem to work just fine.




FYI: John, Dhaval

I am working on cleaning up the purging mess, so the purging part may
need more than a few changes.


Great, I will also take a look at the code.

Dhaval


Re: [PATCH 0/8] Volatile Ranges (v8?)

2013-06-18 Thread Dhaval Giani

On 2013-06-18 12:11 AM, Minchan Kim wrote:

Hello Dhaval,

On Mon, Jun 17, 2013 at 12:24:07PM -0400, Dhaval Giani wrote:

Hi John,

I have been giving your git tree a whirl, and in order to simulate a
limited memory environment, I was using memory cgroups.

The program I was using to test is attached here. It is your test
code, with some changes (changing the syscall interface, reducing
the memory pressure to be generated).

I trapped it in a memory cgroup with 1MB memory.limit_in_bytes and hit this,

[  406.207612] [ cut here ]
[  406.207621] kernel BUG at mm/vrange.c:523!
[  406.207626] invalid opcode:  [#1] SMP
[  406.207631] Modules linked in:
[  406.207637] CPU: 0 PID: 1579 Comm: volatile-test Not tainted

Thanks for the testing!
Does below patch fix your problem?


Yes it does! Thank you very much for the patch.

Thanks!
Dhaval



Re: [PATCH 0/8] Volatile Ranges (v8?)

2013-06-17 Thread Dhaval Giani

Hi John,

I have been giving your git tree a whirl, and in order to simulate a 
limited memory environment, I was using memory cgroups.


The program I was using to test is attached here. It is your test code, 
with some changes (changing the syscall interface, reducing the memory 
pressure to be generated).
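
(Roughly how a 1MB memory cgroup like the one below can be set up, as a
sketch; cgroup v1 paths as they existed at the time, and the group name is
assumed.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return;
	n = write(fd, val, strlen(val));
	(void)n;
	close(fd);
}

int main(void)
{
	char pid[16];

	mkdir("/sys/fs/cgroup/memory/vrange-test", 0755);
	write_str("/sys/fs/cgroup/memory/vrange-test/memory.limit_in_bytes",
		  "1M");
	snprintf(pid, sizeof(pid), "%d", getpid());
	write_str("/sys/fs/cgroup/memory/vrange-test/tasks", pid);
	/* exec the volatile-test binary from here, now under the limit */
	return 0;
}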


I trapped it in a memory cgroup with 1MB memory.limit_in_bytes and hit this,

[  406.207612] [ cut here ]
[  406.207621] kernel BUG at mm/vrange.c:523!
[  406.207626] invalid opcode:  [#1] SMP
[  406.207631] Modules linked in:
[  406.207637] CPU: 0 PID: 1579 Comm: volatile-test Not tainted 
3.10.0-rc5+ #2
[  406.207650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006
[  406.207655] task: 880006fe ti: 88001c8b task.ti: 
88001c8b
[  406.207659] RIP: 0010:[81155758] [81155758] 
try_to_discard_one+0x1f8/0x210

[  406.207667] RSP: :88001c8b1598  EFLAGS: 00010246
[  406.207671] RAX:  RBX: 7fde082c RCX: 
88001f199600
[  406.207675] RDX: 0006 RSI: 0007 RDI: 

[  406.207679] RBP: 88001c8b15f8 R08: 0591 R09: 
0055
[  406.207683] R10:  R11:  R12: 
ea2ae2c0
[  406.207687] R13: 88001ef9e540 R14: 88001ef9e5e0 R15: 
88000b7cfda8
[  406.207692] FS:  7fde08320740() GS:88001fc0() 
knlGS:

[  406.207696] CS:  0010 DS:  ES:  CR0: 8005003b
[  406.207700] CR2: 7fde082c CR3: 1f131000 CR4: 
06f0
[  406.207707] DR0:  DR1:  DR2: 

[  406.207711] DR3:  DR6: 0ff0 DR7: 
0400

[  406.207715] Stack:
[  406.207719]  0006 88001f199600 88001ef9e5d8 
81154f16
[  406.207724]  8801 ea7c6670 88001c8b15f8 
ea2ae2c0
[  406.207729]  88001f1386c0 88001ef9e5d8 88000b7cfda8 
880005110a10

[  406.207734] Call Trace:
[  406.207743]  [81155b32] discard_vpage+0x3c2/0x410
[  406.207753]  [81150881] ? page_referenced+0x241/0x2c0
[  406.207762]  [8112e627] shrink_page_list+0x397/0x950
[  406.207770]  [8112f12f] shrink_inactive_list+0x14f/0x400
[  406.207778]  [8112f959] shrink_lruvec+0x229/0x4e0
[  406.207787]  [8107e597] ? wake_up_process+0x27/0x50
[  406.207795]  [8112fc76] shrink_zone+0x66/0x1a0
[  406.207803]  [81130130] do_try_to_free_pages+0x110/0x5a0
[  406.207812]  [8113074f] try_to_free_mem_cgroup_pages+0xbf/0x140
[  406.207821]  [81179f6e] mem_cgroup_reclaim+0x4e/0xe0
[  406.207829]  [8117a4ef] __mem_cgroup_try_charge+0x4ef/0xbb0
[  406.207837]  [8117b29d] mem_cgroup_charge_common+0x6d/0xd0
[  406.207846]  [8117cbeb] mem_cgroup_newpage_charge+0x3b/0x50
[  406.207854]  [81142170] do_wp_page+0x150/0x720
[  406.207862]  [811448ed] handle_pte_fault+0x98d/0xae0
[  406.207871]  [811452c4] handle_mm_fault+0x264/0x5e0
[  406.207880]  [8161c5b1] __do_page_fault+0x171/0x4e0
[  406.207888]  [8161c92e] ? do_page_fault+0xe/0x10
[  406.207896]  [81619172] ? page_fault+0x22/0x30
[  406.207905]  [8161c92e] do_page_fault+0xe/0x10
[  406.207913]  [81619172] page_fault+0x22/0x30
[  406.207917] Code: c1 e7 39 48 09 c7 f0 49 ff 8d e8 02 00 00 48 89 55 
a0 48 89 4d a8 e8 78 42 00 00 85 c0 48 8b 55 a0 48 8b 4d a8 0f 85 50 ff 
ff ff 0f 0b 66 0f 1f 44 00 00 31 db e9 7a fe ff ff 0f 0b e8 c1 aa 4b

[  406.207937] RIP  [81155758] try_to_discard_one+0x1f8/0x210
[  406.207941]  RSP 88001c8b1598
[  406.207946] ---[ end trace fe9729b910a78aff ]---
[  406.207951] [ cut here ]
[  406.207957] WARNING: at kernel/exit.c:715 do_exit+0x55/0xa30()
[  406.207960] Modules linked in:
[  406.207965] CPU: 0 PID: 1579 Comm: volatile-test Tainted: G D  
3.10.0-rc5+ #2
[  406.207969] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006
[  406.207973]  0009 88001c8b1288 81612a03 
88001c8b12c8
[  406.207978]  81049bb0 88001c8b14e8 000b 
88001c8b14e8
[  406.207983]  0246  880006fe 
88001c8b12d8

[  406.207988] Call Trace:
[  406.207997]  [81612a03] dump_stack+0x19/0x1b
[  406.208189]  [81049bb0] warn_slowpath_common+0x70/0xa0
[  406.208207]  [81049bfa] warn_slowpath_null+0x1a/0x20
[  406.208222]  [8104f2e5] do_exit+0x55/0xa30
[  406.208238]  [8160e4e0] ? printk+0x61/0x63
[  406.208253]  [81619c9b] oops_end+0x9b/0xe0
[  406.208269]  [81005908] die+0x58/0x90
[  406.208285]  [8161956b] do_trap+0x6b/0x170
[  406.208298]  [8161c9b2] ? 
__atomic_notifier_call_chain+0x12/0x20

[  406.208309]  [81002e75] 

Re: [BUG] perf report: different reports when run on terminal as opposed to script

2012-10-31 Thread Dhaval Giani
On Wed, Oct 31, 2012 at 3:12 AM, Namhyung Kim  wrote:
> On Tue, 30 Oct 2012 08:05:45 -0400, Dhaval Giani wrote:
>> On Tue, Oct 30, 2012 at 3:42 AM, Namhyung Kim  wrote:
>>> Hi Dhaval,
>>>
>>> On Mon, 29 Oct 2012 12:45:53 -0400, Dhaval Giani wrote:
>>>> On Mon, Oct 29, 2012 at 12:01 PM, Dhaval Giani  
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> As part of a class assignment I have to collect some performance
>>>>> statistics. In order to do so I run
>>>>>
>>>>> perf record -g <the program I have to profile>
>>>>>
>>>>> And in another window, I start 200 threads of the load generator
>>>>> (which is not recorded by perf)
>>>>>
>>>>> This generates me statistics that I expect to see, and I am happy. As
>>>>> this is academia and a class assignment, I need to collect information
>>>>> and analyze it across different setups. Which of course meant I script
>>>>> this whole thing, which basically is
>>>>>
>>>>> for i in all possibilities
>>>>> do
>>>>> perf record -g <the program I have to profile> &
>>>>> WAITPID=$!
>>>>> for j in NR_THREADS
>>>>> do
>>>>> <start load generator> &
>>>>> KILLPID=$!
>>>>> done
>>>>> wait $PID
>>>
>>> You meant $WAITPID, right?
>>>
>>
>> yes. grrr. I changed the name here to WAITPID for it to be clear and
>> that was a fail. (I blame the cold)
>>
>>>
>>>>> kill $KILLPID
>>>
>>> Doesn't it kill the last load generator only?
>>>
>>>
>>
>> Well, this was a bug in me typing the pseudo code. the actual script
>> does "$KILLPID $!"
>
> Okay, so I suspect that it might be affected by the autogroup scheduling
> feature since you said running load generators in another window - I
> guess it's a terminal.  How about running them with setsid?
>

Why would that affect the data collection for the program being
profiled? The time spent (since it is a compute intensive program) in
various functions shouldn't change, correct? (Unless I am missing
something).

/me goes and tries it out

Hmm. OK, so that doesn't help. Still the same.

Thanks!
Dhaval


Re: [BUG] perf report: different reports when run on terminal as opposed to script

2012-10-30 Thread Dhaval Giani
On Tue, Oct 30, 2012 at 3:42 AM, Namhyung Kim  wrote:
> Hi Dhaval,
>
> On Mon, 29 Oct 2012 12:45:53 -0400, Dhaval Giani wrote:
>> On Mon, Oct 29, 2012 at 12:01 PM, Dhaval Giani  
>> wrote:
>>> Hi,
>>>
>>> As part of a class assignment I have to collect some performance
>>> statistics. In order to do so I run
>>>
>>> perf record -g <the program I have to profile>
>>>
>>> And in another window, I start 200 threads of the load generator
>>> (which is not recorded by perf)
>>>
>>> This generates me statistics that I expect to see, and I am happy. As
>>> this is academia and a class assignment, I need to collect information
>>> and analyze it across different setups. Which of course meant I script
>>> this whole thing, which basically is
>>>
>>> for i in all possibilities
>>> do
>>> perf record -g <the program I have to profile> &
>>> WAITPID=$!
>>> for j in NR_THREADS
>>> do
>>> <start load generator> &
>>> KILLPID=$!
>>> done
>>> wait $PID
>
> You meant $WAITPID, right?
>

yes. grrr. I changed the name here to WAITPID for it to be clear and
that was a fail. (I blame the cold)

>
>>> kill $KILLPID
>
> Doesn't it kill the last load generator only?
>
>

Well, this was a bug in me typing the pseudo code. the actual script
does "$KILLPID $!"

Dhaval


Re: [BUG] perf report: different reports when run on terminal as opposed to script

2012-10-29 Thread Dhaval Giani
On Mon, Oct 29, 2012 at 12:01 PM, Dhaval Giani  wrote:
> Hi,
>
> As part of a class assignment I have to collect some performance
> statistics. In order to do so I run
>
> perf record -g <the program I have to profile>
>
> And in another window, I start 200 threads of the load generator
> (which is not recorded by perf)
>
> This generates the statistics that I expect to see, and I am happy. As
> this is academia and a class assignment, I need to collect information
> and analyze it across different setups. Which of course meant I script
> this whole thing, which basically is
>
> for i in all possibilities
> do
> perf record -g <the program I have to profile> &
> WAITPID=$!
> for j in NR_THREADS
> do
> <start load generator> &
> KILLPID=$!
> done
> wait $PID
> kill $KILLPID
> mv perf.data results/perf.data.$i
> done
>
> (This is a basic pseudo-script of what I am doing), and it results in
> my profile being topped by _vscanf(), with the function that dominated
> the older report dropping down to something like 5% (as opposed to
> 16-17%)
>
> Have I misunderstood how perf works? Something deeper? I am currently
> on 3.6.3. I can update to the latest upstream and report back. Any
> debug code is very welcome. I can also make my toy program and the
> scripts available for you to try out.

I just updated to 6b0cb4eef7bdaa27b8021ea81813fba330a2d94d and I still
see this happen.

Thanks!
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG] perf report: different reports when run on terminal as opposed to script

2012-10-29 Thread Dhaval Giani
Hi,

As part of a class assignment I have to collect some performance
statistics. In order to do so I run

perf record -g <the program I have to profile>

And in another window, I start 200 threads of the load generator
(which is not recorded by perf)

This generates the statistics that I expect to see, and I am happy. As
this is academia and a class assignment, I need to collect information
and analyze it across different setups. Which of course meant I script
this whole thing, which basically is

for i in all possibilities
do
perf record -g <the program I have to profile> &
WAITPID=$!
for j in NR_THREADS
do
<start load generator> &
KILLPID=$!
done
wait $PID
kill $KILLPID
mv perf.data results/perf.data.$i
done

(This is a basic pseudo-script of what I am doing), and it results in
my profile being topped by _vscanf(), with the function that dominated
the older report dropping down to something like 5% (as opposed to
16-17%)

Have I misunderstood how perf works? Something deeper? I am currently
on 3.6.3. I can update to the latest upstream and report back. Any
debug code is very welcome. I can also make my toy program and the
scripts available for you to try out.

Thanks!
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] RCU documentation: Correct the name of a reference

2012-10-17 Thread Dhaval Giani
[Attaching the patch as gmail breaks the patches]

Trying to go through the history of RCU (not for the weak
minded) led me to search for a non-existent paper.

Correct it to the actual reference

Signed-off-by: Dhaval Giani 
Cc: Paul McKenney 
Cc: Peter Zijlstra 

Index: linux/Documentation/RCU/RTFP.txt
===
--- linux.orig/Documentation/RCU/RTFP.txt
+++ linux/Documentation/RCU/RTFP.txt
@@ -186,7 +186,7 @@ Bibtex Entries

 @article{Kung80
 ,author="H. T. Kung and Q. Lehman"
-,title="Concurrent Maintenance of Binary Search Trees"
+,title="Concurrent Manipulation of Binary Search Trees"
 ,Year="1980"
 ,Month="September"
 ,journal="ACM Transactions on Database Systems"


rcu-doc-fix.patch
Description: Binary data


Re: [RFC] cgroup TODOs

2012-09-14 Thread Dhaval Giani
>
>   * Sort & unique when listing tasks.  Even the documentation says it
> doesn't happen but we have a good hunk of code doing it in
> cgroup.c.  I'm gonna rip it out at some point.  Again, if you
> don't like it, scream.
>

I think some userspace tools do assume the uniq bit. So if we can
preserve that, great!
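
(A consumer that wants to keep working either way can dedupe for itself;
the path is a stand-in:)

sort -un /sys/fs/cgroup/cpu/mygroup/tasks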

Thanks
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/5] forced comounts for cgroups.

2012-09-08 Thread Dhaval Giani
On Thu, Sep 6, 2012 at 5:11 PM, Paul Turner  wrote:
> On Thu, Sep 6, 2012 at 1:46 PM, Tejun Heo  wrote:
>> Hello,
>>
>> cc'ing Dhaval and Frederic.  They were interested in the subject
>> before and Dhaval was pretty vocal about cpuacct having a separate
>> hierarchy (or at least granularity).
>
> Really?  Time just has _not_ borne out this use-case.  I'll let Dhaval
> make a case for this but he should expect violent objection.
>

I am not objecting directly! I am aware of a few users who are (or at
least were) using cpu and cpuacct separately because they want to be
able to account without control. Having said that, there are tons of
flaws in the current approach, because the accounting without control
is just plain wrong. I have copied a few other folks who might be able
to shed light on those users and if we should still consider them.
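
(For reference, the pattern those users rely on is v1's split hierarchies,
with accounting mounted apart from control; mount points are stand-ins:)

mkdir -p /cgroup/cpu /cgroup/cpuacct
mount -t cgroup -o cpu none /cgroup/cpu          # control without accounting
mount -t cgroup -o cpuacct none /cgroup/cpuacct  # accounting without control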

[And the fewer the controllers, the better!]

Thanks!
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] sched: revert load_balance_monitor()

2008-02-25 Thread Dhaval Giani
On Mon, Feb 25, 2008 at 03:29:59PM +0100, Mike Galbraith wrote:
> 
> On Mon, 2008-02-25 at 13:22 +0100, Peter Zijlstra wrote:
> > Subject: sched: revert load_balance_monitor()
> > 
> > The following commit causes a number of serious regressions:
> > 
> >   commit 6b2d7700266b9402e12824e11e0099ae6a4a6a79
> >   Author: Srivatsa Vaddagiri <[EMAIL PROTECTED]>
> >   Date:   Fri Jan 25 21:08:00 2008 +0100
> >   sched: group scheduler, fix fairness of cpu bandwidth allocation for task 
> > groups
> > 
> > Namely:
> >  - very frequent wakeups on SMP, reported by PowerTop users.
> >  - cacheline thrashing on (large) SMP
> >  - some latencies larger than 500ms
> > 
> > While there is a mergeable patch to fix the latter, the former issues
> > are IMHO not fixable in a manner suitable for .25 (we're at -rc3 now).
> > Hence I propose to revert this patch and try again for .26.
> > 
> > ( minimal revert - leaves most of the code present, just removes the 
> > activation
> >   and sysctl interface ).
> 
> top - 14:05:56 up 3 min, 16 users,  load average: 4.31, 2.14, 0.85
> Tasks: 218 total,   5 running, 213 sleeping,   0 stopped,   0 zombie
> Cpu(s): 35.5%us, 64.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  P COMMAND
>  5294 mikeg 20   0  1464  364  304 R   99  0.0   1:00.08 0 chew-max
>  5278 root  20   0  1464  364  304 R   32  0.0   0:27.86 1 chew-max
>  5279 root  20   0  1464  360  304 R   32  0.0   0:35.53 1 chew-max
>  5290 root  20   0  1464  364  304 R   31  0.0   0:29.00 1 chew-max
> 
> The minimal revert seems to leave group fairness in a worse state than
> what the original patch meant to fix.  Maybe a full revert would be
> better?
> 

This is funny. The load_balance_monitor thread should not even start. Did
the full revert that I sent you some time back work better?

Thanks,
-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels

2008-02-25 Thread Dhaval Giani
This patch makes the group scheduler multi-hierarchy aware.

Signed-off-by: Dhaval Giani <[EMAIL PROTECTED]>

---
 include/linux/sched.h |2 +-
 kernel/sched.c|   41 -
 2 files changed, 25 insertions(+), 18 deletions(-)

Index: linux-2.6.25-rc2/include/linux/sched.h
===
--- linux-2.6.25-rc2.orig/include/linux/sched.h
+++ linux-2.6.25-rc2/include/linux/sched.h
@@ -2031,7 +2031,7 @@ extern void normalize_rt_tasks(void);
 
 extern struct task_group init_task_group;
 
-extern struct task_group *sched_create_group(void);
+extern struct task_group *sched_create_group(struct task_group *parent);
 extern void sched_destroy_group(struct task_group *tg);
 extern void sched_move_task(struct task_struct *tsk);
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: linux-2.6.25-rc2/kernel/sched.c
===
--- linux-2.6.25-rc2.orig/kernel/sched.c
+++ linux-2.6.25-rc2/kernel/sched.c
@@ -7155,10 +7155,11 @@ static void init_rt_rq(struct rt_rq *rt_
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void init_tg_cfs_entry(struct rq *rq, struct task_group *tg,
-   struct cfs_rq *cfs_rq, struct sched_entity *se,
-   int cpu, int add)
+static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
+   struct sched_entity *se, int cpu, int add,
+   struct sched_entity *parent)
 {
+   struct rq *rq = cpu_rq(cpu);
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
cfs_rq->tg = tg;
@@ -7170,7 +7171,11 @@ static void init_tg_cfs_entry(struct rq 
if (!se)
return;
 
-   se->cfs_rq = &rq->cfs;
+   if (parent == NULL)
+   se->cfs_rq = &rq->cfs;
+   else
+   se->cfs_rq = parent->my_q;
+
se->my_q = cfs_rq;
se->load.weight = tg->shares;
se->load.inv_weight = div64_64(1ULL<<32, se->load.weight);
@@ -7244,7 +7249,8 @@ void __init sched_init(void)
 * We achieve this by letting init_task_group's tasks sit
 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 */
-   init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+   init_tg_cfs_entry(&init_task_group, &rq->cfs,
+   NULL, i, 1, NULL);
init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
 #elif defined CONFIG_USER_SCHED
/*
@@ -7260,7 +7266,7 @@ void __init sched_init(void)
 */
init_tg_cfs_entry(rq, &init_task_group,
&per_cpu(init_cfs_rq, i),
-   &per_cpu(init_sched_entity, i), i, 1);
+   &per_cpu(init_sched_entity, i), i, 1, NULL);
 
 #endif
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -7630,7 +7636,8 @@ static void free_fair_sched_group(struct
kfree(tg->se);
 }
 
-static int alloc_fair_sched_group(struct task_group *tg)
+static int alloc_fair_sched_group(struct task_group *tg,
+   struct task_group *parent)
 {
struct cfs_rq *cfs_rq;
struct sched_entity *se;
@@ -7658,8 +7665,11 @@ static int alloc_fair_sched_group(struct
GFP_KERNEL|__GFP_ZERO, cpu_to_node(i));
if (!se)
goto err;
-
-   init_tg_cfs_entry(rq, tg, cfs_rq, se, i, 0);
+   if (parent) {
+   init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent->se[i]);
+   } else {
+   init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
+   }
}
 
return 1;
@@ -7788,7 +7798,7 @@ static void free_sched_group(struct task
 }
 
 /* allocate runqueue etc for a new task group */
-struct task_group *sched_create_group(void)
+struct task_group *sched_create_group(struct task_group *parent)
 {
struct task_group *tg;
unsigned long flags;
@@ -7798,7 +7808,7 @@ struct task_group *sched_create_group(vo
if (!tg)
return ERR_PTR(-ENOMEM);
 
-   if (!alloc_fair_sched_group(tg))
+   if (!alloc_fair_sched_group(tg, parent))
goto err;
 
if (!alloc_rt_sched_group(tg))
@@ -8049,7 +8059,7 @@ static inline struct task_group *cgroup_
 static struct cgroup_subsys_state *
 cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
-   struct task_group *tg;
+   struct task_group *tg, *parent;
 
if (!cgrp->parent) {
/* This is early initialization for the top cgroup */
@@ -8057,11 +8067,8 @@ cpu_cgroup_create(struct cgroup_subsys *
return &init_task_group.css;
}
 
-   /* we support only 1-level deep hierarchical scheduler atm */
-   if (cgrp->parent->parent)
-   return ERR_PTR(-EINVAL);

Re: [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels

2008-02-25 Thread Dhaval Giani
Meant 2/2 in $subject.
-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler

2008-02-25 Thread Dhaval Giani
This patch allows tasks and groups to exist in the same cfs_rq. With this
change, CFS group scheduling moves from a 1/(1+N) fairness model to a
1/(M+N) model, where M tasks and N groups exist at the cfs_rq level.
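
To make the new model concrete (the numbers are illustrative): with M = 2
tasks of weight 1024 each and N = 1 group of weight 1024 on one cfs_rq, each
entity now receives 1024 / (2*1024 + 1024) = 1/3 of the CPU. Under the old
1/(1+N) model, the same two tasks shared a single 1/(1+1) = 1/2 slot (1/4
each) while the group alone received 1/2.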

Signed-off-by: Dhaval Giani <[EMAIL PROTECTED]>
Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>
---
 kernel/sched.c  |   46 +
 kernel/sched_fair.c |  113 +---
 2 files changed, 137 insertions(+), 22 deletions(-)

Index: linux-2.6.25-rc2/kernel/sched.c
===
--- linux-2.6.25-rc2.orig/kernel/sched.c
+++ linux-2.6.25-rc2/kernel/sched.c
@@ -224,10 +224,13 @@ struct task_group {
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+
+#ifdef CONFIG_USER_SCHED
 /* Default task group's sched entity on each cpu */
 static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
 /* Default task group's cfs_rq on each cpu */
 static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) cacheline_aligned_in_smp;
+#endif
 
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
@@ -7163,6 +7166,10 @@ static void init_tg_cfs_entry(struct rq 
list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
 
tg->se[cpu] = se;
+   /* se could be NULL for init_task_group */
+   if (!se)
+   return;
+
se->cfs_rq = &rq->cfs;
se->my_q = cfs_rq;
se->load.weight = tg->shares;
@@ -7217,11 +7224,46 @@ void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+#ifdef CONFIG_CGROUP_SCHED
+   /*
+* How much cpu bandwidth does init_task_group get?
+*
+* In case of task-groups formed thr' the cgroup filesystem, it
+* gets 100% of the cpu resources in the system. This overall
+* system cpu resource is divided among the tasks of
+* init_task_group and its child task-groups in a fair manner,
+* based on each entity's (task or task-group's) weight
+* (se->load.weight).
+*
+* In other words, if init_task_group has 10 tasks of weight
+* 1024) and two child groups A0 and A1 (of weight 1024 each),
+* then A0's share of the cpu resource is:
+*
+*  A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
+*
+* We achieve this by letting init_task_group's tasks sit
+* directly in rq->cfs (i.e init_task_group->se[] = NULL).
+*/
+   init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+   init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
+#elif defined CONFIG_USER_SCHED
+   /*
+* In case of task-groups formed thr' the user id of tasks,
+* init_task_group represents tasks belonging to root user.
+* Hence it forms a sibling of all subsequent groups formed.
+* In this case, init_task_group gets only a fraction of overall
+* system cpu resource, based on the weight assigned to root
+* user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
+* by letting tasks of init_task_group sit in a separate cfs_rq
+* (init_cfs_rq) and having one entity represent this group of
+* tasks in rq->cfs (i.e init_task_group->se[] != NULL).
+*/
init_tg_cfs_entry(rq, &init_task_group,
&per_cpu(init_cfs_rq, i),
&per_cpu(init_sched_entity, i), i, 1);
 
 #endif
+#endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
init_task_group.rt_runtime =
sysctl_sched_rt_runtime * NSEC_PER_USEC;
@@ -7435,6 +7477,10 @@ static int rebalance_shares(struct sched
unsigned long total_load = 0, total_shares;
struct task_group *tg = cfs_rq->tg;
 
+   /* Skip this group if there is no associated group entity */
+   if (unlikely(!tg->se[this_cpu]))
+   continue;
+
/* Gather total task load of this group across cpus */
for_each_cpu_mask(i, sdspan)
total_load += tg->cfs_rq[i]->load.weight;
Index: linux-2.6.25-rc2/kernel/sched_fair.c
===
--- linux-2.6.25-rc2.orig/kernel/sched_fair.c
+++ linux-2.6.25-rc2/kernel/sched_fair.c
@@ -732,6 +732,21 @@ static inline struct sched_entity *paren
return se->parent;
 }
 
+/* return the cpu load contributed by a given group on a given cpu */
+static inline unsigned long group_cpu_load(struct

[RFC, PATCH 0/2] sched: add multiple hierarchy support to the CFS group scheduler

2008-02-25 Thread Dhaval Giani
Hi Ingo,

These patches change the fairness model as discussed in
http://lkml.org/lkml/2008/1/30/634

Patch 1 -> Changes the fairness model
Patch 2 -> Allows one to create multiple levels of cgroups

The second patch is not very good with SMP yet, that is the next TODO.
Also it changes the behaviour of fair user. The root task group is the
parent task group and the other users are its children.

Thanks,
-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ftrace causing panics.

2008-02-20 Thread Dhaval Giani
On Wed, Feb 20, 2008 at 10:02:18AM -0500, Steven Rostedt wrote:
> Dhaval Giani wrote:
>> Hi Ingo,
>>
>> ftrace-cmd with the -w option, when run for some time, causes this.
>>
>>
>> llm11.in.ibm.com login: [ 1002.937490] BUG: unable to handle kernel paging 
>> request at 285b0010
>> [ 1002.947087] IP: [<c015f7b5>] find_next_entry+0x4f/0x84
>>
>
> Dhaval,
>
> First, thanks for testing
>

If it helps solve difficult problems, it's the best tool ever! :)

> Are you running the -mm kernel or sched-devel?  This will let me know which 
> version you have.  I'm working on a queue of fixes for Ingo now, to 
> incorporate into sched-devel (and later pass to Andrew for -mm).  I'm not 
> sure if the new fixes will help you, but we need to get in sync, so that we 
> are both looking at the same version of the code.
>

sched-devel as of yesterday. (I don't think anything new has gone in
today).

[sorry, not had enough time to get to the bottom of this the last few
days]

-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



ftrace causing panics.

2008-02-19 Thread Dhaval Giani
Hi Ingo,

ftrace-cmd with the -w option, when run for some time, causes this.


llm11.in.ibm.com login: [ 1002.937490] BUG: unable to handle kernel paging 
request at 285b0010
[ 1002.947087] IP: [<c015f7b5>] find_next_entry+0x4f/0x84
[ 1002.955091] *pdpt = 2d589001 *pde =  
[ 1002.963651] Oops:  [#1] SMP 
[ 1002.967082] Modules linked in:
[ 1002.967082] 
[ 1002.967082] Pid: 16350, comm: cat Not tainted (2.6.25-rc2-sched-devel #9)
[ 1002.967082] EIP: 0060:[<c015f7b5>] EFLAGS: 00010206 CPU: 0
[ 1002.967082] EIP is at find_next_entry+0x4f/0x84
[ 1002.967082] EAX: f6db2c60 EBX: 0001 ECX: 0001 EDX: f6db2c60
[ 1002.967082] ESI: 285b EDI: c0850e00 EBP: eed23f04 ESP: eed23eec
[ 1002.967082]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 1002.967082] Process cat (pid: 16350, ti=eed22000 task=f0164620 
task.ti=eed22000)
[ 1002.967082] Stack:  eed23f0c f61ba550 f61ba550 f61ba550 eed23f54 
eed23f18 c015f806 
[ 1002.967082] f61ba550 f61ba550 eed23f38 c015f8ae 7334 
f6dccfc0 f61ba5e8 
[ 1002.967082]c0582970 f61ba5e8 f61ba550 eed23f70 c01998a8 0f96 
 1000 
[ 1002.967082] Call Trace:
[ 1002.967082]  [<c015f806>] ? find_next_entry_inc+0x1c/0x80
[ 1002.967082]  [<c015f8ae>] ? s_next+0x44/0x7e
[ 1002.967082]  [<c01998a8>] ? seq_read+0x176/0x252
[ 1002.967082]  [<c0182be6>] ? vfs_read+0x90/0x108
[ 1002.967082]  [<c0182ea2>] ? sys_read+0x40/0x65
[ 1002.967082]  [<c0105a5a>] ? sysenter_past_esp+0x5f/0x99
[ 1002.967082]  ===
[ 1002.967082] Code: e8 2d b2 0e 00 83 f8 07 89 c3 7f 3c 8b 54 87 14 83 3a 00 
74 25 50 8b 4d f0 89 f8 e8 41 ff ff ff 59 85 c0 89 c2 74 13 85 f6 74 0a <8b> 46 
10 2b 42 10 8 
[ 1002.967082] EIP: [<c015f7b5>] find_next_entry+0x4f/0x84 SS:ESP 0068:eed23eec
[ 1002.967200] Kernel panic - not syncing: Fatal exception

I can send you the complete dmesg off-list, as it is mostly trace data. .config
is the same as before (except CONFIG_SMP is now on)

-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ftrace and kexec

2008-02-19 Thread Dhaval Giani
On Tue, Feb 19, 2008 at 03:22:39PM +0100, Ingo Molnar wrote:
> 
> * Dhaval Giani <[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > 
> > I've been running ftrace on the sched-devel tree. I just built a 
> > kernel and tried rebooting using kexec and I get this,
> 
> hm, it's not a good idea to keep using the data structures of the tracer 
> while we kexec. Does the patch below resolve the problem?
> 

It boots, but I get this now


[0.296073] [ cut here ]
[0.300018] WARNING: at kernel/lockdep.c:2689 check_flags+0xf6/0x10b()
[0.300018] Modules linked in:
[0.300018] Pid: 1, comm: swapper Not tainted 2.6.25-rc2-sched-devel #8
[0.300018]  [<c012089c>] warn_on_slowpath+0x46/0x60
[0.300018]  [<c0138ca7>] ? very_verbose+0x8/0xc
[0.300018]  [<c0139a7e>] ? check_chain_key+0xe/0x1a3
[0.300018]  [<c0138ca7>] ? very_verbose+0x8/0xc
[0.300018]  [<c0138ca7>] ? very_verbose+0x8/0xc
[0.300018]  [<c0139a7e>] ? check_chain_key+0xe/0x1a3
[0.300018]  [<c0104800>] ? mcount_call+0x5/0x9
[0.300018]  [<c0139a7e>] ? check_chain_key+0xe/0x1a3
[0.300019]  [<c013b2f5>] ? __lock_acquire+0x614/0x668
[0.300019]  [<c0139a7e>] ? check_chain_key+0xe/0x1a3
[0.300019]  [<c01541c6>] ? ftrace_record_ip+0x124/0x130
[0.300019]  [<c0104800>] ? mcount_call+0x5/0x9
[0.300019]  [<c0242f58>] ? debug_locks_off+0x8/0x40
[0.300019]  [<c0104800>] ? mcount_call+0x5/0x9
[0.300019]  [<c013b78c>] check_flags+0xf6/0x10b
[0.300019]  [<c013b7d5>] lock_acquire+0x34/0x80
[0.300019]  [<c040e2eb>] _spin_lock_irqsave+0x27/0x37
[0.300019]  [<c015414e>] ? ftrace_record_ip+0xac/0x130
[0.300019]  [<c0124dc4>] ? do_softirq+0x2f/0x47
[0.300019]  [<c015414e>] ftrace_record_ip+0xac/0x130
[0.300019]  [<c013a581>] ? trace_softirqs_off+0x8/0xaa
[0.300019]  [<c0124dc4>] ? do_softirq+0x2f/0x47
[0.300019]  [<c0104800>] mcount_call+0x5/0x9
[0.300019]  [<c0124dc4>] ? do_softirq+0x2f/0x47
[0.300019]  [<c013a581>] ? trace_softirqs_off+0x8/0xaa
[0.300019]  [<c01249b1>] __local_bh_disable+0x7d/0x83
[0.300019]  [<c0124d1c>] __do_softirq+0x1e/0x97
[0.300019]  [<c0124dc4>] do_softirq+0x2f/0x47
[0.300019]  [<c0124e3b>] irq_exit+0x3c/0x3e
[0.300019]  [<c011087b>] smp_apic_timer_interrupt+0x32/0x3b
[0.300019]  [<c0105375>] apic_timer_interrupt+0x2d/0x34
[0.300019]  [<c012153a>] ? release_console_sem+0xc0/0xda
[0.300019]  [<c012>] ? sys_unshare+0x9c/0x2a8
[0.300019]  [<c01211ee>] ? vprintk+0x24b/0x256
[0.300019]  [<c013a455>] ? trace_hardirqs_on+0xb/0xd
[0.300019]  [<c01541c6>] ? ftrace_record_ip+0x124/0x130
[0.300019]  [<c0120f95>] ? printk+0x8/0x16
[0.300019]  [<c0120fa1>] printk+0x14/0x16
[0.300019]  [<c03a7b85>] sock_register+0x61/0x6a
[0.300019]  [<c05daefd>] netlink_proto_init+0xf4/0x11a
[0.300019]  [<c05bca6c>] ? kernel_init+0x0/0x6c
[0.300019]  [<c05bc8fb>] do_initcalls+0x7a/0x192
[0.300019]  [<c01a7627>] ? create_proc_entry+0x6c/0x80
[0.300019]  [<c0104800>] ? mcount_call+0x5/0x9
[0.300019]  [<c01511b0>] ? register_irq_proc+0xe/0x8b
[0.300019]  [<c01a>] ? load_elf_binary+0x818/0x9e0
[0.300019]  [<c05bca6c>] ? kernel_init+0x0/0x6c
[0.300019]  [<c05bca34>] do_basic_setup+0x21/0x23
[0.300019]  [<c05bca9d>] kernel_init+0x31/0x6c
[0.300019]  [<c01054cf>] kernel_thread_helper+0x7/0x10
[0.300019]  ===
[0.300019] ---[ end trace ca143223eefdc828 ]---
[0.300019] irq event stamp: 1894
[0.300019] hardirqs last  enabled at (1893): [<c013a455>] 
trace_hardirqs_on+0xb/0xd
[0.300019] hardirqs last disabled at (1894): [<c013a4e6>] 
trace_hardirqs_off+0xb/0xd
[0.300019] softirqs last  enabled at (1184): [<c0124d90>] 
__do_softirq+0x92/0x97
[0.300019] softirqs last disabled at (1175): [<c0124dc4>] 
do_softirq+0x2f/0x47

-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


ftrace and kexec

2008-02-19 Thread Dhaval Giani
Hi,

I've been running ftrace on the sched-devel tree. I just built a kernel
and tried rebooting using kexec and I get this,

Please stand by while rebooting the system...
[11756.528997] Starting new kernel
[11741.142898] BUG: unable to handle kernel paging request at 8d2ed42c
[11741.142898] IP: [<c015e1a9>] ftrace_record_ip+0x2b/0x14f
[11741.142898] *pdpt = 29829001 *pde =  
[11741.142898] Oops: 0002 [#1] SMP 
[11741.142898] Modules linked in:
[11741.142898] 
[11741.142898] Pid: 16765, comm: kexec Not tainted (2.6.25-rc2-sched-devel #5)
[11741.142898] EIP: 0060:[<c015e1a9>] EFLAGS: 00010002 CPU: 0
[11741.142898] EIP is at ftrace_record_ip+0x2b/0x14f
[11741.142898] EAX: c0620760 EBX: f68a3470 ECX:  EDX: 
[11741.142898] ESI: c0117000 EDI: 238ef000 EBP: eb02be20 ESP: eb02be0c
[11741.142898]  DS: 0068 ES: 0068 FS: 0068 GS: 0068 SS: 0068
[11741.142898] Process kexec (pid: 16765, ti=eb02a000 task=f6ea0520 
task.ti=eb02a000)
[11741.142898] Stack: c0620760 c011541b f68a3470 c0117000 238ef000 eb02be48 
c0105980  
[11741.142898] c000 c011541b   e38ef000 
c0115430 eb02be90 
[11741.142898]c01154f1 c0620760 238ef000 c0116000 00634000 c0634000 
00637000 c0637000 
[11741.142898] Call Trace:
[11741.142898]  [<c011541b>] ? set_gdt+0xb/0x18
[11741.142898]  [<c0117000>] ? handle_vm86_fault+0x2ce/0x75f
[11741.142898]  [<c0105980>] ? mcount_call+0x5/0x9
[11741.142898]  [<c011541b>] ? set_gdt+0xb/0x18
[11741.142898]  [<c0115430>] ? load_segments+0x8/0x20
[11741.142898]  [<c01154f1>] ? machine_kexec+0x93/0xaf
[11741.142898]  [<c0116000>] ? relocate_kernel+0x0/0x94
[11741.142898]  [<c01345bf>] ? kernel_kexec+0x32/0x37
[11741.142898]  [<c0134793>] ? sys_reboot+0x13f/0x15e
[11741.142898]  [<c0161341>] ? unlock_page+0x2a/0x2d
[11741.142898]  [<c016fbe5>] ? __do_fault+0x33e/0x37e
[11741.142898]  [<c017d776>] ? poison_obj+0x23/0x40
[11741.142898]  [<c016fc67>] ? do_linear_fault+0x42/0x49
[11741.142898]  [<c016ff74>] ? handle_mm_fault+0x142/0x29c
[11741.142898]  [<c04270bc>] ? do_page_fault+0x20a/0x453
[11741.142898]  [<c0144453>] ? __lock_release+0x23/0x56
[11741.142898]  [<c04270bc>] ? do_page_fault+0x20a/0x453
[11741.142898]  [<c013c905>] ? up_read+0x1b/0x2e
[11741.142898]  [<c04270bc>] ? do_page_fault+0x20a/0x453
[11741.142898]  [<c024fafc>] ? trace_hardirqs_on_thunk+0xc/0x10
[11741.142898]  [<c0105a5a>] ? sysenter_past_esp+0x5f/0x99
[11741.142898]  ===
[11741.142898] Code: 55 89 e5 57 56 53 51 51 83 3d 80 fd 84 c0 00 89 45 f0 0f 
84 30 01 00 00 c7 45 ec 60 07 62 c0 64 8b 15 10 f1 61 c0 b8 60 07 62 c0 <ff> 04 
10 64 8b 15 1 
[11741.142898] EIP: [<c015e1a9>] ftrace_record_ip+0x2b/0x14f SS:ESP 
0068:eb02be0c
[11741.142898] Kernel panic - not syncing: Fatal exception

.config is here.

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc2
# Tue Feb 19 13:59:40 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set

ftrace and kexec

2008-02-19 Thread Dhaval Giani
Hi,

I've been running ftrace on the sched-devel tree. I just built a kernel
and tried rebooting using kexec and I get this,

Please stand by while rebooting the system...
[11756.528997] Starting new kernel
[11741.142898] BUG: unable to handle kernel paging request at 8d2ed42c
[11741.142898] IP: [c015e1a9] ftrace_record_ip+0x2b/0x14f
[11741.142898] *pdpt = 29829001 *pde =  
[11741.142898] Oops: 0002 [#1] SMP 
[11741.142898] Modules linked in:
[11741.142898] 
[11741.142898] Pid: 16765, comm: kexec Not tainted (2.6.25-rc2-sched-devel #5)
[11741.142898] EIP: 0060:[c015e1a9] EFLAGS: 00010002 CPU: 0
[11741.142898] EIP is at ftrace_record_ip+0x2b/0x14f
[11741.142898] EAX: c0620760 EBX: f68a3470 ECX:  EDX: 
[11741.142898] ESI: c0117000 EDI: 238ef000 EBP: eb02be20 ESP: eb02be0c
[11741.142898]  DS: 0068 ES: 0068 FS: 0068 GS: 0068 SS: 0068
[11741.142898] Process kexec (pid: 16765, ti=eb02a000 task=f6ea0520 
task.ti=eb02a000)
[11741.142898] Stack: c0620760 c011541b f68a3470 c0117000 238ef000 eb02be48 
c0105980  
[11741.142898] c000 c011541b   e38ef000 
c0115430 eb02be90 
[11741.142898]c01154f1 c0620760 238ef000 c0116000 00634000 c0634000 
00637000 c0637000 
[11741.142898] Call Trace:
[11741.142898]  [c011541b] ? set_gdt+0xb/0x18
[11741.142898]  [c0117000] ? handle_vm86_fault+0x2ce/0x75f
[11741.142898]  [c0105980] ? mcount_call+0x5/0x9
[11741.142898]  [c011541b] ? set_gdt+0xb/0x18
[11741.142898]  [c0115430] ? load_segments+0x8/0x20
[11741.142898]  [c01154f1] ? machine_kexec+0x93/0xaf
[11741.142898]  [c0116000] ? relocate_kernel+0x0/0x94
[11741.142898]  [c01345bf] ? kernel_kexec+0x32/0x37
[11741.142898]  [c0134793] ? sys_reboot+0x13f/0x15e
[11741.142898]  [c0161341] ? unlock_page+0x2a/0x2d
[11741.142898]  [c016fbe5] ? __do_fault+0x33e/0x37e
[11741.142898]  [c017d776] ? poison_obj+0x23/0x40
[11741.142898]  [c016fc67] ? do_linear_fault+0x42/0x49
[11741.142898]  [c016ff74] ? handle_mm_fault+0x142/0x29c
[11741.142898]  [c04270bc] ? do_page_fault+0x20a/0x453
[11741.142898]  [c0144453] ? __lock_release+0x23/0x56
[11741.142898]  [c04270bc] ? do_page_fault+0x20a/0x453
[11741.142898]  [c013c905] ? up_read+0x1b/0x2e
[11741.142898]  [c04270bc] ? do_page_fault+0x20a/0x453
[11741.142898]  [c024fafc] ? trace_hardirqs_on_thunk+0xc/0x10
[11741.142898]  [c0105a5a] ? sysenter_past_esp+0x5f/0x99
[11741.142898]  ===
[11741.142898] Code: 55 89 e5 57 56 53 51 51 83 3d 80 fd 84 c0 00 89 45 f0 0f 
84 30 01 00 00 c7 45 ec 60 07 62 c0 64 8b 15 10 f1 61 c0 b8 60 07 62 c0 ff 04 
10 64 8b 15 1 
[11741.142898] EIP: [c015e1a9] ftrace_record_ip+0x2b/0x14f SS:ESP 
0068:eb02be0c
[11741.142898] Kernel panic - not syncing: Fatal exception

.config is here.

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc2
# Tue Feb 19 13:59:40 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# 

Re: ftrace and kexec

2008-02-19 Thread Dhaval Giani
On Tue, Feb 19, 2008 at 03:22:39PM +0100, Ingo Molnar wrote:
 
 * Dhaval Giani [EMAIL PROTECTED] wrote:
 
  Hi,
  
  I've been running ftrace on the sched-devel tree. I just built a 
  kernel and tried rebooting using kexec and I get this,
 
 hm, it's not a good idea to keep using the data structures of the tracer 
 while we kexec. Does the patch below resolve the problem?
 

It boots, but I get this now


[0.296073] [ cut here ]
[0.300018] WARNING: at kernel/lockdep.c:2689 check_flags+0xf6/0x10b()
[0.300018] Modules linked in:
[0.300018] Pid: 1, comm: swapper Not tainted 2.6.25-rc2-sched-devel #8
[0.300018]  [c012089c] warn_on_slowpath+0x46/0x60
[0.300018]  [c0138ca7] ? very_verbose+0x8/0xc
[0.300018]  [c0139a7e] ? check_chain_key+0xe/0x1a3
[0.300018]  [c0138ca7] ? very_verbose+0x8/0xc
[0.300018]  [c0138ca7] ? very_verbose+0x8/0xc
[0.300018]  [c0139a7e] ? check_chain_key+0xe/0x1a3
[0.300018]  [c0104800] ? mcount_call+0x5/0x9
[0.300018]  [c0139a7e] ? check_chain_key+0xe/0x1a3
[0.300019]  [c013b2f5] ? __lock_acquire+0x614/0x668
[0.300019]  [c0139a7e] ? check_chain_key+0xe/0x1a3
[0.300019]  [c01541c6] ? ftrace_record_ip+0x124/0x130
[0.300019]  [c0104800] ? mcount_call+0x5/0x9
[0.300019]  [c0242f58] ? debug_locks_off+0x8/0x40
[0.300019]  [c0104800] ? mcount_call+0x5/0x9
[0.300019]  [c013b78c] check_flags+0xf6/0x10b
[0.300019]  [c013b7d5] lock_acquire+0x34/0x80
[0.300019]  [c040e2eb] _spin_lock_irqsave+0x27/0x37
[0.300019]  [c015414e] ? ftrace_record_ip+0xac/0x130
[0.300019]  [c0124dc4] ? do_softirq+0x2f/0x47
[0.300019]  [c015414e] ftrace_record_ip+0xac/0x130
[0.300019]  [c013a581] ? trace_softirqs_off+0x8/0xaa
[0.300019]  [c0124dc4] ? do_softirq+0x2f/0x47
[0.300019]  [c0104800] mcount_call+0x5/0x9
[0.300019]  [c0124dc4] ? do_softirq+0x2f/0x47
[0.300019]  [c013a581] ? trace_softirqs_off+0x8/0xaa
[0.300019]  [c01249b1] __local_bh_disable+0x7d/0x83
[0.300019]  [c0124d1c] __do_softirq+0x1e/0x97
[0.300019]  [c0124dc4] do_softirq+0x2f/0x47
[0.300019]  [c0124e3b] irq_exit+0x3c/0x3e
[0.300019]  [c011087b] smp_apic_timer_interrupt+0x32/0x3b
[0.300019]  [c0105375] apic_timer_interrupt+0x2d/0x34
[0.300019]  [c012153a] ? release_console_sem+0xc0/0xda
[0.300019]  [c012] ? sys_unshare+0x9c/0x2a8
[0.300019]  [c01211ee] ? vprintk+0x24b/0x256
[0.300019]  [c013a455] ? trace_hardirqs_on+0xb/0xd
[0.300019]  [c01541c6] ? ftrace_record_ip+0x124/0x130
[0.300019]  [c0120f95] ? printk+0x8/0x16
[0.300019]  [c0120fa1] printk+0x14/0x16
[0.300019]  [c03a7b85] sock_register+0x61/0x6a
[0.300019]  [c05daefd] netlink_proto_init+0xf4/0x11a
[0.300019]  [c05bca6c] ? kernel_init+0x0/0x6c
[0.300019]  [c05bc8fb] do_initcalls+0x7a/0x192
[0.300019]  [c01a7627] ? create_proc_entry+0x6c/0x80
[0.300019]  [c0104800] ? mcount_call+0x5/0x9
[0.300019]  [c01511b0] ? register_irq_proc+0xe/0x8b
[0.300019]  [c01a] ? load_elf_binary+0x818/0x9e0
[0.300019]  [c05bca6c] ? kernel_init+0x0/0x6c
[0.300019]  [c05bca34] do_basic_setup+0x21/0x23
[0.300019]  [c05bca9d] kernel_init+0x31/0x6c
[0.300019]  [c01054cf] kernel_thread_helper+0x7/0x10
[0.300019]  ===
[0.300019] ---[ end trace ca143223eefdc828 ]---
[0.300019] irq event stamp: 1894
[0.300019] hardirqs last  enabled at (1893): [c013a455] 
trace_hardirqs_on+0xb/0xd
[0.300019] hardirqs last disabled at (1894): [c013a4e6] 
trace_hardirqs_off+0xb/0xd
[0.300019] softirqs last  enabled at (1184): [c0124d90] 
__do_softirq+0x92/0x97
[0.300019] softirqs last disabled at (1175): [c0124dc4] 
do_softirq+0x2f/0x47

-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


ftrace causing panics.

2008-02-19 Thread Dhaval Giani
Hi Ingo,

ftrace-cmd, when run with the -w option for some time, causes this.


llm11.in.ibm.com login: [ 1002.937490] BUG: unable to handle kernel paging 
request at 285b0010
[ 1002.947087] IP: [c015f7b5] find_next_entry+0x4f/0x84
[ 1002.955091] *pdpt = 2d589001 *pde =  
[ 1002.963651] Oops:  [#1] SMP 
[ 1002.967082] Modules linked in:
[ 1002.967082] 
[ 1002.967082] Pid: 16350, comm: cat Not tainted (2.6.25-rc2-sched-devel #9)
[ 1002.967082] EIP: 0060:[c015f7b5] EFLAGS: 00010206 CPU: 0
[ 1002.967082] EIP is at find_next_entry+0x4f/0x84
[ 1002.967082] EAX: f6db2c60 EBX: 0001 ECX: 0001 EDX: f6db2c60
[ 1002.967082] ESI: 285b EDI: c0850e00 EBP: eed23f04 ESP: eed23eec
[ 1002.967082]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 1002.967082] Process cat (pid: 16350, ti=eed22000 task=f0164620 
task.ti=eed22000)
[ 1002.967082] Stack:  eed23f0c f61ba550 f61ba550 f61ba550 eed23f54 
eed23f18 c015f806 
[ 1002.967082] f61ba550 f61ba550 eed23f38 c015f8ae 7334 
f6dccfc0 f61ba5e8 
[ 1002.967082]c0582970 f61ba5e8 f61ba550 eed23f70 c01998a8 0f96 
 1000 
[ 1002.967082] Call Trace:
[ 1002.967082]  [c015f806] ? find_next_entry_inc+0x1c/0x80
[ 1002.967082]  [c015f8ae] ? s_next+0x44/0x7e
[ 1002.967082]  [c01998a8] ? seq_read+0x176/0x252
[ 1002.967082]  [c0182be6] ? vfs_read+0x90/0x108
[ 1002.967082]  [c0182ea2] ? sys_read+0x40/0x65
[ 1002.967082]  [c0105a5a] ? sysenter_past_esp+0x5f/0x99
[ 1002.967082]  ===
[ 1002.967082] Code: e8 2d b2 0e 00 83 f8 07 89 c3 7f 3c 8b 54 87 14 83 3a 00 
74 25 50 8b 4d f0 89 f8 e8 41 ff ff ff 59 85 c0 89 c2 74 13 85 f6 74 0a 8b 46 
10 2b 42 10 8 
[ 1002.967082] EIP: [c015f7b5] find_next_entry+0x4f/0x84 SS:ESP 0068:eed23eec
[ 1002.967200] Kernel panic - not syncing: Fatal exception

I can send you the complete dmesg offlist, as it is only trace data. The .config
is the same as before (except CONFIG_SMP is now on).

-- 
regards,
Dhaval


Re: sched-devel latencies

2008-02-18 Thread Dhaval Giani
On Mon, Feb 18, 2008 at 04:19:33PM +0530, Dhaval Giani wrote:
> Hi Ingo,
> 
> I am running the sched-devel tree (at HEAD
> 44e770a8750abc7e876076cda718b413bad9e654) and it is not looking good.
> 
> I am running two "make -j"s for the kernel in two different cgroups and
> interactivity is going for a toss. I can see noticable lags in
> keypresses.
> 
> Will get down to debugging it further a bit later on.
> 

Some more numbers, with the exact scenario:

1. Mount the cgroup
2. Make 3 groups
3. Start kernbench in each group
4. Start chew.

This is the chew output from the root cgroup:

[EMAIL PROTECTED] dhaval]# ./chew2 
pid 29345 preempted 544115 us after 1560 us
pid 29345 preempted 588109 us after 3935 us
pid 29345 preempted 632122 us after 3941 us
pid 29345 preempted 794259 us after 3954 us
pid 29345 preempted 972163 us after 3963 us
pid 29345 preempted 1024219 us after 3942 us

From within one of the groups:
[EMAIL PROTECTED] dhaval]# ./chew2 
pid 27961 preempted 4028 us after 1708 us
pid 27961 preempted 28090 us after 5466 us
pid 27961 preempted 52021 us after 6505 us
pid 27961 preempted 56100 us after 7183 us
pid 27961 preempted 61850 us after 7505 us
pid 27961 preempted 131892 us after 59 us
pid 27961 preempted 332112 us after 7607 us
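
chew2's source is not included in this thread; the gist, judging by its
output, is a busy loop that timestamps continuously and reports any gap
where the task lost the CPU. A minimal sketch along those lines
(assumptions: a 1 ms reporting threshold and CLOCK_MONOTONIC; the real tool
may differ):

/* Minimal sketch of a chew-style latency probe (the real chew2 may
 * differ): busy-loop reading the clock and report gaps where the task
 * was apparently preempted. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long now_us(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

int main(void)
{
        const long long threshold_us = 1000;    /* report gaps > 1 ms */
        long long run_start = now_us(), last = run_start;

        for (;;) {
                long long t = now_us();

                if (t - last > threshold_us) {
                        printf("pid %d preempted %lld us after %lld us\n",
                               getpid(), t - last, last - run_start);
                        run_start = t;  /* a new run begins after the preemption */
                }
                last = t;
        }
}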

-- 
regards,
Dhaval


sched-devel latencies

2008-02-18 Thread Dhaval Giani
Hi Ingo,

I am running the sched-devel tree (at HEAD
44e770a8750abc7e876076cda718b413bad9e654) and it is not looking good.

I am running two "make -j"s for the kernel in two different cgroups and
interactivity is going for a toss. I can see noticeable lags in
keypresses.

Will get down to debugging it further a bit later on.

-- 
regards,
Dhaval


Re: 2.6.25-rc2: Reported regressions from 2.6.24

2008-02-18 Thread Dhaval Giani
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=9982
> Subject   : 2.6.25-rc1 panics on boot
> Submitter : Dhaval Giani <[EMAIL PROTECTED]>
> Date  : 2008-02-13 18:03
> References: http://lkml.org/lkml/2008/2/13/363
> Handled-By: Chris Snook <[EMAIL PROTECTED]>

Hi Rafael,

A fix was proposed and accepted at
http://bugzilla.kernel.org/attachment.cgi?id=14832&action=view .

The bug has been marked as resolved. (You might want to modify your
script to handle such cases.)

-- 
regards,
Dhaval


Re: [RFC][PATCH 0/2] reworking load_balance_monitor

2008-02-18 Thread Dhaval Giani
On Thu, Feb 14, 2008 at 04:57:24PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> Here the current patches that rework load_balance_monitor.
> 
> The main reason for doing this is to eliminate the wakeups the thing 
> generates,
> esp. on an idle system. The bonus is that it removes a kernel thread.
> 

Hi Peter,

The changes look really good to me. I will give it a run in sometime and
give some more feedback.

-- 
regards,
Dhaval


Re: 2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
On Thu, Feb 14, 2008 at 12:06:31PM +0530, Dhaval Giani wrote:
> On Wed, Feb 13, 2008 at 10:32:02PM -0800, Yinghai Lu wrote:
> > On Wed, Feb 13, 2008 at 10:20 PM, Dhaval Giani
> > <[EMAIL PROTECTED]> wrote:
> > > On Wed, Feb 13, 2008 at 01:08:42PM -0500, Chris Snook wrote:
> > >  > Dhaval Giani wrote:
> > >  >> I am getting the following oops on bootup on 2.6.25-rc1
> > >  > ...
> > >  >> I am booting using kexec with maxcpus=1. It does not have any problems
> > >  >> with maxcpus=2 or higher.
> > >  >
> > >  > Sounds like another (the same?) kexec cpu numbering bug.  Can you 
> > > post/link
> > >  > the entire dmesg from both a cold boot and a kexec boot so we can 
> > > compare?
> > >  >
> > >
> > >  Don't think it's a kexec bug. I get the same on cold boot. dmesg from kexec 
> > > boot.
> > 
> > how about without "[EMAIL PROTECTED] nmi_watchdog=2"
> > 
> > also does intel cpu support nmi_watchdog=2?
> > 
> 
> Yes it does. I've used it to get some useful debug information. I will try
> that out.
> 

Panics at same point.

-- 
regards,
Dhaval


Re: 2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 10:32:02PM -0800, Yinghai Lu wrote:
> On Wed, Feb 13, 2008 at 10:20 PM, Dhaval Giani
> <[EMAIL PROTECTED]> wrote:
> > On Wed, Feb 13, 2008 at 01:08:42PM -0500, Chris Snook wrote:
> >  > Dhaval Giani wrote:
> >  >> I am getting the following oops on bootup on 2.6.25-rc1
> >  > ...
> >  >> I am booting using kexec with maxcpus=1. It does not have any problems
> >  >> with maxcpus=2 or higher.
> >  >
> >  > Sounds like another (the same?) kexec cpu numbering bug.  Can you 
> > post/link
> >  > the entire dmesg from both a cold boot and a kexec boot so we can 
> > compare?
> >  >
> >
> >  Don't think it's a kexec bug. I get the same on cold boot. dmesg from kexec 
> > boot.
> 
> how about without "[EMAIL PROTECTED] nmi_watchdog=2"
> 
> also does intel cpu support nmi_watchdog=2?
> 

Yes it does. I've used it to get some useful debug information. I will try
that out.

> YH

-- 
regards,
Dhaval


Re: 2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 01:08:42PM -0500, Chris Snook wrote:
> Dhaval Giani wrote:
>> I am getting the following oops on bootup on 2.6.25-rc1
> ...
>> I am booting using kexec with maxcpus=1. It does not have any problems
>> with maxcpus=2 or higher.
>
> Sounds like another (the same?) kexec cpu numbering bug.  Can you post/link 
> the entire dmesg from both a cold boot and a kexec boot so we can compare?
>

Don't think it's a kexec bug. I get the same on cold boot. dmesg from kexec boot.

[0.00] Linux version 2.6.25-rc1 ([EMAIL PROTECTED]) (gcc version 3.4.4 
20050721 (Red Hat 3.4.4-2)) #5 SMP Thu Feb 14 06:46:02 IST 2008
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820: 0100 - 0009dc00 (usable)
[0.00]  BIOS-e820: 0009dc00 - 000a (reserved)
[0.00]  BIOS-e820: 0010 - e97f5f00 (usable)
[0.00]  BIOS-e820: e97f5f00 - e97ff800 (ACPI data)
[0.00]  BIOS-e820: e97ff800 - e980 (reserved)
[0.00]  BIOS-e820: fec0 - 0001 (reserved)
[0.00]  BIOS-e820: 0001 - 00014000 (usable)
[0.00] 4224MB HIGHMEM available.
[0.00] 896MB LOWMEM available.
[0.00] Scan SMP from c000 for 1024 bytes.
[0.00] Scan SMP from c009fc00 for 1024 bytes.
[0.00] Scan SMP from c00f for 65536 bytes.
[0.00] Scan SMP from c009dc00 for 1024 bytes.
[0.00] found SMP MP-table at [c009dd40] 0009dd40
[0.00] Reserving 64MB of memory at 16MB for crashkernel (System RAM: 
5111MB)
[0.00] Zone PFN ranges:
[0.00]   DMA 0 -> 4096
[0.00]   Normal   4096 ->   229376
[0.00]   HighMem229376 ->  1310720
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[1] active PFN ranges
[0.00] 0:0 ->  1310720
[0.00] DMI 2.3 present.
[0.00] Using APIC driver default
[0.00] ACPI: RSDP 000FDD90, 0014 (r0 IBM   )
[0.00] ACPI: RSDT E97FF780, 0030 (r1 IBMSERONYXP 1000 IBM  
45444F43)
[0.00] ACPI: FACP E97FF700, 0074 (r1 IBMSERONYXP 1000 IBM  
45444F43)
[0.00] ACPI: DSDT E97F5F00, 962E (r1 IBMSERAVATR 1000 MSFT  
10B)
[0.00] ACPI: FACS E97FF5C0, 0040
[0.00] ACPI: APIC E97FF600, 00CA (r1 IBMSERONYXP 1000 IBM  
45444F43)
[0.00] ACPI: ASF! E97FF540, 004B (r16 IBMSERONYXP1 IBM  
45444F43)
[0.00] ACPI: PM-Timer IO Port: 0x488
[0.00] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[0.00] Processor #0 15:2 APIC version 20
[0.00] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled)
[0.00] Processor #2 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x04] enabled)
[0.00] Processor #4 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x06] enabled)
[0.00] Processor #6 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x01] enabled)
[0.00] Processor #1 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x03] enabled)
[0.00] Processor #3 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x05] enabled)
[0.00] Processor #5 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] enabled)
[0.00] Processor #7 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x04] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x06] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x05] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x07] dfl dfl lint[0x1])
[0.00] ACPI: IOAPIC (id[0x0e] address[0xfec0] gsi_base[0])
[0.00] IOAPIC[0]: apic_id 14, version 17, address 0xfec0, GSI 0-15
[0.00] ACPI: IOAPIC (id[0x0d] address[0xfec01000] gsi_base[16])
[0.00] IOAPIC[1]: apic_id 13, version 17, address 0xfec01000, GSI 16-31
[0.00] ACPI: IOAPIC (id[0x0c] address[0xfec02000] gsi_base[32])
[0.00] IOAPIC[2]: apic_

2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
Hi,

I am getting the following oops on bootup on 2.6.25-rc1

[2.376187] BUG: unable to handle kernel NULL pointer dereference at 010c
[2.388180] IP: [] sysfs_remove_link+0x1/0xd
[2.396182] *pdpt = 005fd001 *pde =  
[2.404751] Oops:  [#1] SMP 
[2.408179] Modules linked in:
[2.408179] 
[2.408179] Pid: 1, comm: swapper Not tainted (2.6.25-rc1 #3)
[2.408179] EIP: 0060:[] EFLAGS: 00010206 CPU: 0
[2.408179] EIP is at sysfs_remove_link+0x1/0xd
[2.408179] EAX: 00f0 EBX: f7202cc8 ECX: f789eaf0 EDX: c0533e87
[2.408179] ESI: f793c970 EDI: ffed EBP: f78a1ea0 ESP: f78a1e90
[2.408179]  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
[2.408179] Process swapper (pid: 1, ti=f78a task=f789eaf0 
task.ti=f78a)
[2.408179] Stack: f78a1ea0 c02709e4 c0568920 f793c970 f78a1eb4 c0269fb5 
f793c970 f793cb7c 
[2.408179]c0568920 f78a1ecc c0269b25  f793cb7c  
c05689f0 f78a1ee0 
[2.408179]c02b6460 c05689f0 f793cb7c  f78a1ef4 c02b6528 
f793cb7c f78c332c 
[2.408179] Call Trace:
[2.408179]  [] ? acpi_processor_remove+0x82/0xb4
[2.408179]  [] ? acpi_start_single_object+0x3a/0x41
[2.408179]  [] ? acpi_device_probe+0x3b/0x79
[2.408179]  [] ? really_probe+0x74/0xf2
[2.408179]  [] ? driver_probe_device+0x37/0x40
[2.408179]  [] ? __driver_attach+0x76/0xaf
[2.408179]  [] ? bus_for_each_dev+0x38/0x5d
[2.408179]  [] ? kobject_init_and_add+0x20/0x22
[2.408179]  [] ? driver_attach+0x14/0x16
[2.408179]  [] ? __driver_attach+0x0/0xaf
[2.408179]  [] ? bus_add_driver+0x99/0x149
[2.408179]  [] ? driver_register+0x43/0x69
[2.408179]  [] ? acpi_bus_register_driver+0x3a/0x3c
[2.408179]  [] ? acpi_processor_init+0x70/0xa6
[2.408179]  [] ? kernel_init+0x0/0x88
[2.408179]  [] ? do_initcalls+0x75/0x18d
[2.408179]  [] ? create_proc_entry+0x67/0x7b
[2.408179]  [] ? register_irq_proc+0xa4/0xba
[2.408179]  [] ? pagemap_read+0x13a/0x1c2
[2.408179]  [] ? kernel_init+0x0/0x88
[2.408179]  [] ? do_basic_setup+0x1c/0x1e
[2.408179]  [] ? kernel_init+0x4d/0x88
[2.408179]  [] ? kernel_thread_helper+0x7/0x10
[2.408179]  ===
[2.408179] Code: c0 74 07 89 f0 e8 57 f4 ff ff 85 ff 74 11 90 ff 0f 0f 94 
c0 84 c0 74 07 89 f8 e8 42 f4 ff ff 8b 45 d8 83 c4 1c 5b 5e 5f 5d c3 55 <8b> 40 
1c 89 e5 e8 c 
[2.408179] EIP: [] sysfs_remove_link+0x1/0xd SS:ESP 0068:f78a1e90
[2.408191] ---[ end trace 778e504de7e3b1e3 ]---
[2.412183] Kernel panic - not syncing: Attempted to kill init!

I am booting using kexec with maxcpus=1. It does not have any problems
with maxcpus=2 or higher.
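
Judging from the trace, the processor driver's remove path trips over state
that was never set up for the CPUs ignored under maxcpus=1. A hypothetical
kernel-style sketch of the kind of defensive check such a path needs (an
illustration only, not the actual fix that went in; the function name is
made up):

/* Hypothetical sketch, not the actual fix: a remove path must tolerate
 * objects that were never created because the processor was ignored. */
static void example_processor_remove(struct acpi_processor *pr,
                                     struct acpi_device *device)
{
        if (pr && pr->cdev) {   /* cooling device may never have been set up */
                sysfs_remove_link(&device->dev.kobj, "thermal_cooling");
                sysfs_remove_link(&pr->cdev->device.kobj, "device");
                thermal_cooling_device_unregister(pr->cdev);
                pr->cdev = NULL;
        }
}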

config

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc1
# Wed Feb 13 17:30:43 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
# CONFIG_CPUSETS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y

Re: Regression in latest sched-git

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 10:04:44PM +0530, Dhaval Giani wrote:
> > > On the same lines, I cant understand how we can be seeing 700ms latency
> > > (below) unless we had: large number of active groups/users and large 
> > > number of 
> > > tasks within each group/user.
> > 
> > All I can say it that its trivial to reproduce these horrid latencies.
> > 
> 
> Hi Peter,
> 
> I've been trying to reproduce the latencies, and the worst I have
> managed is only 80ms. On average I am getting around 60 ms. This is with
> a make -j4 as root, and dhaval running other programs. (with maxcpus=1).
> 

Seems this was totally missed. Any more hints to reproduce?

-- 
regards,
Dhaval


Re: Regression in latest sched-git

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 01:51:18PM +0100, Peter Zijlstra wrote:
> 
> On Wed, 2008-02-13 at 08:30 +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Feb 12, 2008 at 08:40:08PM +0100, Peter Zijlstra wrote:
> > > Yes, latency isolation is the one thing I had to sacrifice in order to
> > > get the normal latencies under control.
> > 
> > Hi Peter,
> > I don't have easy solution in mind either to meet both fairness
> > and latency goals in a acceptable way.
> 
> Ah, do be careful with 'fairness' here. The single RQ is fair wrt cpu
> time, just not quite as 'fair' wrt latency.
> 
> > But I am puzzled at the max latency numbers you have provided below:
> > 
> > > The problem with the old code is that under light load: a kernel make
> > > -j2 as root, under an otherwise idle X session, generates latencies up
> > > to 120ms on my UP laptop. (uid grouping; two active users: peter, root).
> > 
> > If it was just two active users, then max latency should be:
> > 
> > latency to schedule user entity (~10ms?) +
> > latency to schedule task within that user 
> > 
> > 20-30 ms seems a more reasonable max latency to expect in this scenario.
> > 120ms seems abnormal, unless the user had large number of tasks.
> > 
> > On the same lines, I cant understand how we can be seeing 700ms latency
> > (below) unless we had: large number of active groups/users and large number 
> > of 
> > tasks within each group/user.
> 
> All I can say it that its trivial to reproduce these horrid latencies.
> 

Hi Peter,

I've been trying to reproduce the latencies, and the worst I have
managed is only 80ms. On average I am getting around 60 ms. This is with
a make -j4 as root, and dhaval running other programs. (with maxcpus=1).

> As for Ingo's setup, the worst that he does is run distcc with (32?)
> instances on that machine - and I assume he has that user niced waay
> down.
> 
> > > Others have reported latencies up to 300ms, and Ingo found a 700ms
> > > latency on his machine.
> > > 
> > > The source for this problem is I think the vruntime driven wakeup
> > > preemption (but I'm not quite sure). The other things that rely on
> > > global vruntime are sleeper fairness and yield. Now while I can't
> > > possibly care less about yield, the loss of sleeper fairness is somewhat
> > > sad (NB. turning it off with the old group scheduling does improve life
> > > somewhat).
> > > 
> > > So my first attempt at getting a global vruntime was flattening the
> > > whole RQ structure, you can see that patch in sched.git (I really ought
> > > to have posted that, will do so tomorrow).
> > 
> > We will do some exhaustive testing with this approach. My main concern
> > with this is that it may compromise the level of isolation between two
> > groups (imagine one group does a fork-bomb and how it would affect
> > fairness for other groups).
> 
> Again, be careful with the fairness issue. CPU time should still be
> fair, but yes, other groups might experience some latencies.
> 

I know I am missing something, but aren't we trying to reduce latencies
here?
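
For reference, the estimate quoted above composes per-level latencies; a
trivial back-of-envelope version of that arithmetic (all numbers are
illustrative assumptions, not measurements):

/* Back-of-envelope check of the estimate quoted above: with two-level
 * group scheduling, worst-case wakeup latency roughly composes as the
 * sum of the per-level latencies. All numbers are illustrative. */
#include <stdio.h>

int main(void)
{
        double user_level_ms = 10.0;    /* pick the user entity (~10ms?) */
        double task_level_ms = 15.0;    /* pick a task within that user */

        printf("expected worst case: ~%.0f ms\n",
               user_level_ms + task_level_ms);  /* ~25 ms, i.e. 20-30 ms */
        printf("observed: up to 120 ms, so something else is going on\n");
        return 0;
}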

-- 
regards,
Dhaval

