Re: [linux-pm] [RFC][PATCH] PM: Document requirements for basic PM support in drivers

2007-02-14 Thread Igor Stoppa
 of the drivers currently loaded and repeat
  +(that would probably involve rebooting the system, so always note what drivers
  +have been loaded before the test),
  +- if the test succeeds, load a half of the drivers you have unloaded most
  +recently and repeat.
  +
  +Once you have found the failing driver (there can be more than just one of
  +them), you have to unload it every time before the STD transition.  In that case
  +please make sure to report the problem with the driver.
 
 It is also possible that a cycle can still fail after you have unloaded
 all modules. In that case, you would want to look in your kernel
 configuration for possibilities that can be modularised (testing again
 with them as modules), and possibly also try boot time options such as
 noapic or noacpi.

The first step, imho, would be to identify all the peripherals required
for a barebone configuration to run (i.e. serial console) and verify
that at least those can reliably go through several suspend/resume
cycles. Then the dichotomic approach can be used.
 
  +
  +b) If the test mode of STD works, you can boot the system with init=/bin/bash
  +and attempt to suspend in the reboot, shutdown and platform modes.  If
  +that does not work, there probably is a problem with one of the low level
  +drivers and you generally cannot do much about it except for reporting it
  +(fortunately, that does not happen very often these days).  Otherwise, there is
 
 Oh. Perhaps some of the suggestions from above belong here?
 
  +a problem with a modular driver and you can find it by loading a half of the
  +modules you normally use and binary searching in accordance with the algorithm:
  +- if there are n modules loaded and the attempt to suspend and resume fails,
  +unload n/2 of the modules and try again (that would probably involve rebooting
  +the system),
  +- if there are n modules loaded and the attempt to suspend and resume succeeds,
  +load n/2 modules more and try again.
  +
  +Again, if you find the offending module(s), it(they) must be unloaded every time
  +before the STD transition, and please report the problem with it(them).
  +
  +2. To verify that the STR works, it is generally more convenient to use the
  +s2ram tool available from http://suspend.sf.net and documented at
  +http://en.opensuse.org/s2ram .  However, before doing that it is recommended to
  +carry out the procedure described in section 1.
  +
  +Assume you have resolved the problems with the STD and you have found some
  +failing drivers.  These drivers are also likely to fail during the STR or
  +during the resume, so it is better to unload them every time before the STR
  +transition.  Now, you can follow the instructions at
  +http://en.opensuse.org/s2ram to test the system, but if it does not work
  +out of the box, you may need to boot it with init=/bin/bash and test
  +s2ram in the minimal configuration.  In that case, you may be able to search
  +for failing drivers by following the procedure analogous to the one described in
  +1b).  If you find some failing drivers, you will have to unload them every time
  +before the STR transition (ie. before you run s2ram), and please report the
  +problem with them.
  +
  +II. Testing the driver
  +
  +Once you have resolved the suspend/resume-related problems with your test system
  +without the new driver, you are ready to test it:
Why not first get the driver working against the barebone configuration?

  +
  +1. Build the driver as a module, load it and try the STD in the test mode
  +(cf. 1a)).
  +
  +2. Compile the driver directly into the kernel and try the STD in the test mode
  +(cf. 1a)).
  +
  +3. Build the driver as a module, load it and attempt to suspend to disk in the
  +reboot, shutdown and platform modes (cf. 1).
  +
  +4. Compile the driver directly into the kernel and attempt to suspend to disk in
  +the reboot, shutdown and platform modes (cf. 1).
  +
  +5. Build the driver as a module, load it and attempt to run s2ram (cf. 2).
  +
  +6. Compile the driver directly into the kernel and attempt to run s2ram (cf. 2).
  +
  +Each of the above tests should be repeated several times and if any of them
  +fails, the driver cannot be regarded as suspend/resume-safe.
 
 Apart from the minor comments above, looks good to me.
 
 Regards,
 
 Nigel

-- 
Cheers, Igor

Igor Stoppa [EMAIL PROTECTED]
(Nokia M - OSSO /Helsinki Finland)


Re: [linux-pm] [RFC] sleepy linux

2007-12-26 Thread Igor Stoppa
Hi,
On Wed, 2007-12-26 at 00:07 +0100, ext Pavel Machek wrote:
 This is RFC. It does not even work for me... it sleeps but it will not
 wake up, because SATA wakeup code is missing. Code attached for illustration.
 
 I wonder if this is the right approach? What is right interface to the
 drivers?
 
 
   Sleepy Linux
   
 
 Copyright 2007 Pavel Machek [EMAIL PROTECTED]
 GPLv2
 
 Current Linux versions can enter suspend-to-RAM just fine, but they can
 only do it on explicit request. But suspend-to-RAM is important, eating
 something like 10% of the power needed for an idle system. Starting suspend
 manually is not too convenient; it is not an option on a multiuser
 machine, and even on a single-user machine, some things are not easy:
 
 1) Download this big chunk in mozilla, then go to sleep
 
 2) Compile this, then go to sleep

Why can't these cases be based on CPUIdle?

 3) You can sleep now, but wake me up in 8:30 with mp3 player

This is about setting up properly the wakeup sources which means:

- the wakeup source is really capable of generating wakeups for the
target idle state

- the wakeup source is not actually capable of generating wakeups from
the target idle state, which can be solved in two ways:

- if the duration of the activity is known, set up an alarm 
  (assuming alarms are proper wakeup sources) so that the
   system is ready just in time, in a less efficient but more
   responsive power saving state

- if the duration of the activity is unknown choose the more 
  efficient amongst the following solutions:

- go to deep sleep state and periodically wakeup and
  poll, with a period compatible with the timing 
  of the event source

- prevent too deep sleep states till the event happens
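
For the fixed-deadline case (the 8:30 example above), a minimal userspace
sketch could be to arm the RTC wakeup alarm before letting the system
suspend; /dev/rtc0 and the hardcoded time are illustrative only, and the
error handling is reduced to the bare minimum:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

/* Sketch: program the RTC to fire at 08:30 so that a suspended system
 * wakes up in time; the caller is then free to suspend the machine. */
static int arm_rtc_alarm(void)
{
	struct rtc_wkalrm alarm = { 0 };
	int fd = open("/dev/rtc0", O_RDONLY);	/* illustrative device node */

	if (fd < 0)
		return -1;
	if (ioctl(fd, RTC_RD_TIME, &alarm.time) < 0)	/* keep today's date */
		goto err;
	alarm.time.tm_hour = 8;
	alarm.time.tm_min = 30;
	alarm.time.tm_sec = 0;
	alarm.enabled = 1;
	if (ioctl(fd, RTC_WKALM_SET, &alarm) < 0)	/* arm the wakeup alarm */
		goto err;
	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}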

 Today's hardware is mostly capable of doing better: with correctly set
 up wakeups, machine can sleep and successfully pretend it is not
 sleeping -- by waking up whenever something interesting happens. Of
 course, it is easier on machines not connected to the network, and on
 notebook computers.

It might be that some hw doesn't provide a deep power saving state for
some devices, but if the only missing feature is the wakeup capability,
it could be effectively replaced by some HW timer.


 Requirements:
 
 0) Working suspend-to-RAM, with kernel being able to bring video back.
 
 1) RTC clock that can wake up system
 
 2) Lid that can wake up a system,
or keyboard that can wake up the system and does not lose keypresses
or special screensaver setup
 
 3) Network card that is either down
or can wake up the system on any packet (and not lose too many packets)

These are just a few system-specific cases, but if you start including USB
devices, the situation is going to get quite complicated very quickly, if
you explicitly include certain HW devices in your model.


-- 
Cheers, Igor

Igor Stoppa [EMAIL PROTECTED]
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [linux-pm] Power Management framework proposal

2007-07-22 Thread Igor Stoppa
Hi,
On Sat, 2007-07-21 at 23:49 -0700, ext
[EMAIL PROTECTED] wrote:
 I'm deliberately breaking the threading on this so that people who have 
 tuned out the hibernation thread can take a look at this.
 
 below is the proposal that I made at the bottom of one of the posts on the 
 hibernation thread.

I have the impression that you are trying to describe a mix of the clock
and latency frameworks.

Could you elaborate on how your proposal is incompatible with enhancing
the clock framework? 

It looks like you are proposing a brand new shiny thing that frankly I
would be happy to leave alone, unless it is crystal clear that the clock
fw cannot be improved.

The clock fw is used for OMAP and other architectures (including SH,
iirc) and so far it has provided very good support for our power
management needs (Nokia 770 and N800).

Currently we are working on DVFS for OMAP2 (see slides presented at the
linux-pm summit for OLS 2007 http://tinyurl.com/28tact ) and even if the
current prototype is not actively involving the clock fw, our final goal
is to make it capable of supporting atomic transactions for changing the
core parameters.

OMAP3 will require a suspend-to-RAM implementation where the content of
system memory is retained, while parts of, or all of, the SoC are switched off.
The plan is still to have a clock fw based implementation (plus
interaction with the power rails, of course).

I think these are good examples of the non-ACPI systems you are
mentioning.

To make any proposal that has some chance of being accepted, you have to
compare it against the existing solution, explaining:

-what it is bringing in terms of new functionalities
-how it is different
-why the current implementation cannot simply be enhanced

You can refer to the linux-pm archives for examples of failed attempts
over the last year or so, just search for framework in the subject.

-- 
Cheers, Igor

Igor Stoppa [EMAIL PROTECTED]
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [linux-pm] Power Management framework proposal

2007-07-22 Thread Igor Stoppa
On Sun, 2007-07-22 at 01:58 -0700, ext [EMAIL PROTECTED] wrote:
 On Sun, 22 Jul 2007, Igor Stoppa wrote:

[snip]

  Could you elaborate on how your proposal is incompatible with enhancing
  the clock framework?
 
 It's not that I think it's incompatible with any existing powersaving 
 tools (in fact I hope it's not)
 
 it's that I think that this (or something similar) could be made to cover 
 all the various power options instead of CPUs having one interface, ACPI 
 capable drivers having another, embedded devices presenting a third, etc
 
 this was triggered by the mess of different function calls for different 
 purposes that are used for the suspend functions where you have a bunch of 
 different functions that are each supposed to be called at a specific time 
 from a specific mode during the suspend process. with all these different 
 functions driver writers tend to not bother implementing any of them, and 
 it seems like there is a fairly steady stream of new functions that end up 
 being needed. the initial intent was to just change this into a generic 
 set of calls that every driver writer would implement the minimum set of, 
 and make it trivially extensible to future capabilities of hardware.

Every now and then there is some attempt to find One solution to bind
them all: x86, SoC, ACPI ... you name it.

Unfortunately, while it's true that there are significant similarities,
there are also notable differences; as far as I know the USB subsystem
is the one that gets closest to what we have in the embedded arena, since
it can have complex cases of parent-child powering and wakeup.

 one other effect of this is that driver writers would see the mode 
 interface from day one rather than just completely ignoring it. right now 
 device driver authors tend to think: why worry about figuring out how to 
 implement 'prepare to suspend', 'late suspend', 'suspend', 'quiesce but 
 don't suspend', etc if they aren't really interested in working on 
 suspend, it's not really clear what each of these should do even after 
 reading the docs on it. however listing the power modes that a device can 
 be in, documenting the cost of switching between them, and implementing 
 the transition is something very straightforward for the device driver 
 author to do (and they don't have to worry about the details of how and 
 when the various modes get used, that's up to the suspend/powersaving 
 software to figure out). as such I expect that the driver support for 
 powersaving modes to improve. in fact, I expect that some driver writers 
 will implement a whole bunch of modes, just to show off the features of 
 the hardware. and even if nothing uses the modes right now at least they 
 are implemented and documented for future use (and it should be trivial to 
 have a test routine that just runs every driver you have hardware for 
 through every mode transition to make sure that they all work, so the less 
 commonly used modes shouldn't bitrot too badly)

What you are saying can be summarised as making the driver model more
expressive.

 while I was describing the issues to my roomates over dinner I realized 
 that the same type of functions are needed for the CPU clocks.
 
 if you have an accepted framework in place there that can do what I 
 described, please consider extending it to cover other types of devices 
 and drivers.

That is not part of the fw: the fw simply expresses parent-child clock
distribution and keeps usecounts so that unused clocks are automatically
gated.

The actual clock tree description is platform/arch/board specific and
doesn't affect the framework. You can just roll your own version for x86
by providing a description of the methods used to switch on/off every
individual clock on your board.

So what you are asking for is that somebody writes an x86 version of the
clock fw.
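
For reference, the consumer side of that framework is intentionally tiny; a
driver typically does little more than the sketch below (the clock name and
the driver function names are illustrative, the clk_* calls are the existing
API):

#include <linux/clk.h>
#include <linux/err.h>
#include <linux/device.h>

/* Sketch: grab the functional clock and ungate it while the device is in
 * use; the framework keeps the usecounts and gates the parents once the
 * last child has been disabled. */
static struct clk *fck;

static int example_start(struct device *dev)
{
	fck = clk_get(dev, "fck");	/* look the clock up by connection id */
	if (IS_ERR(fck))
		return PTR_ERR(fck);
	return clk_enable(fck);		/* usecount++, ungate if first user */
}

static void example_stop(void)
{
	clk_disable(fck);		/* usecount--, gate when it reaches 0 */
	clk_put(fck);
}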

As for latencies, well, only a few clocks really have a significant impact,
most notably the main system oscillator. Everything else has ~0 latency,
since it ends up in opening/closing a clock gate.

Powering a device on/off will certainly introduce more latency, but either
the powering is supported by the hw, making it quick, or it has to go
through most, if not all, of the usual initialisation sequence; in that
case it probably makes sense to avoid controlling it from kernelspace,
since it will be slow and won't require decisions made with microsecond
precision.

 I want sanity and functionality far more than credit :-)

I want to avoid redesigning the wheel: the current version is not round
yet, but re-starting from a triangle every time is far less appealing.

 thanks for the link. I've read through it, and it looks like there are a 
 lot of the same ideas in your proposal. 

Unless some new hw/technology shows up, I'm afraid the available set of
ideas is very limited :-)

 I think you are passing too much 
 info up the chain to the part making the decision (that part doesn't need 
 to know the details of the voltage/freq choices

Re: [linux-pm] Power Management framework proposal

2007-07-23 Thread Igor Stoppa
  broadening of functionality should it retain the same name, or should it
  get renamed to something more generic?
 
  cpufreq could be renamed to anything that makes sense, but i see _no_
  massive broadening of functionality.
 
 what I'm talking about would provide an API to devices that you are 
 ignoring because they should be managed from userspace.

again, HAL / OHM / Mobilin

  the cpufreq implementation is very close to what I'm proposing, it would
  need to get broadend to cover other devices (like disk drives, wireless
  cards, etc), is this really the right thing to do or should the more
  generic API go in for external use and then the existing cpufreq be called
  from the set_mode() call?
 
  No, that doesn't make sense as a general approach.
  You want to manage from kernel only those parts of the system where the
  latency is so low that userspace wouldn't be able to keep up.
 
  Your examples (wireless, disk drive) can be easily controlled from
  userspace, with a timeout.
 
 absolutely, and they should be (at least most of the time). this was not 
 intended as a kernelspace only api. it is intended to be available to both 
 kernelspace and userspace.
 
  In both cases there are significant delays (change of rotation speed /
  sync with the access point).
 
 correct, and these delays should be reflected in the transition cost 
 matrix
 
  All this is hand waving unless it is backed up by numbers.
  Real cases are required in order to establish a list of priorities for
  latency/power consumption.
 
 this isn't attempting to establish a list of priorities, simply to give the 
 software that is trying to establish such a list the info to make its 
 decisions, and the interface to use to issue the resulting instructions.

What I'm saying is that sw is implemented to fulfil certain needs. I'd
rather see a detailed description of the need and, based on that, debate
the actual API / implementation.
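
Just to make the discussion above less abstract, the per-device "modes plus
transition costs" table being proposed would presumably boil down to
something like the sketch below; none of these structures exist in the
kernel, the names and fields are purely illustrative:

#include <linux/device.h>

/* Illustrative only: one entry per power mode a device supports, plus a
 * square matrix of switching latencies between those modes. */
struct pm_mode {
	const char *name;		/* e.g. "on", "idle", "sleep", "off" */
	unsigned int power_uw;		/* rough consumption in that mode */
};

struct pm_mode_table {
	unsigned int nr_modes;
	const struct pm_mode *modes;
	const unsigned int *transition_cost_us;	/* nr_modes x nr_modes */
	int (*set_mode)(struct device *dev, unsigned int mode);
};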

-- 
Cheers, Igor

Igor Stoppa [EMAIL PROTECTED]
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [linux-pm] Power Management framework proposal

2007-07-24 Thread Igor Stoppa
On Tue, 2007-07-24 at 10:43 +0200, ext Jerome Glisse wrote:

 I believe a central place where the user can set/change hw state to save
 power or to increase computational power is definitely a goal to pursue.
 But I truly think that the OHM approach is the best one, i.e. using plugins
 so that one can make a plugin specific to each device. The point is that
 I believe there is no way to do an abstract interface for this, and trying to
 do so will end up producing ugly code; any interface would fail to encompass
 all possible tweaks that might exist for all devices.
 
 For instance on graphics card you could do the following (maybe more):
 -change GPU clock
 -change memory clock
 -disable part of engine
 -disable unit
 I truly don't think you can make a common interface for all this; moreover
 there might be constraints on how you can change things (GPU and memory
 clocks might need to follow a given ratio). So you definitely
 need knowledge in the user space program to handle this.

An even simpler case: LCD backlights can come in many flavors, both in terms
of brightness levels and of the fixed amount of current required to keep them ON.

Trying to abstract such details from the decision-making makes little
sense.
Isolating that into a separate module, instead, brings the best of both
worlds:
-containment of the HW-specific code
-leveraging every possible, no matter how exotic, power saving mode
available.


-- 
Cheers, Igor

Igor Stoppa [EMAIL PROTECTED]
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-03 Thread Igor Stoppa
On 02/08/17 20:08, Jerome Glisse wrote:
> On Wed, Aug 02, 2017 at 06:14:28PM +0300, Igor Stoppa wrote:

[...]

>> +set_page_private(page, 1);
> 
> Above line is pointless you overwrite value right below

yes ...
> 
>> +page->private = pmalloc_signature;
>> +} else {
>> +BUG_ON(!(page_private(page) &&
>> + page->private == pmalloc_signature));
>> +set_page_private(page, 0);
> 
> Same as above

... and yes

>> +page->private = 0;
>> +}
>> +base += PAGE_SIZE;
>> +} while ((PAGE_MASK & (unsigned long)base) <=
>> + (PAGE_MASK & (unsigned long)end));
>> +return 0;
>> +}
>>
>> ...
>>
>> +static const char msg[] = "Not a valid Pmalloc object.";
>> +const char *pmalloc_check_range(const void *ptr, unsigned long n)
>> +{
>> +	unsigned long p;
>> +
>> +	p = (unsigned long)ptr;
>> +	n = p + n - 1;
>> +	for (; (PAGE_MASK & p) <= (PAGE_MASK & n); p += PAGE_SIZE) {
>> +		struct page *page;
>> +
>> +		if (!is_vmalloc_addr((void *)p))
>> +			return msg;
>> +		page = vmalloc_to_page((void *)p);
>> +		if (!(page && page_private(page) &&
>> +		      page->private == pmalloc_signature))
>> +			return msg;
>> +	}
>> +	return NULL;
>> +}
>>
>>
>> The problem here comes from the way I am using page->private:
>> the fact that the page is marked as private means only that someone is
>> using it, and the way it is used could create (spoiler: it happens) a
>> collision with pmalloc_signature, which can generate false positives.
> 
> Is page->private used for vmalloc memory ? If so then pick another field.

No, it is not in use by vmalloc, as far as I can tell, both by reading
the code and by empirically printing out its value in a few cases.

> Though I doubt it is used, I would need to check. What was the exact
> objection made ?

The objection made is what I tried to explain below: the comment
beside the declaration of the private field says:
"Mapping-private opaque data: ..."

I'll reply to your answer there.

>> A way to ensure that the address really belongs to pmalloc would be to
>> pre-screen it, against either the signature or some magic number and,
>> if such test is passed, then compare the address against those really
>> available in the pmalloc pools.
>>
>> This would be slower, but it would be limited only to those cases where
>> the signature/magic number matches and the answer is likely to be true.
>>
>> 2) However, both the current (incorrect) implementation and the one I am
>> considering, are abusing something that should be used otherwise (see
>> the following snippet):
>>
>> from include/linux/mm_types.h:
>>
>> struct page {
>> ...
>>   union {
>> unsigned long private;   /* Mapping-private opaque data:
>>   * usually used for buffer_heads
>>   * if PagePrivate set; used for
>>   * swp_entry_t if PageSwapCache;
>>   * indicates order in the buddy
>>   * system if PG_buddy is set.
>>   */
>> #if USE_SPLIT_PTE_PTLOCKS
>> #if ALLOC_SPLIT_PTLOCKS
>>  spinlock_t *ptl;
>> #else
>>  spinlock_t ptl;
>> #endif
>> #endif
>>  struct kmem_cache *slab_cache;  /* SL[AU]B: Pointer to slab */
>>  };
>> ...
>> }
>>
>>
>> The "private" field is meant for mapping-private opaque data, which is
>> not how I am using it.
> 
> As you can see this is a union and thus the meaning of that field depends
> on how the page is used. The private comment you see is only meaningful for
> pages that are in the page cache and are coming from a file system, i.e. when
> a process does an mmap of a file. When a page is used by sl[au]b the slab_cache
> field is how it is interpreted ... The context in which a page is used does matter.

I am not a native English speaker, but the comment seems to imply that, no
matter what, it's Mapping-private.

If the "Mapping-private" were dropped or somehow connected exclusively to
the cases listed in the comment, then I think it would be more clear
that the com

Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-03 Thread Igor Stoppa


On 03/08/17 16:55, Michal Hocko wrote:
> On Thu 03-08-17 15:20:31, Igor Stoppa wrote:
>> On 03/08/17 14:48, Michal Hocko wrote:
>>> On Thu 03-08-17 13:11:45, Igor Stoppa wrote:

[...]

>>>> But, to reply more specifically to your advice, yes, I think I could add
>>>> a flag to vm_struct and then retrieve its value, for the address being
>>>> processed, by passing through find_vm_area().
>>>
>>> ... and you can store vm_struct pointer to the struct page there 
>>
>> "there" as in the new field of the union?
>> btw, what would be a meaningful name, since "private" is already taken?
>>
>> For simplicity, I'll use, for now, "private2"
> 
> why not explicit vm_area?

ok :-)

>>> and you won't need to do the slow find_vm_area. I haven't checked
>>> very closely but this should be possible in principle. I guess other callers
>>> might benefit from this as well.
>>
>> I am confused about this: if "private2" is a pointer, but when I get an
>> address, I do not even know if the address represents a valid pmalloc
>> page, how can I know when it's ok to dereference "private2"?
> 
> because you can make all pages which back vmalloc mappings have vm_area
> pointer set.

Ah, now I see, I had missed that the field would be set for *all* pages
backed by vmalloc.

So, given a pointer, I still have to figure out if it refers to a
vmalloc area or not.

However, that is something I need to do anyway, to get the reference to
the corresponding page struct, in case it is indeed a vmalloc address.
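
In other words, once every vmalloc-backed page carried such a pointer, the
check being discussed would presumably collapse into something like the
sketch below; page->vm_area and VM_PMALLOC do not exist at this point, they
stand for the new union member and for a flag that would mark pmalloc areas:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Sketch: decide whether ptr falls inside a pmalloc area, assuming every
 * vmalloc-backed page stores its vm_struct in a (hypothetical) vm_area
 * field and pmalloc areas carry a (hypothetical) VM_PMALLOC flag. */
static bool is_pmalloc_addr(const void *ptr)
{
	struct page *page;

	if (!is_vmalloc_addr(ptr))		/* cheap filter first */
		return false;
	page = vmalloc_to_page(ptr);
	if (!page || !page->vm_area)		/* hypothetical union member */
		return false;
	return !!(page->vm_area->flags & VM_PMALLOC);	/* hypothetical flag */
}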

--
thanks, igor



Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-03 Thread Igor Stoppa


On 03/08/17 17:47, Jerome Glisse wrote:
> On Thu, Aug 03, 2017 at 03:55:50PM +0200, Michal Hocko wrote:
>> On Thu 03-08-17 15:20:31, Igor Stoppa wrote:

[...]

>>> I am confused about this: if "private2" is a pointer, but when I get an
>>> address, I do not even know if the address represents a valid pmalloc
>>> page, how can i know when it's ok to dereference "private2"?
>>
>> because you can make all pages which back vmalloc mappings have vm_area
>> pointer set.
> 
> Note that I think this might break some device drivers that use vmap():
> I think some of them use the private field to store device-driver-specific
> information. But there is likely an unused field in struct page that
> can be used for that.

This increases the unease from my side ... it looks like there is no way
to fully understand if a field is really used or not, without having
deep intimate knowledge of lots of code that is only marginally involved :-/

Similarly, how would I be able to document the correct way of deciding
which member of the union to use when handling the field?

If there were either some sort of non-multiplexed tag/cookie field or a
function, that would specify how to treat the various unions, then it
would be easier to multiplex the remaining data, according to how the
page is used.

--
igor


Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-04 Thread Igor Stoppa


On 03/08/17 18:15, Michal Hocko wrote:

> I would check the one where we have mapping. It is rather unlikely
> vmalloc users would touch this one.

That was also the initial recommendation from Jerome Glisse, but it
seemed unusable, because of the related comment.

I should have asked for clarifications back then :-(

But it's never too late ...


struct page {
  /* First double word block */
  unsigned long flags;  /* Atomic flags, some possibly
 * updated asynchronously */
union {
struct address_space *mapping;  /* If low bit clear, points to
 * inode address_space, or NULL.
 * If page mapped as anonymous
 * memory, low bit is set, and
 * it points to anon_vma object:
 * see PAGE_MAPPING_ANON below.
 */
...
}

mapping seems to be used exclusively in 2 ways, based on the value of
its lower bit.

Therefore I discarded it as a valid option ("private", otoh, was far more
alluring), but maybe I could wrap it inside a union, together with vm_area?
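
To make the question concrete, the wrapping being considered would presumably
look like the fragment below; this is a hypothetical sketch of how the field
could be declared, not what mm_types.h actually contains:

struct address_space;
struct vm_struct;

/* Hypothetical: let vmalloc-backed pages reinterpret the "mapping" word
 * as a pointer to their vm_struct, while page-cache/anon pages keep the
 * current meaning. */
struct page_sketch {
	union {
		struct address_space *mapping;	/* page cache / anon, as today */
		struct vm_struct *vm_area;	/* pages backing a vmalloc mapping */
	};
	/* ... the rest of struct page stays unchanged ... */
};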

---
thanks, igor


Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-03 Thread Igor Stoppa
On 03/08/17 14:48, Michal Hocko wrote:
> On Thu 03-08-17 13:11:45, Igor Stoppa wrote:
>> On 02/08/17 20:08, Jerome Glisse wrote:
>>> On Wed, Aug 02, 2017 at 06:14:28PM +0300, Igor Stoppa wrote:

[...]

>>>> from include/linux/mm_types.h:
>>>>
>>>> struct page {
>>>> ...
>>>>   union {
>>>> unsigned long private; /* Mapping-private opaque data:
>>>> * usually used for buffer_heads
>>>> * if PagePrivate set; used for
>>>> * swp_entry_t if PageSwapCache;
>>>> * indicates order in the buddy
>>>> * system if PG_buddy is set.
>>>> */

[...]

>> If the "Mapping-private" was dropped or somehow connected exclusively to
>> the cases listed in the comment, then I think it would be more clear
>> that the comment needs to be intended as related to mapping in certain
>> cases only.
>> But it is otherwise ok to use the "private" field for whatever purpose
>> it might be suitable, as long as it is not already in use.
> 
> I would recommend adding a new field into the enum...

s/enum/union/ ?

If not, I am not sure which enum you are talking about.

[...]

>> But, to reply more specifically to your advice, yes, I think I could add
>> a flag to vm_struct and then retrieve its value, for the address being
>> processed, by passing through find_vm_area().
> 
> ... and you can store vm_struct pointer to the struct page there 

"there" as in the new field of the union?
btw, what would be a meaningful name, since "private" is already taken?

For simplicity, I'll use, for now, "private2"

> and you won't need to do the slow find_vm_area. I haven't checked
> very closely but this should be possible in principle. I guess other callers
> might benefit from this as well.

I am confused about this: if "private2" is a pointer, but when I get an
address, I do not even know if the address represents a valid pmalloc
page, how can I know when it's ok to dereference "private2"?

Since it's just another field in a union, it can actually contain a
value that should be interpreted as some other field, right?

--
thanks, igor


Re: [PATCH v10 0/3] mm: security: ro protection for dynamic data

2017-07-11 Thread Igor Stoppa

On 11/07/17 14:12, Tetsuo Handa wrote:
> Igor Stoppa wrote:
>> - I had to rebase Tetsuo Handa's patch because it didn't apply cleanly
>>   anymore, I would appreciate an ACK to that or a revised patch, whatever 
>>   comes easier.
> 
> Since we are getting several proposals of changing LSM hooks and both your 
> proposal
> and Casey's "LSM: Security module blob management" proposal touch same files, 
> I think
> we can break these changes into small pieces so that both you and Casey can 
> make
> future versions smaller.
> 
> If nobody has objections about direction of Igor's proposal and Casey's 
> proposal,
> I think merging only "[PATCH 2/3] LSM: Convert security_hook_heads into 
> explicit
> array of struct list_head" from Igor's proposal and ->security accessor 
> wrappers (e.g.

I would like to understand whether there is still interest in:

* "[PATCH 1/3] Protectable memory support"  which was my main interest
* "[PATCH 3/3] Make LSM Writable Hooks a command line option"
  which was the example of how to use [1/3]

>   #define selinux_security(obj) (obj->security)
>   #define smack_security(obj) (obj->security)
>   #define tomoyo_security(obj) (obj->security)
>   #define apparmor_security(obj) (obj->security)

For example, I see that there are various kzalloc calls that might be
useful to turn into pmalloc ones.

In general, I'd think that, after a transient phase is complete, in which
modules are loaded and allocate their dynamic data structures, those
structures could be locked down in read-only mode.
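
To make that concrete, the conversion of one such kzalloc'd structure would
presumably look like the sketch below; pmalloc_create_pool() and pmalloc()
are declared in the pmalloc patch, pmalloc_protect_pool() is the call used in
the LSM example patch, while my_config and the values written to it are
purely illustrative:

#include <linux/init.h>
#include <linux/bug.h>
#include <linux/pmalloc.h>

static struct gen_pool *pool;

static struct my_config {		/* illustrative, formerly kzalloc'd */
	int max_users;
	int debug;
} *cfg;

static int __init example_init(void)
{
	/* 1) get a pool handle */
	pool = pmalloc_create_pool("example", PMALLOC_DEFAULT_ALLOC_ORDER);
	BUG_ON(!pool);

	/* 2) allocate and initialise data while the pool is still writable */
	cfg = pmalloc(pool, sizeof(*cfg));
	BUG_ON(!cfg);
	cfg->max_users = 16;
	cfg->debug = 0;

	/* 3) seal the pool: from here on, writes to cfg would fault */
	pmalloc_protect_pool(pool);
	return 0;
}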

I have the feeling that, now that I have polished up the pmalloc patch,
the proposed use case is fading away.

Can it be adjusted to the new situation or should I look elsewhere for
an example that would justify merging pmalloc?


thanks, igor


[PATCH 1/3] Protectable memory support

2017-07-10 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 arch/Kconfig|   1 +
 include/linux/pmalloc.h | 127 +
 lib/Kconfig |   1 +
 mm/Makefile |   1 +
 mm/pmalloc.c| 372 
 mm/pmalloc.h|  17 +++
 mm/pmalloc_usercopy.h   |  38 +
 mm/usercopy.c   |  23 +--
 8 files changed, 571 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/pmalloc.h
 create mode 100644 mm/pmalloc_usercopy.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 6c00e5b..9d16b51 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -228,6 +228,7 @@ config GENERIC_IDLE_POLL_SETUP
 
 # Select if arch has all set_memory_ro/rw/x/nx() functions in asm/cacheflush.h
 config ARCH_HAS_SET_MEMORY
+   select GENERIC_ALLOCATOR
bool
 
 # Select if arch init_task initializer is different to init/init_task.c
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..a374e5e
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,127 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/*
+ * Library for dynamic allocation of pools of memory that can be,
+ * after initialization, marked as read-only.
+ *
+ * This is intended to complement __read_only_after_init, for those cases
+ * where either it is not possible to know the initialization value before
+ * init is completed, or the amount of data is variable and can be
+ * determined only at runtime.
+ *
+ * ***WARNING***
+ * The user of the API is expected to synchronize:
+ * 1) allocation
+ * 2) writes to the allocated memory
+ * 3) write protection of the pool
+ * 4) freeing of the allocated memory
+ * 5) destruction of the pool
+ *
+ * For a non threaded scenario, this type of locking is not even required.
+ *
+ * Even if the library were to provided support for the locking, point 2)
+ * would still depend on the user to remember taking the lock.
+ *
+ */
+
+
+/**
+ * pmalloc_create_pool - create a new protectable memory pool -
+ * @name: the name of the pool, must be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ *   from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Returns a pointer to the new pool, upon success, otherwise a NULL.
+ */
+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);
+
+
+/**
+ * pmalloc - allocate protectable memory from a pool
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Allocates memory from an unprotected pool. If the pool doesn't have
+ * enough memory, an attempt is made to add to the pool a new chunk of
+ * memory (multiple of PAGE_SIZE) that can fit the new request.
+ *
+ * Returns the pointer to the memory requested, upon success,
+ * NULL otherwise (either no memory available or pool RO).
+ *
+ */
+void *pmalloc(struct gen_pool *pool, size_t size);
+
+
+
+/**
+ * pmalloc_free - release memory previou

[PATCH 3/3] Make LSM Writable Hooks a command line option

2017-07-10 Thread Igor Stoppa
This patch shows how it is possible to take advantage of pmalloc:
instead of using the build-time option __lsm_ro_after_init, to decide if
it is possible to keep the hooks modifiable, now this becomes a
boot-time decision, based on the kernel command line.

This patch relies on:

"Convert security_hook_heads into explicit array of struct list_head"
Author: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

to break free from the static constraint imposed by the previous
hardening model, based on __ro_after_init.

The default value is disabled, unless SE Linux debugging is turned on.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 security/security.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index 44c47b6..c7b4670 100644
--- a/security/security.c
+++ b/security/security.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define MAX_LSM_EVM_XATTR  2
@@ -34,10 +35,19 @@
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX  10
 
-static struct list_head hook_heads[LSM_MAX_HOOK_INDEX]
-   __lsm_ro_after_init;
 static ATOMIC_NOTIFIER_HEAD(lsm_notifier_chain);
 
+static int dynamic_lsm = IS_ENABLED(CONFIG_SECURITY_SELINUX_DISABLE);
+
+static __init int set_dynamic_lsm(char *str)
+{
+   get_option(&str, &dynamic_lsm);
+   return 0;
+}
+early_param("dynamic_lsm", set_dynamic_lsm);
+
+static struct list_head *hook_heads;
+static struct gen_pool *sec_pool;
 char *lsm_names;
 /* Boot-time LSM user choice */
 static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =
@@ -62,6 +72,11 @@ int __init security_init(void)
 {
enum security_hook_index i;
 
+   sec_pool = pmalloc_create_pool("security", PMALLOC_DEFAULT_ALLOC_ORDER);
+   BUG_ON(!sec_pool);
+   hook_heads = pmalloc(sec_pool,
+sizeof(struct list_head) * LSM_MAX_HOOK_INDEX);
+   BUG_ON(!hook_heads);
for (i = 0; i < LSM_MAX_HOOK_INDEX; i++)
    INIT_LIST_HEAD(&hook_heads[i]);
pr_info("Security Framework initialized\n");
@@ -77,7 +92,8 @@ int __init security_init(void)
 * Load all the remaining security modules.
 */
do_security_initcalls();
-
+   if (!dynamic_lsm)
+   pmalloc_protect_pool(sec_pool);
return 0;
 }
 
-- 
2.9.3



[PATCH 2/3] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-07-10 Thread Igor Stoppa
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Commit 3dfc9b02864b19f4 ("LSM: Initialize security_hook_heads upon
registration.") treats "struct security_hook_heads" as an implicit array
of "struct list_head" so that we can eliminate code for static
initialization. Although we haven't encountered compilers which do not
treat sizeof(security_hook_heads) != sizeof(struct list_head) *
(sizeof(security_hook_heads) / sizeof(struct list_head)), Casey does not
like the assumption that a structure of N elements can be assumed to be
the same as an array of N elements.

Now that Kees found that randstruct complains about such casting

  security/security.c: In function 'security_init':
  security/security.c:59:20: note: found mismatched op0 struct pointer
types: 'struct list_head' and 'struct security_hook_heads'

struct list_head *list = (struct list_head *) &security_hook_heads;

and Christoph thinks that we should fix it rather than make randstruct
whitelist it, this patch fixes it.

It would be possible to revert commit 3dfc9b02864b19f4, but this patch
converts security_hook_heads into an explicit array of struct list_head
by introducing an enum, due to reasons explained below.

Igor proposed a sealable memory allocator, and the LSM hooks
("struct security_hook_heads security_hook_heads" and
"struct security_hook_list ...[]") will benefit from that allocator via
protection using set_memory_ro()/set_memory_rw(), and that allocator
will remove CONFIG_SECURITY_WRITABLE_HOOKS config option. Thus, we will
likely be moving to that direction.

This means that these structures will be allocated at run time using
that allocator, and therefore the address of these structures will be
determined at run time rather than compile time.

But currently, LSM_HOOK_INIT() macro depends on the address of
security_hook_heads being known at compile time. If we use an enum
so that LSM_HOOK_INIT() macro does not need to know absolute address of
security_hook_heads, it will help us to use that allocator for LSM hooks.

As a result of introducing an enum, security_hook_heads becomes a local
variable. In order to pass 80 columns check by scripts/checkpatch.pl ,
rename security_hook_heads to hook_heads.
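
In practice the change to the registration macro is roughly the following (a
sketch of the before/after shape only; the actual member names in the patch
may differ):

/* Before: the initializer needs the absolute address of the list head,
 * which forces security_hook_heads to be a compile-time symbol. */
#define LSM_HOOK_INIT_OLD(HEAD, HOOK) \
	{ .head = &security_hook_heads.HEAD, .hook = { .HEAD = HOOK } }

/* After: only the enum index is recorded, so the array of list heads can
 * be allocated at run time (e.g. from a pmalloc pool) and indexed later. */
#define LSM_HOOK_INIT_NEW(HEAD, HOOK) \
	{ .head = LSM_##HEAD, .hook = { .HEAD = HOOK } }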

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Rebased-by: Igor Stoppa <igor.sto...@huawei.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: Paul Moore <p...@paul-moore.com>
Cc: Stephen Smalley <s...@tycho.nsa.gov>
Cc: Casey Schaufler <ca...@schaufler-ca.com>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Igor Stoppa <igor.sto...@huawei.com>
Cc: Christoph Hellwig <h...@infradead.org>
---
 include/linux/lsm_hooks.h | 420 +++---
 security/security.c   |  31 ++--
 2 files changed, 227 insertions(+), 224 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 3cc9d77..32f30fa 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1694,225 +1694,226 @@ union security_list_options {
 #endif /* CONFIG_AUDIT */
 };
 
-struct security_hook_heads {
-   struct list_head binder_set_context_mgr;
-   struct list_head binder_transaction;
-   struct list_head binder_transfer_binder;
-   struct list_head binder_transfer_file;
-   struct list_head ptrace_access_check;
-   struct list_head ptrace_traceme;
-   struct list_head capget;
-   struct list_head capset;
-   struct list_head capable;
-   struct list_head quotactl;
-   struct list_head quota_on;
-   struct list_head syslog;
-   struct list_head settime;
-   struct list_head vm_enough_memory;
-   struct list_head bprm_set_creds;
-   struct list_head bprm_check_security;
-   struct list_head bprm_secureexec;
-   struct list_head bprm_committing_creds;
-   struct list_head bprm_committed_creds;
-   struct list_head sb_alloc_security;
-   struct list_head sb_free_security;
-   struct list_head sb_copy_data;
-   struct list_head sb_remount;
-   struct list_head sb_kern_mount;
-   struct list_head sb_show_options;
-   struct list_head sb_statfs;
-   struct list_head sb_mount;
-   struct list_head sb_umount;
-   struct list_head sb_pivotroot;
-   struct list_head sb_set_mnt_opts;
-   struct list_head sb_clone_mnt_opts;
-   struct list_head sb_parse_opts_str;
-   struct list_head dentry_init_security;
-   struct list_head dentry_create_files_as;
+enum security_hook_index {
+   LSM_binder_set_context_mgr,
+   LSM_binder_transaction,
+   LSM_binder_transfer_binder,
+   LSM_binder_transfer_file,
+   LSM_ptrace_access_check,
+   LSM_ptrace_traceme,
+   LSM_capget,
+   LSM_capset,
+   LSM_capable,
+   LSM_quotactl,
+   LSM_quota_on,
+   LSM_syslog,
+   LSM_settime,
+   

[PATCH v10 0/3] mm: security: ro protection for dynamic data

2017-07-10 Thread Igor Stoppa
Hi,
please consider this patch-set for inclusion.

This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is turned into R/O,
all the memory that is part of it, will become R/O.

A R/O pool can be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However the data might need to be released, as part of module unloading.
To do this, the memory must first be freed, then the pool can be destroyed.

An example is provided, showing how to turn into a boot-time option the
writable state of the security hooks.
Prior to this patch, it was a compile-time option.

This is made possible, thanks to Tetsuo Handa's rework of the hooks
structure (included in the patchset).

Changes since the v9 version:
- drop the page flag used to mark pmalloc pages and use page->private and the
  corresponding bit, as a followup to Jerome Glisse's advice to use existing fields.
- introduce non-API header mm/pmalloc_usercopy.h for usercopy test

Question still open:
- should it be possibile to unprotect a pool for rewrite?

The only cases found for this topic are:
- protecting the LSM header structure between creation and insertion of a
  security module that was not built as part of the kernel
  (but the module can protect the headers after it has loaded)

- unloading SELinux from RedHat, if the system has booted, but no policy
  has been loaded yet - this feature is going away, according to Casey.

Regarding the last point, there was a comment from Christoph Hellwig,
for which I asked for clarifications, but it's still pending:

https://marc.info/?l=linux-mm=149863848120692=2


Notes:

- The patch is large-ish, but I was not sure what criteria to use for
  splitting it. If it helps the reviewing, please do let me know how I
  should split it and I will comply.
- I had to rebase Tetsuo Handa's patch because it didn't apply cleanly
  anymore, I would appreciate an ACK to that or a revised patch, whatever 
  comes easier.

Igor Stoppa (2):
  Protectable memory support
  Make LSM Writable Hooks a command line option

Tetsuo Handa (1):
  LSM: Convert security_hook_heads into explicit array of struct
list_head

 arch/Kconfig  |   1 +
 include/linux/lsm_hooks.h | 420 +++---
 include/linux/pmalloc.h   | 127 ++
 lib/Kconfig   |   1 +
 mm/Makefile   |   1 +
 mm/pmalloc.c  | 372 
 mm/pmalloc.h  |  17 ++
 mm/pmalloc_usercopy.h |  38 +
 mm/usercopy.c |  23 ++-
 security/security.c   |  49 --
 10 files changed, 815 insertions(+), 234 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/pmalloc.h
 create mode 100644 mm/pmalloc_usercopy.h

-- 
2.9.3



Re: [PATCH 1/3] Protectable memory support

2017-07-10 Thread Igor Stoppa
On 07/07/17 21:48, Jerome Glisse wrote:

> I believe there are enough unused fields that, for vmalloc pages,
> you should find one you can use. Just add some documentation in
> mm_types.h so people are aware of the alternate use for the field you
> are using.


I ended up using page->private and the corresponding bit.
Because page->private is an opaque field, specifically reserved for the
allocator, I think it should not be necessary to modify mm_types.h

The reworked patch is here:
https://marc.info/?l=linux-mm=149969928920772=2

--
thanks, igor


Re: [PATCH 1/3] Protectable memory support

2017-07-07 Thread Igor Stoppa
On 06/07/17 19:27, Jerome Glisse wrote:
> On Wed, Jul 05, 2017 at 04:46:26PM +0300, Igor Stoppa wrote:

[...]

>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 6b5818d..acc0723 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -81,6 +81,7 @@ enum pageflags {
>>  PG_active,
>>  PG_waiters, /* Page has waiters, check its waitqueue. Must 
>> be bit #7 and in the same byte as "PG_locked" */
>>  PG_slab,
>> +PG_pmalloc,
>>  PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
>>  PG_arch_1,
>>  PG_reserved,
>> @@ -274,6 +275,7 @@ PAGEFLAG(Active, active, PF_HEAD) 
>> __CLEARPAGEFLAG(Active, active, PF_HEAD)
>>  TESTCLEARFLAG(Active, active, PF_HEAD)
>>  __PAGEFLAG(Slab, slab, PF_NO_TAIL)
>>  __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
>> +__PAGEFLAG(Pmalloc, pmalloc, PF_NO_TAIL)
>>  PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems 
>> */
>>  
>>  /* Xen */
> 
> 
> So i don't think we want to waste a page flag on this. The struct 
> page flags field is already full AFAIK (see page-flags-layout.h)

okay, I do not have any specific need to have a page flag, if there is
an equally effective way to identify pages that are served by pmalloc.
I just replicated what seemed to be the typical way.

> Moreover there is an easier way to tag such pages. So my understanding
> is that pmalloc() is always supposed to be in the vmalloc area. 

At least for now, yes.
I need to have some sort of memory-provider backend.
I tried to use a dedicated memory zone and kmalloc [1] but it was
explained to me that it would have been a bad idea.
So I defaulted to vmalloc and so far it didn't rise any objection.

> From
> the look of it all you do is check that there is a valid page behind
> the vmalloc vaddr and you check for the PG_pmalloc flag of that page.
> 
> Why do you need to check the PG_pmalloc flag for the page ? Isn't the
> fact that there is a page behind the vmalloc vaddr enough ? If not
> enough, wouldn't checking the pte flags of the page be enough ? ie if
> the page is read only inside vmalloc then it would be for sure some
> pmalloc area.

I had a similar discussion with Kees Cook [2].
The reason why he asked me to differentiate between pmalloc and vmalloc
is that, from a security perspective, there is a certain amount of
information associated with the fact that a page was obtained through
pmalloc. Checking only for vmalloc would discard such information and
relax the constraint enforced by hardened usercopy.

> Other way to distinguish between regular vmalloc and pmalloc can be
> to carveout a region of vmalloc for pmalloc purpose. Issue is that
> it might be hard to find right size for such carveout.

Yes, I considered that, but I'd prefer to avoid it, because then I
either have to fix the maximum size of such a region or start managing the
creation of pools of pools. I'm not a big fan of such an idea.

> Yet another way is to use some of the free struct page fields, ie
> when a page is allocated for vmalloc I think most of the struct page
> fields are unused (mapping, index, lru, ...). It would be better
> to use those rather than adding a page flag.

Like introducing an unnamed union? Some sort of vmalloc_page_subtype?
If that is what you are proposing, I agree that it would work in a
similar fashion as what I have now, but without introducing the overhead
of the extra page flag.

@Kees: would this be ok from a hardened usercopy perspective?

> Everything else looks good to me, thought i am unsure on how much
> useful such feature is but i am not familiar too much with security
> side of thing.

The other 2 patches from the patchset give an example of how to turn a
compile time decision (locking down after init or not the lsm hooks)
into a boot time option.

I also want to move the SE Linux policy db to use pmalloc as its allocator,
once pmalloc is merged.

But it seemed better to first get pmalloc merged and only afterwards start
the policy db rework.

thanks, igor

[1] https://lkml.org/lkml/2017/5/4/517
[2] https://lkml.org/lkml/2017/5/23/1406


[PATCH 1/3] Protectable memory support

2017-07-05 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 arch/Kconfig   |   1 +
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h| 127 +++
 include/trace/events/mmflags.h |   1 +
 lib/Kconfig|   1 +
 mm/Makefile|   1 +
 mm/pmalloc.c   | 356 +
 mm/usercopy.c  |  24 +--
 8 files changed, 504 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 6c00e5b..9d16b51 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -228,6 +228,7 @@ config GENERIC_IDLE_POLL_SETUP
 
 # Select if arch has all set_memory_ro/rw/x/nx() functions in asm/cacheflush.h
 config ARCH_HAS_SET_MEMORY
+   select GENERIC_ALLOCATOR
bool
 
 # Select if arch init_task initializer is different to init/init_task.c
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..acc0723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -81,6 +81,7 @@ enum pageflags {
PG_active,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
PG_slab,
+   PG_pmalloc,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -274,6 +275,7 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, 
active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
+__PAGEFLAG(Pmalloc, pmalloc, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
 
 /* Xen */
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..a374e5e
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,127 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/*
+ * Library for dynamic allocation of pools of memory that can be,
+ * after initialization, marked as read-only.
+ *
+ * This is intended to complement __read_only_after_init, for those cases
+ * where either it is not possible to know the initialization value before
+ * init is completed, or the amount of data is variable and can be
+ * determined only at runtime.
+ *
+ * ***WARNING***
+ * The user of the API is expected to synchronize:
+ * 1) allocation
+ * 2) writes to the allocated memory
+ * 3) write protection of the pool
+ * 4) freeing of the allocated memory
+ * 5) destruction of the pool
+ *
+ * For a non threaded scenario, this type of locking is not even required.
+ *
+ * Even if the library were to provided support for the locking, point 2)
+ * would still depend on the user to remember taking the lock.
+ *
+ */
+
+
+/**
+ * pmalloc_create_pool - create a new protectable memory pool -
+ * @name: the name of the pool, must be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ *   from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pma

[PATCH v9 0/3] mm: security: ro protection for dynamic data

2017-07-05 Thread Igor Stoppa
Hi,
please consider this patch-set for inclusion.

This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is turned into R/O,
all the memory that is part of it, will become R/O.

A R/O pool can be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However the data might need to be released, as part of module unloading.
To do this, the memory must first be freed, then the pool can be destroyed.

An example is provided, showing how to turn into a boot-time option the
writable state of the security hooks.
Prior to this patch, it was a compile-time option.

This is made possible, thanks to Tetsuo Handa's rework of the hooks
structure (included in the patchset).

Changes since the v8 version:
- do not abuse devres, but manage the pools in a normal list
- added one sysfs attribute, showing the number of chunks in each pool

Question still open:
- should it be possible to unprotect a pool for rewrite?

The only cases found for this topic are:
- protecting the LSM header structure between creation and insertion of a
  security module that was not built as part of the kernel
  (but the module can protect the headers after it has loaded)

- unloading SELinux from RedHat, if the system has booted, but no policy
  has been loaded yet - this feature is going away, according to Casey.

Regarding the last point, there was a comment from Christoph Hellwig,
for which I asked for clarifications, but it's still pending:

https://marc.info/?l=linux-mm&m=149863848120692&w=2


Notes:

- The patch is large-ish, but I was not sure what criteria to use for
  splitting it. If it helps the reviewing, please do let me know how I
  should split it and I will comply.
- I had to rebase Tetsuo Handa's patch because it didn't apply cleanly
  anymore, I would appreciate an ACK to that or a revised patch, whatever 
  comes easier.


Igor Stoppa (2):
  Protectable memory support
  Make LSM Writable Hooks a command line option

Tetsuo Handa (1):
  LSM: Convert security_hook_heads into explicit array of struct
list_head

 arch/Kconfig   |   1 +
 include/linux/lsm_hooks.h  | 420 -
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h| 127 +
 include/trace/events/mmflags.h |   1 +
 lib/Kconfig|   1 +
 mm/Makefile|   1 +
 mm/pmalloc.c   | 356 ++
 mm/usercopy.c  |  24 ++-
 security/security.c|  49 +++--
 10 files changed, 748 insertions(+), 234 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

-- 
2.9.3



[PATCH 2/3] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-07-05 Thread Igor Stoppa
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Commit 3dfc9b02864b19f4 ("LSM: Initialize security_hook_heads upon
registration.") treats "struct security_hook_heads" as an implicit array
of "struct list_head" so that we can eliminate code for static
initialization. Although we haven't encountered compilers which do not
treat sizeof(security_hook_heads) != sizeof(struct list_head) *
(sizeof(security_hook_heads) / sizeof(struct list_head)), Casey does not
like the assumption that a structure of N elements can be assumed to be
the same as an array of N elements.

Now that Kees found that randstruct complains about such casting

  security/security.c: In function 'security_init':
  security/security.c:59:20: note: found mismatched op0 struct pointer
types: 'struct list_head' and 'struct security_hook_heads'

struct list_head *list = (struct list_head *) &security_hook_heads;

and Christoph thinks that we should fix it rather than make randstruct
whitelist it, this patch fixes it.

It would be possible to revert commit 3dfc9b02864b19f4, but this patch
converts security_hook_heads into an explicit array of struct list_head
by introducing an enum, due to reasons explained below.

Igor proposed a sealable memory allocator, and the LSM hooks
("struct security_hook_heads security_hook_heads" and
"struct security_hook_list ...[]") will benefit from that allocator via
protection using set_memory_ro()/set_memory_rw(), and that allocator
will remove CONFIG_SECURITY_WRITABLE_HOOKS config option. Thus, we will
likely be moving to that direction.

This means that these structures will be allocated at run time using
that allocator, and therefore the address of these structures will be
determined at run time rather than compile time.

But currently, LSM_HOOK_INIT() macro depends on the address of
security_hook_heads being known at compile time. If we use an enum
so that LSM_HOOK_INIT() macro does not need to know absolute address of
security_hook_heads, it will help us to use that allocator for LSM hooks.

As a result of introducing an enum, security_hook_heads becomes a local
variable. In order to pass 80 columns check by scripts/checkpatch.pl ,
rename security_hook_heads to hook_heads.
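
In other words (sketch only - the exact definitions are in the diff
below and may differ in detail): the per-hook list heads become an
array indexed by an enum, and LSM_HOOK_INIT() records the index
instead of the address of the head.

enum security_hook_index {
	LSM_binder_set_context_mgr,
	/* ... one enumerator per hook ... */
	LSM_MAX_HOOK_INDEX
};

struct security_hook_list {
	struct list_head		list;
	enum security_hook_index	head;	/* index, not an address */
	union security_list_options	hook;
};

/* no compile-time address of the heads array is needed anymore */
#define LSM_HOOK_INIT(HEAD, HOOK) \
	{ .head = LSM_##HEAD, .hook = { .HEAD = HOOK } }

/* registration then resolves the index at run time, roughly: */
/* list_add_tail_rcu(&hooks[i].list, &hook_heads[hooks[i].head]); */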

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Rebased-by: Igor Stoppa <igor.sto...@huawei.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: Paul Moore <p...@paul-moore.com>
Cc: Stephen Smalley <s...@tycho.nsa.gov>
Cc: Casey Schaufler <ca...@schaufler-ca.com>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Igor Stoppa <igor.sto...@huawei.com>
Cc: Christoph Hellwig <h...@infradead.org>
---
 include/linux/lsm_hooks.h | 420 +++---
 security/security.c   |  31 ++--
 2 files changed, 227 insertions(+), 224 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 3cc9d77..32f30fa 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1694,225 +1694,226 @@ union security_list_options {
 #endif /* CONFIG_AUDIT */
 };
 
-struct security_hook_heads {
-   struct list_head binder_set_context_mgr;
-   struct list_head binder_transaction;
-   struct list_head binder_transfer_binder;
-   struct list_head binder_transfer_file;
-   struct list_head ptrace_access_check;
-   struct list_head ptrace_traceme;
-   struct list_head capget;
-   struct list_head capset;
-   struct list_head capable;
-   struct list_head quotactl;
-   struct list_head quota_on;
-   struct list_head syslog;
-   struct list_head settime;
-   struct list_head vm_enough_memory;
-   struct list_head bprm_set_creds;
-   struct list_head bprm_check_security;
-   struct list_head bprm_secureexec;
-   struct list_head bprm_committing_creds;
-   struct list_head bprm_committed_creds;
-   struct list_head sb_alloc_security;
-   struct list_head sb_free_security;
-   struct list_head sb_copy_data;
-   struct list_head sb_remount;
-   struct list_head sb_kern_mount;
-   struct list_head sb_show_options;
-   struct list_head sb_statfs;
-   struct list_head sb_mount;
-   struct list_head sb_umount;
-   struct list_head sb_pivotroot;
-   struct list_head sb_set_mnt_opts;
-   struct list_head sb_clone_mnt_opts;
-   struct list_head sb_parse_opts_str;
-   struct list_head dentry_init_security;
-   struct list_head dentry_create_files_as;
+enum security_hook_index {
+   LSM_binder_set_context_mgr,
+   LSM_binder_transaction,
+   LSM_binder_transfer_binder,
+   LSM_binder_transfer_file,
+   LSM_ptrace_access_check,
+   LSM_ptrace_traceme,
+   LSM_capget,
+   LSM_capset,
+   LSM_capable,
+   LSM_quotactl,
+   LSM_quota_on,
+   LSM_syslog,
+   LSM_settime,
+   

[PATCH 3/3] Make LSM Writable Hooks a command line option

2017-07-05 Thread Igor Stoppa
This patch shows how it is possible to take advantage of pmalloc:
instead of using the build-time option __lsm_ro_after_init, to decide if
it is possible to keep the hooks modifiable, now this becomes a
boot-time decision, based on the kernel command line.

This patch relies on:

"Convert security_hook_heads into explicit array of struct list_head"
Author: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

to break free from the static constraint imposed by the previous
hardening model, based on __ro_after_init.

The default value is disabled, unless SE Linux debugging is turned on.
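
For reference, the parameter takes an integer (parsed with get_option()
in the diff below), so booting for example with:

	dynamic_lsm=1

leaves the hooks writable (the pool is never protected), while omitting
the option, or passing 0, lets security_init() seal the hooks via
pmalloc_protect_pool() - unless CONFIG_SECURITY_SELINUX_DISABLE flips
the default.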

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 security/security.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index 44c47b6..c7b4670 100644
--- a/security/security.c
+++ b/security/security.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <linux/pmalloc.h>
 #include 
 
 #define MAX_LSM_EVM_XATTR  2
@@ -34,10 +35,19 @@
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX  10
 
-static struct list_head hook_heads[LSM_MAX_HOOK_INDEX]
-   __lsm_ro_after_init;
 static ATOMIC_NOTIFIER_HEAD(lsm_notifier_chain);
 
+static int dynamic_lsm = IS_ENABLED(CONFIG_SECURITY_SELINUX_DISABLE);
+
+static __init int set_dynamic_lsm(char *str)
+{
+   get_option(&str, &dynamic_lsm);
+   return 0;
+}
+early_param("dynamic_lsm", set_dynamic_lsm);
+
+static struct list_head *hook_heads;
+static struct gen_pool *sec_pool;
 char *lsm_names;
 /* Boot-time LSM user choice */
 static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =
@@ -62,6 +72,11 @@ int __init security_init(void)
 {
enum security_hook_index i;
 
+   sec_pool = pmalloc_create_pool("security", PMALLOC_DEFAULT_ALLOC_ORDER);
+   BUG_ON(!sec_pool);
+   hook_heads = pmalloc(sec_pool,
+sizeof(struct list_head) * LSM_MAX_HOOK_INDEX);
+   BUG_ON(!hook_heads);
for (i = 0; i < LSM_MAX_HOOK_INDEX; i++)
INIT_LIST_HEAD(&hook_heads[i]);
pr_info("Security Framework initialized\n");
@@ -77,7 +92,8 @@ int __init security_init(void)
 * Load all the remaining security modules.
 */
do_security_initcalls();
-
+   if (!dynamic_lsm)
+   pmalloc_protect_pool(sec_pool);
return 0;
 }
 
-- 
2.9.3



[PATCH 1/3] Protectable memory support

2017-06-27 Thread Igor Stoppa
From: Igor Stoppa <igor.sto...@gmail.com>

The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 arch/Kconfig   |   1 +
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h| 111 ++
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 lib/Kconfig|   1 +
 lib/genalloc.c |   4 +-
 mm/Makefile|   1 +
 mm/pmalloc.c   | 341 +
 mm/usercopy.c  |  24 +--
 10 files changed, 477 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 6c00e5b..9d16b51 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -228,6 +228,7 @@ config GENERIC_IDLE_POLL_SETUP
 
 # Select if arch has all set_memory_ro/rw/x/nx() functions in asm/cacheflush.h
 config ARCH_HAS_SET_MEMORY
+   select GENERIC_ALLOCATOR
bool
 
 # Select if arch init_task initializer is different to init/init_task.c
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..acc0723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -81,6 +81,7 @@ enum pageflags {
PG_active,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
PG_slab,
+   PG_pmalloc,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -274,6 +275,7 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, 
active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
+__PAGEFLAG(Pmalloc, pmalloc, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
 
 /* Xen */
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..0d65f83
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,111 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/**
+ * pmalloc_create_pool - create a new protectable memory pool -
+ * @name: the name of the pool, must be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ *   from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Returns a pointer to the new pool upon success, otherwise NULL.
+ */
+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);
+
+
+/**
+ * pmalloc_get_pool - get a pool handler, from its name
+ * @name: the name of the pool sought after.
+ *
+ * Returns a pointer to the pool upon success, otherwise NULL.
+ */
+struct gen_pool *pmalloc_get_pool(const char *name);
+
+
+
+/**
+ * pmalloc - allocate protectable memory from a pool
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Allocates memory from an unprotected pool. If the pool doesn't have
+ * enoug

[PATCH v8 0/3] mm: LSM: ro protection for dynamic data

2017-06-27 Thread Igor Stoppa
Hi,
please consider this patch-set for inclusion.

This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a pool is made R/O, all the memory
that is part of it will become R/O.

An R/O pool can be destroyed to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

An example is provided, showing how to turn into a boot-time option the
writable state of the security hooks.
Prior to this patch, it was a compile-time option.

This is made possible, thanks to Tetsuo Handa's rework of the hooks
structure (included in the patchset).

Changes since the v6 version:
- complete rewrite, using the genalloc lib (suggested by Laura Abbott)
- added sysfs interface for tracking of active pools

Changes since the v7 version:
- replaced the use of devices with kobjects for showing info on sysfs


The only question still open is if there should be a possibility for
unprotecting a memory pool in other cases than destruction.

The only cases found for this topic are:
- protecting the LSM header structure between creation and insertion of a
  security module that was not built as part of the kernel
  (but the module can protect the headers after it has loaded)

- unloading SELinux from RedHat, if the system has booted, but no policy
  has been loaded yet - this feature is going away, according to Casey.


Note:

The patch is large-ish, but I was not sure what criteria to use for
splitting it.
If it helps the reviewing, please do let me know how I should split it
and I will comply.



Igor Stoppa (2):
  Protectable memory support
  Make LSM Writable Hooks a command line option

Tetsuo Handa (1):
  LSM: Convert security_hook_heads into explicit array of struct
list_head

 arch/Kconfig   |   1 +
 include/linux/lsm_hooks.h  | 420 -
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h| 111 +++
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 lib/Kconfig|   1 +
 lib/genalloc.c |   4 +-
 mm/Makefile|   1 +
 mm/pmalloc.c   | 341 +
 mm/usercopy.c  |  24 ++-
 security/security.c|  49 +++--
 12 files changed, 721 insertions(+), 236 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

-- 
2.9.3



[PATCH 1/1] Remove hardcoding of ___GFP_xxx bitmasks

2017-04-26 Thread Igor Stoppa
The bitmasks used for ___GFP_xxx can be defined in terms of an enum,
which doesn't require manual updates to its values.

As a bonus, __GFP_BITS_SHIFT is automatically kept consistent.
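
For illustration (___GFP_EXAMPLE is hypothetical, not part of this
patch): adding a new flag after this change only requires a new enum
entry before __GFP_BITS_SHIFT; both the mask value and the total bit
count then follow automatically:

enum gfp_bitmask_shift {
	/* ... existing entries ... */
	__GFP_KSWAPD_RECLAIM_SHIFT,
	__GFP_EXAMPLE_SHIFT,	/* hypothetical new flag */
	__GFP_BITS_SHIFT	/* automatically one larger */
};

#define ___GFP_EXAMPLE	(1u << __GFP_EXAMPLE_SHIFT)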

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 include/linux/gfp.h | 82 +++--
 1 file changed, 55 insertions(+), 27 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0fe0b62..2f894c5 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -14,33 +14,62 @@ struct vm_area_struct;
  * include/trace/events/mmflags.h and tools/perf/builtin-kmem.c
  */
 
+enum gfp_bitmask_shift {
+   __GFP_DMA_SHIFT = 0,
+   __GFP_HIGHMEM_SHIFT,
+   __GFP_DMA32_SHIFT,
+   __GFP_MOVABLE_SHIFT,
+   __GFP_RECLAIMABLE_SHIFT,
+   __GFP_HIGH_SHIFT,
+   __GFP_IO_SHIFT,
+   __GFP_FS_SHIFT,
+   __GFP_COLD_SHIFT,
+   __GFP_NOWARN_SHIFT,
+   __GFP_REPEAT_SHIFT,
+   __GFP_NOFAIL_SHIFT,
+   __GFP_NORETRY_SHIFT,
+   __GFP_MEMALLOC_SHIFT,
+   __GFP_COMP_SHIFT,
+   __GFP_ZERO_SHIFT,
+   __GFP_NOMEMALLOC_SHIFT,
+   __GFP_HARDWALL_SHIFT,
+   __GFP_THISNODE_SHIFT,
+   __GFP_ATOMIC_SHIFT,
+   __GFP_ACCOUNT_SHIFT,
+   __GFP_NOTRACK_SHIFT,
+   __GFP_DIRECT_RECLAIM_SHIFT,
+   __GFP_WRITE_SHIFT,
+   __GFP_KSWAPD_RECLAIM_SHIFT,
+   __GFP_BITS_SHIFT
+};
+
+
 /* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA 0x01u
-#define ___GFP_HIGHMEM 0x02u
-#define ___GFP_DMA32   0x04u
-#define ___GFP_MOVABLE 0x08u
-#define ___GFP_RECLAIMABLE 0x10u
-#define ___GFP_HIGH0x20u
-#define ___GFP_IO  0x40u
-#define ___GFP_FS  0x80u
-#define ___GFP_COLD0x100u
-#define ___GFP_NOWARN  0x200u
-#define ___GFP_REPEAT  0x400u
-#define ___GFP_NOFAIL  0x800u
-#define ___GFP_NORETRY 0x1000u
-#define ___GFP_MEMALLOC0x2000u
-#define ___GFP_COMP0x4000u
-#define ___GFP_ZERO0x8000u
-#define ___GFP_NOMEMALLOC  0x10000u
-#define ___GFP_HARDWALL0x20000u
-#define ___GFP_THISNODE0x40000u
-#define ___GFP_ATOMIC  0x80000u
-#define ___GFP_ACCOUNT 0x100000u
-#define ___GFP_NOTRACK 0x200000u
-#define ___GFP_DIRECT_RECLAIM  0x400000u
-#define ___GFP_WRITE   0x800000u
-#define ___GFP_KSWAPD_RECLAIM  0x1000000u
-/* If the above are modified, __GFP_BITS_SHIFT may need updating */
+#define ___GFP_DMA (1u << __GFP_DMA_SHIFT)
+#define ___GFP_HIGHMEM (1u << __GFP_HIGHMEM_SHIFT)
+#define ___GFP_DMA32   (1u << __GFP_DMA32_SHIFT)
+#define ___GFP_MOVABLE (1u << __GFP_MOVABLE_SHIFT)
+#define ___GFP_RECLAIMABLE (1u << __GFP_RECLAIMABLE_SHIFT)
+#define ___GFP_HIGH(1u << __GFP_HIGH_SHIFT)
+#define ___GFP_IO  (1u << __GFP_IO_SHIFT)
+#define ___GFP_FS  (1u << __GFP_FS_SHIFT)
+#define ___GFP_COLD(1u << __GFP_COLD_SHIFT)
+#define ___GFP_NOWARN  (1u << __GFP_NOWARN_SHIFT)
+#define ___GFP_REPEAT  (1u << __GFP_REPEAT_SHIFT)
+#define ___GFP_NOFAIL  (1u << __GFP_NOFAIL_SHIFT)
+#define ___GFP_NORETRY (1u << __GFP_NORETRY_SHIFT)
+#define ___GFP_MEMALLOC(1u << __GFP_MEMALLOC_SHIFT)
+#define ___GFP_COMP(1u << __GFP_COMP_SHIFT)
+#define ___GFP_ZERO(1u << __GFP_ZERO_SHIFT)
+#define ___GFP_NOMEMALLOC  (1u << __GFP_NOMEMALLOC_SHIFT)
+#define ___GFP_HARDWALL(1u << __GFP_HARDWALL_SHIFT)
+#define ___GFP_THISNODE(1u << __GFP_THISNODE_SHIFT)
+#define ___GFP_ATOMIC  (1u << __GFP_ATOMIC_SHIFT)
+#define ___GFP_ACCOUNT (1u << __GFP_ACCOUNT_SHIFT)
+#define ___GFP_NOTRACK (1u << __GFP_NOTRACK_SHIFT)
+#define ___GFP_DIRECT_RECLAIM  (1u << __GFP_DIRECT_RECLAIM_SHIFT)
+#define ___GFP_WRITE   (1u << __GFP_WRITE_SHIFT)
+#define ___GFP_KSWAPD_RECLAIM  (1u << __GFP_KSWAPD_RECLAIM_SHIFT)
 
 /*
  * Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -180,7 +209,6 @@ struct vm_area_struct;
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 25
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
-- 
2.9.3



[PATCH 0/1] mm: Improve consistency of ___GFP_xxx masks

2017-04-26 Thread Igor Stoppa
The GFP bitmasks and the __GFP_BITS_SHIFT defines are expressed as
hardcoded constants.
This can be expressed in a more consistent way by relying on an enum of
shift positions.

Igor Stoppa (1):
  Remove hardcoding of ___GFP_xxx bitmasks

 include/linux/gfp.h | 82 +++--
 1 file changed, 55 insertions(+), 27 deletions(-)

-- 
2.9.3



Re: [PATCH 1/1] Remove hardcoding of ___GFP_xxx bitmasks

2017-04-26 Thread Igor Stoppa


On 26/04/17 17:47, Michal Hocko wrote:
> On Wed 26-04-17 16:35:49, Igor Stoppa wrote:
>> The bitmasks used for ___GFP_xxx can be defined in terms of an enum,
>> which doesn't require manual updates to its values.
> 
> GFP masks are rarely updated so why is this worth doing?

I have plans for that [1] - yeah, I should have not written only to ml -
but I thought there was sufficient value in this patch to be sent alone.

I got into this part of the code because (if I understood correctly)
there are no spare bits available in the 32-bit mask that is currently
in use.

Adding a new zone, therefore, would force bumping it to a 64-bit type.
If the zone is not strictly needed, some people might prefer to have the
option to stick to 32 bits.

Which in turn would mean more #ifdefs.

Using the enum should provide the same flexibility with a more limited
number of #ifdefs in the code.

But I really thought that there is a value in the change per-se.
Regardless of what other patches might follow.


>> As bonus, __GFP_BITS_SHIFT is automatically kept consistent.
> 
> this alone doesn't sound like a huge win to me, to be honest. We already
> have ___GFP_$FOO and __GFP_FOO you are adding __GFP_$FOO_SHIFT. This is
> too much IMHO.

I do not like the #defines being floating and potentially inconsistent
with the rest, when they are supposed to represent all the individual
bits in a bitmask.
One might argue that an error will be detected fairly soon, but to me
using an enum to automatically manage the values and counter of items
seems preferable.

> Also the current mm tree has ___GFP_NOLOCKDEP which is not addressed
> here so I suspect you have based your change on the Linus tree.

I used your tree from kernel.org - I asked yesterday on #mm if it was a
good idea and was told that it should be ok, so I did it, but I can redo
the patch with mm.


If you prefer to have this patch only as part of the larger patchset,
I'm also fine with it.
Also, if you could reply to [1], that would be greatly appreciated.

Maybe I'm starting from some wrong assumption or there is a better way
to achieve what I want.


thanks, igor

[1] http://marc.info/?l=linux-mm&m=149276346722464&w=2


Re: [PATCH 1/1] Remove hardcoding of ___GFP_xxx bitmasks

2017-04-27 Thread Igor Stoppa


On 27/04/17 16:41, Michal Hocko wrote:
> On Wed 26-04-17 18:29:08, Igor Stoppa wrote:
> [...]
>> If you prefer to have this patch only as part of the larger patchset,
>> I'm also fine with it.
> 
> I agree that the situation is not ideal. If a larger set of changes
> would benefit from this change then it would clearly add arguments...

Ok, then I'll send it out as part of the larger RFC set.


>> Also, if you could reply to [1], that would be greatly appreciated.
> 
> I will try to get to it but from a quick glance, yet-another-zone will
> hit a lot of opposition...

The most basic questions, that I hope can be answered with Yes/No =) are:

- should a new zone be added after DMA32?

- should I try hard to keep the mask fitting a 32-bit word - at least for
those who do not use the new zone - or is it ok to just stretch it to 64
bits?



If you could answer these, then I'll have a better idea of what I need
to do to.

TIA, igor



Question on ___GFP_NOLOCKDEP - Was: Re: [PATCH 1/1] Remove hardcoding of ___GFP_xxx bitmasks

2017-04-27 Thread Igor Stoppa
On 26/04/17 18:29, Igor Stoppa wrote:

> On 26/04/17 17:47, Michal Hocko wrote:

[...]

>> Also the current mm tree has ___GFP_NOLOCKDEP which is not addressed
>> here so I suspect you have based your change on the Linus tree.

> I used your tree from kernel.org

I found it, I was using master, instead of auto-latest (is it correct?)
But now I see something that I do not understand (apologies if I'm
asking something obvious).

First there is:

[...]
#define ___GFP_WRITE   0x800000u
#define ___GFP_KSWAPD_RECLAIM  0x1000000u
#ifdef CONFIG_LOCKDEP
#define ___GFP_NOLOCKDEP   0x4000000u
#else
#define ___GFP_NOLOCKDEP   0
#endif

Then:

/* Room for N __GFP_FOO bits */
#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))



Shouldn't it be either:
___GFP_NOLOCKDEP   0x2000000u

or:

#define __GFP_BITS_SHIFT (25 + 2 * IS_ENABLED(CONFIG_LOCKDEP))


thanks, igor


Re: [PATCH 1/1] Remove hardcoding of ___GFP_xxx bitmasks

2017-04-28 Thread Igor Stoppa


On 28/04/17 10:40, Michal Hocko wrote:

> Do not add a new zone, really. What you seem to be looking for is an
> allocator on top of the page/memblock allocator which does write
> protection on top. I understand that you would like to avoid object
> management duplication but I am not really sure how much you can re-use
> what slab allocators do already, anyway. I will respond to the original
> thread to not mix things together.

I'm writing an alternative proposal; let's call it a last attempt.

Should be ready in a few minutes.

thanks, igor


Generic approach to customizable zones - was: Re: [PATCH v7 0/7] Introduce ZONE_CMA

2017-04-28 Thread Igor Stoppa
On 27/04/17 18:06, Michal Hocko wrote:
> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:

[...]

>> Yes, it requires one more bit for a new zone and it's handled by the patch.
> 
> I am pretty sure that you are aware that consuming new page flag bits
> is usually a no-go and something we try to avoid as much as possible
> because we are in a great shortage there. So there really have to be a
> _strong_ reason if we go that way. My current understanding that the
> whole zone concept is more about a more convenient implementation rather
> than a fundamental change which will solve unsolvable problems with the
> current approach. More on that below.

Since I am in a similar situation, I think it's better if I join this
conversation instead of going through the same in a separate thread.

In this regard, I have a few observations (are they correct?):

* not everyone seems to be interested in having all the current
  zones active simultaneously

* some zones are even not so meaningful on certain architectures or
  platforms

* some architectures/platforms that are 64 bits would have no penalty
  in dealing with a larger data type.

So I wonder, would anybody be against this:

* within the 32bits constraint, define some optional zones

* decouple the specific position of a bit from the zone it represents;
  iow: if the zone is enabled, ensure that it gets a bit in the mask,
  but do not make promises about which one it is, provided that the
  corresponding macros work properly

* ensure that if one selects more optional zones than there are bits
  available (in the case of a 32bits mask), an error is produced at
  compile time

* if one is happy to have a 64bits type, allow for as many zones as
  it's possible to fit, or anyway more than what is possible with
  the 32 bit mask.

I think I can re-factor the code so that there is no runtime performance
degradation, if there is no immediate objection to what I described. Or
maybe I failed to notice some obvious pitfall?

From what I see, there seems to be a lot of interest in using functions
like Kmalloc / vmalloc, with the ability of specifying pseudo-custom
areas from where they should tap into.

Why not, as long as those who do not need it are not negatively impacted?

I understand that if the association between bits and zones is fixed,
then suddenly bits become very precious stuff, but if they could be used
in a more efficient way, then maybe they could be used more liberally.

The alternative is to keep getting requests about new zones and turning
them away because they do not pass the bar of being extremely critical,
even if indeed they would simplify people's lives.


The change shouldn't be too ugly, if I do something along these lines of
the pseudo code below.
Note: the #ifdefs would be mainly concentrated in the declaration part.

enum gfp_zone_shift {
#if IS_ENABLED(CONFIG_ZONE_DMA)
/*I haven't checked if this is the correct name, but it gives the idea*/
ZONE_DMA_SHIFT = 0,
#endif
#if IS_ENABLED(CONFIG_ZONE_HIGHMEM)
ZONE_HIGHMEM_SHIFT,
#endif
#if IS_ENABLED(CONFIG_ZONE_DMA32)
ZONE_DMA32_SHIFT,
#endif
#if IS_ENABLED(CONFIG_ZONE_xxx)
ZONE_xxx_SHIFT,
#endif
	NON_OPTIONAL_ZONE_SHIFT,
	...
	USED_ZONES_NUMBER,
	ZONE_MOVABLE_SHIFT = USED_ZONES_NUMBER,
	...
};

#if USED_ZONES_NUMBER < MAX_ZONES_32BITS
typedef uint32_t gfp_zones_t;
#elif IS_ENABLED(CONFIG_ZONES_64BITS)
typedef uint64_t gfp_zones_t;
#else
#error "too many zones enabled for a 32-bit gfp_zones_t"
#endif

The type should be adjusted in other places where it is used, but I
didn't find too many occurrences.

#define __ZONE_DMA \
  (((gfp_zones_t)IS_ENABLED(CONFIG_ZONE_DMA)) << \
   (ZONE_DMA_SHIFT - 0))

[rinse and repeat]

Code referring to these optional zones can be sandboxed in

#if IS_ENABLED(CONFIG_ZONE_DMA)

static inline void do_something_dma(void)
{
	...
}

#else
#define do_something_dma()
#endif


Or equivalent, effectively removing many #ifdefs from the main code of
functions like those called by kmalloc.


So, would this approach stand a chance?


thanks, igor


Re: [PATCH 1/1] Remove hardcoding of ___GFP_xxx bitmasks

2017-04-28 Thread Igor Stoppa
On 28/04/17 10:43, Igor Stoppa wrote:

[...]

> I'm writing an alternative different proposal, let's call it last attempt.
> 
> Should be ready in a few minutes.

Here: http://marc.info/?l=linux-mm&m=149336675129967&w=2

--
thanks, igor



Re: Generic approach to customizable zones - was: Re: [PATCH v7 0/7] Introduce ZONE_CMA

2017-04-28 Thread Igor Stoppa


On 28/04/17 11:36, Michal Hocko wrote:
> I didn't read this thoughly yet because I will be travelling shortly

ok, thanks for bearing with me =)

> but
> this point alone just made ask, because it seems there is some
> misunderstanding

It is possible, so far I did some changes, but I have not completed the
whole conversion.

> On Fri 28-04-17 11:04:27, Igor Stoppa wrote:
> [...]
>> * if one is happy to have a 64bits type, allow for as many zones as
>>   it's possible to fit, or anyway more than what is possible with
>>   the 32 bit mask.
> 
> zones are currently placed in struct page::flags. And that already is
> 64b size on 64b arches. 

Ok, the issues I had so far were related to the enum for zones being
treated as 32b.

> And we do not really have any room spare there.
> We encode page flags, zone id, numa_nid/sparse section_nr there. How can
> you add more without enlarging the struct page itself or using external
> means to store the same information (page_ext comes to mind)?

Then I'll be conservative and assume I can't, unless I can prove otherwise.

There is still the possibility I mentioned of loosely coupling DMA,
DMA32 and HIGHMEM with the bits currently reserved for them, right?

If my system doesn't use those zones as such, because it doesn't
have/need them, those bits are wasted for me. Otoh someone else is
probably not interested in what I'm after but needs one or more of those
zones.

Making the meaning of the bits configurable should still be a viable
option. It's not altering their amount, just their purpose on a specific
build.

> Even if
> the latter would be possible then note that page_zone() is used in many
> performance sensitive paths and making it perform well with special
> casing would be far from trivial.


If the solution I propose is acceptable, I'm willing to bite the bullet
and go for implementing the conversion.

In my case I really would like to be able to use kmalloc, because it
would provide an easy path to convert also other portions of the kernel,
besides SE Linux.

I suspect I would encounter overall far less resistance if the type of
change I propose is limited to:

s/GFP_KERNEL/GFP_LOCKABLE/

And if I can guarantee that GFP_LOCKABLE falls back to GFP_KERNEL when
the "lockable" feature is not enabled.


--
thanks, igor


Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-09 Thread Igor Stoppa


On 09/08/17 02:15, Jerome Glisse wrote:
> On Tue, Aug 08, 2017 at 03:59:36PM +0300, Igor Stoppa wrote:

[...]

>> I am tempted to add
>>
>> #define VM_PMALLOC   0x0100

[...]

> VM_PMALLOC sounds fine to me also adding a comment there pointing to
> pmalloc documentation would be a good thing to do. The above are flags
> that are use only inside vmalloc context and so there is no issue
> here of conflicting with other potential user.

ok, will do

>>
>> Unless it's acceptable to check the private field in the page struct.
>> It would bear the pmalloc magic number.
> 
> I thought you wanted to do:
>   check struct page mapping field
>   check vmap->flags for VM_PMALLOC
> 
> bool is_pmalloc(unsigned long addr)
> {
> struct page *page;
> struct vm_struct *vm_struct;
> 
> if (!is_vmalloc_addr(addr))
> return false;
> page = vmalloc_to_page(addr);
> if (!page)
> return false;
> if (page->mapping != pmalloc_magic_key)

page->private  ?
I thought mapping would not work in the cases you mentioned?

> return false;
> 
> vm_struct = find_vm_area(addr);
> if (!vm_struct)
> return false;
> 
> return vm_struct->flags & VM_PMALLOC;
> }
> 
> Did you change your plan ?

No, the code I have is almost 1:1 what you wrote.
Apart from mapping <=> private

In my previous mail I referred to page->private.

Maybe I was not very clear in what I wrote, but I'm almost 100% aligned
with your snippet.

>> I'm thinking to use a pointer to one of pmalloc data items, as signature.
> 
> What ever is easier for you. Note that dereferencing such pointer before
> asserting this is really a pmalloc page would be hazardous.

Yes, it's not even needed in this scenario.
It was just a way to ensure that it would be a value that cannot come
up accidentally as a pointer value, since it is already taken.

--
igor


[RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-02 Thread Igor Stoppa
Hi,
while I am working on another example of using pmalloc [1],
it was pointed out to me that:

1) I had introduced a bug when I switched to using a field of the page
structure [2]

2) I was also committing a layer violation in the way I was tagging the
pages.

I am seeking help to understand what would be the correct way to do the
tagging.

Here are snippets describing the problems:


1) from pmalloc.c:

...

+static const unsigned long pmalloc_signature = (unsigned
long)_mutex;

...

+int __pmalloc_tag_pages(void *base, const size_t size, const bool set_tag)
+{
+   void *end = base + size - 1;
+
+   do {
+   struct page *page;
+
+   if (!is_vmalloc_addr(base))
+   return -EINVAL;
+   page = vmalloc_to_page(base);
+   if (set_tag) {
+   BUG_ON(page_private(page) || page->private);
+   set_page_private(page, 1);
+   page->private = pmalloc_signature;
+   } else {
+   BUG_ON(!(page_private(page) &&
+page->private == pmalloc_signature));
+   set_page_private(page, 0);
+   page->private = 0;
+   }
+   base += PAGE_SIZE;
+   } while ((PAGE_MASK & (unsigned long)base) <=
+(PAGE_MASK & (unsigned long)end));
+   return 0;
+}

...

+static const char msg[] = "Not a valid Pmalloc object.";
+const char *pmalloc_check_range(const void *ptr, unsigned long n)
+{
+   unsigned long p;
+
+   p = (unsigned long)ptr;
+   n = p + n - 1;
+   for (; (PAGE_MASK & p) <= (PAGE_MASK & n); p += PAGE_SIZE) {
+   struct page *page;
+
+   if (!is_vmalloc_addr((void *)p))
+   return msg;
+   page = vmalloc_to_page((void *)p);
+   if (!(page && page_private(page) &&
+ page->private == pmalloc_signature))
+   return msg;
+   }
+   return NULL;
+}


The problem here comes from the way I am using page->private:
the fact that the page is marked as private means only that someone is
using it, and the way it is used could create (spoiler: it happens) a
collision with pmalloc_signature, which can generate false positives.

A way to ensure that the address really belongs to pmalloc would be to
pre-screen it against either the signature or some magic number and,
if that test passes, then compare the address against those actually
available in the pmalloc pools.

This would be slower, but it would be limited only to those cases where
the signature/magic number matches and the answer is likely to be true.

2) However, both the current (incorrect) implementation and the one I am
considering, are abusing something that should be used otherwise (see
the following snippet):

from include/linux/mm_types.h:

struct page {
...
  union {
unsigned long private;  /* Mapping-private opaque data:
 * usually used for buffer_heads
 * if PagePrivate set; used for
 * swp_entry_t if PageSwapCache;
 * indicates order in the buddy
 * system if PG_buddy is set.
 */
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
#endif
struct kmem_cache *slab_cache;  /* SL[AU]B: Pointer to slab */
};
...
}


The "private" field is meant for mapping-private opaque data, which is
not how I am using it.

Yet it seems the least harmful field to choose.
Is this acceptable?
Otherwise, what would be the best course of action?


thanks, igor


[1] https://lkml.org/lkml/2017/7/10/400
[2] https://lkml.org/lkml/2017/7/6/573


Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-08 Thread Igor Stoppa
On 07/08/17 22:12, Jerome Glisse wrote:
> On Mon, Aug 07, 2017 at 05:13:00PM +0300, Igor Stoppa wrote:

[...]

>> I have an updated version of the old proposal:
>>
>> * put a magic number in the private field, during initialization of
>> pmalloc pages
>>
>> * during hardened usercopy verification, when I have to assess if a page
>> is of pmalloc type, compare the private field against the magic number
>>
>> * if and only if the private field matches the magic number, then invoke
>> find_vm_area(), so that the slowness affects only a possibly limited
>> amount of false positives.
> 
> This all sounds good to me.

ok, I still have one doubt wrt defining the flag.
Where should I do it?

vmalloc.h has the following:

/* bits in flags of vmalloc's vm_struct below */
#define VM_IOREMAP  0x0001  /* ioremap() and friends
*/
#define VM_ALLOC0x0002  /* vmalloc() */
#define VM_MAP  0x0004  /* vmap()ed pages */
#define VM_USERMAP  0x0008  /* suitable for
   remap_vmalloc_range
*/
#define VM_UNINITIALIZED0x0020  /* vm_struct is not
   fully initialized */
#define VM_NO_GUARD 0x0040  /* don't add guard page
*/
#define VM_KASAN0x0080  /* has allocated kasan
shadow memory */
/* bits [20..32] reserved for arch specific ioremap internals */



I am tempted to add

#define VM_PMALLOC  0x0100

But would it be acceptable, to mention pmalloc into vmalloc?

Should I name it VM_PRIVATE bit, instead?

Using VM_PRIVATE would avoid contaminating vmalloc with something that
depends on it (like VM_PMALLOC would do).

But using VM_PRIVATE will likely add tracking issues, if someone else
wants to use the same bit and it's not clear who is the user, if any.

Unless it's acceptable to check the private field in the page struct.
It would bear the pmalloc magic number.

I'm thinking to use a pointer to one of pmalloc data items, as signature.


--
igor


Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-07 Thread Igor Stoppa
On 04/08/17 11:12, Michal Hocko wrote:
> On Fri 04-08-17 11:02:46, Igor Stoppa wrote:

[...]

>> struct page {
>>   /* First double word block */
>>   unsigned long flags;   /* Atomic flags, some possibly
>>   * updated asynchronously */
>> union {
>>  struct address_space *mapping;  /* If low bit clear, points to
>>   * inode address_space, or NULL.
>>   * If page mapped as anonymous
>>   * memory, low bit is set, and
>>   * it points to anon_vma object:
>>   * see PAGE_MAPPING_ANON below.
>>   */
>> ...
>> }
>>
>> mapping seems to be used exclusively in 2 ways, based on the value of
>> its lower bit.
> 
> Not really. The above applies to LRU pages. Please note that Slab pages
> use s_mem and huge pages use compound_mapcount. If vmalloc pages are
> using none of those already you can add a new field there.

Yes, both from reading the code and some experimentation, it seems that
vmalloc is not using either field.

I'll add a vm_area field as you advised.

Is this something I could send as standalone patch?

--
thank you, igor



Re: [RFC] Tagging of vmalloc pages for supporting the pmalloc allocator

2017-08-07 Thread Igor Stoppa


On 07/08/17 16:31, Jerome Glisse wrote:
> On Mon, Aug 07, 2017 at 02:26:21PM +0300, Igor Stoppa wrote:

[...]

>> I'll add a vm_area field as you advised.
>>
>> Is this something I could send as standalone patch?
> 
> Note that vmalloc() is not the only thing that use vmalloc address
> space. There is also vmap() and i know one set of drivers that use
> vmap() and also use the mapping field of struct page namely GPU
> drivers.

Ah, yes, you mentioned this.

> So like i said previously i would store a flag inside vm_struct to
> know if page you are looking at are pmalloc or not.

And I was planning to follow your advice, using one of the flags.
But ...

> Again do you
> need to store something per page ? Would storing it per vm_struct
> not be enough ?

... there was this further comment, about speeding up the access to
vm_area, which seemed good from performance perspective.

---8<--8<--8<--8<--8<---
On 03/08/17 14:48, Michal Hocko wrote:
> On Thu 03-08-17 13:11:45, Igor Stoppa wrote:

[...]

>> But, to reply more specifically to your advice, yes, I think I could
>> add a flag to vm_struct and then retrieve its value, for the address
>> being processed, by passing through find_vm_area().
>
> ... and you can store vm_struct pointer to the struct page there and
> you won't need to do the slow find_vm_area. I haven't checked very
> closely but this should be possible in principle. I guess other
> callers might benefit from this as well.
---8<--8<--8<--8<--8<---

I do not strictly need to modify the page struct, but it seems it might
harm performance, if it is added on the path of hardened usercopy.

I have an updated version of the old proposal:

* put a magic number in the private field, during initialization of
pmalloc pages

* during hardened usercopy verification, when I have to assess if a page
is of pmalloc type, compare the private field against the magic number

* if and only if the private field matches the magic number, then invoke
find_vm_area(), so that the slowness affects only a possibly limited
amount of false positives (a rough sketch follows below).
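
A rough sketch of that check, assuming the pages are tagged with a
pmalloc magic value in page->private and the vm_struct carries a
VM_PMALLOC flag (names as discussed in this thread, not final):

#include <linux/mm.h>
#include <linux/vmalloc.h>

bool is_pmalloc_object(const void *ptr)
{
	struct page *page;
	struct vm_struct *area;

	if (!is_vmalloc_addr(ptr))
		return false;

	page = vmalloc_to_page(ptr);
	/* cheap pre-screening: most non-pmalloc pages bail out here */
	if (!page || page_private(page) != pmalloc_signature)
		return false;

	/* slow path, only for the few candidates that passed above */
	area = find_vm_area(ptr);
	return area && (area->flags & VM_PMALLOC);
}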


--
igor


[RFC] memory allocations in genalloc

2017-08-17 Thread Igor Stoppa
Foreword:
If I should direct this message to someone else, please let me know.
I couldn't get a clear idea, by looking at both MAINTAINERS and git blame.



Hi,

I'm currently trying to convert the SE Linux policy db into using a
protectable memory allocator (pmalloc) that I have developed.

Such allocator is based on genalloc: I had come up with an
implementation that was pretty similar to what genalloc already does, so
it was pointed out to me that I could have a look at it.

And, indeed, it seemed a perfect choice.

But ... when freeing memory, genalloc wants that the caller also states
how large each specific memory allocation is.

This, per se, is not an issue, although genalloc doesn't seem to check
if the memory being freed is really matching a previous allocation request.

However, this design doesn't sit well with the use case I have in mind.

In particular, when the SE Linux policy db is populated, the creation of
one or more specific entry of the db might fail.
In this case, the memory previously allocated for said entry, is
released with kfree, which doesn't need to know the size of the chunk
being freed.

I would like to add similar capability to genalloc.

genalloc already uses bitmaps to track which words are allocated (1) and
which are free (0).

What I would like to do is to add another bitmap, which would track the
beginning of each individual allocation (1 on the first allocation unit
of each allocation, 0 otherwise).

Such an enhancement would also enable the detection of calls to free with
incorrect / misaligned addresses - right now it is possible to
successfully free a memory area that overlaps the boundary between two
adjacent allocations, without fully covering either of them.
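
A self-contained sketch of the idea (plain C, not genalloc's actual
data structures): one bitmap marks the units in use, a second one marks
the first unit of each allocation, so free() can both validate the
address and recover the size:

#include <stdbool.h>
#include <stddef.h>

#define POOL_UNITS 1024

static bool used[POOL_UNITS];	/* 1 = unit allocated */
static bool start[POOL_UNITS];	/* 1 = first unit of an allocation */

/* mark an allocation of 'n' units beginning at 'first' */
static void mark_alloc(size_t first, size_t n)
{
	for (size_t i = first; i < first + n; i++)
		used[i] = true;
	start[first] = true;
}

/* free by address only: reject misaligned frees, derive the size */
static int mark_free(size_t first)
{
	if (first >= POOL_UNITS || !start[first])
		return -1;	/* not the beginning of an allocation */
	start[first] = false;
	for (size_t i = first; i < POOL_UNITS && used[i]; i++) {
		if (i > first && start[i])
			break;	/* the next allocation starts here */
		used[i] = false;
	}
	return 0;
}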

Would this change be acceptable?
Is there any better way to achieve what I want?


---

I have also a question wrt the use of spinlocks in genalloc.
Why a spinlock?

Freeing a chunk of memory previously allocated with vmalloc requires
invoking vfree_atomic, instead of vfree, because the list of chunks is
walked with the spinlock held, and vfree can sleep.

Why not using a mutex?


--
TIA, igor


Re: [kernel-hardening] [RFC] memory allocations in genalloc

2017-08-18 Thread Igor Stoppa
Hi,

On 18/08/17 16:57, Laura Abbott wrote:
> Again, if you have a specific patch or
> proposal this would be easier to review.


yes, I'm preparing it and will send it out soon,
but it was somehow surprising to me that it was chosen to implement free
with the size parameter.

It made me think that I was overlooking some obvious reason behind the
choice :-S

--
thanks for answering, igor


RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-03 Thread Igor Stoppa
Hello,

please review my (longish) line of thoughts, below.

I've restructured them so that they should be easier to follow.


Observations


* it is currently possible, by using prefix "__read_only", to have the
linker place a static variable into a special memory region, which will
become write-protected at the end of the init phase.

* the purpose is to write-protect data which is not expected to change,
ever, after it has been initialized.

* The mechanism used for locking down the memory region is to program
the MMU to trap writes to said region. It is fairly efficient and
HW-backed, so it doesn't introduce any major overhead, but the MMU deals
only with pages or supersets of pages, hence the need to collect all the
soon-to-be-read-only data - and only that - into the "special region".
The "__read_only" modifier is the admission ticket.

* the write-protecting feature helps support memory integrity in
general and can also help spot rogue writes, whatever their origin
might be: uninitialized or expired pointers, wrong pointer arithmetic, etc.
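
For contrast, this is what the static case looks like (using the
mainline spelling of the attribute, __ro_after_init; the variable here
is just an example):

#include <linux/cache.h>

/* written only during boot, write-protected at the end of init */
static unsigned long example_boot_setting __ro_after_init;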



Problem
---

The feature is available only for *static* data - it will not work with
something like a linked list that is put together during init, for example.



Wish


My starting point are the policy DB of SE Linux and the LSM Hooks, but
eventually I would like to extend the protection also to other
subsystems, in a way that can be merged into mainline.



Analysis


* the solution I come up with has to be as little invasive as possible,
at least for what concerns the various subsystems whose integrity I want
to enhance.

* In most, if not all, the cases that could be enhanced, the code will
be calling kmalloc/vmalloc, indicating GFP_KERNEL as the desired type of
memory.

* I suspect/hope that the various maintainer won't object too much if my
changes are limited to replacing GFP_KERNEL with some other macro, for
example what I previously called GFP_LOCKABLE, provided I can ensure that:

  -1) no penalty is introduced, at least when the extra protection
  feature is not enabled, iow nobody has to suffer from my changes.
  This means that GFP_LOCKABLE should fall back to GFP_KERNEL, when
  it's not enabled (a minimal sketch of this fallback follows at the
  end of this analysis).

  -2) when the extra protection feature is enabled, the code still
  works as expected, as long as the data identified for this
  enhancement is really unmodified after init.

* In my quest for improved memory integrity, I will deal with very
different memory size being allocated, so if I start writing my own
memory allocator, starting from a page-aligned chunk of normal memory,
at best I will end up with a replica of kmalloc, at worst with something
buggy. Either way, it will be extremely harder to push other subsystems
to use it.
I probably wouldn't like it either, if I was a maintainer.

* While I do not strictly need a new memory zone, memory zones are what
kmalloc understands at the moment: AFAIK, it is not possible to tell
kmalloc from which memory pool it should fish out the memory, other than
having a reference to a memory zone.
If it was possible to aim kmalloc at arbitrary memory pools, probably we
would not be having this exchange right now. And probably there would
not be so many other folks trying to have their memory zone of interest
being merged. However I suspect this solution would be sub-optimal for
the normal use cases, because there would be the extra overhead of
passing the reference to the memory pool, instead of encoding it into
bitfields, together with other information.

* there are very slim chances (to be optimistic :) that I can get away
with having my custom zone merged, because others are trying with
similar proposals and they get rejected, so maybe I can have better luck
if I propose something that can also work for others.

* currently memory zones are mapped 1:1 to bits in a crowded bitmask,
but not all these zones are really needed in a typical real system, some
are kept for backward compatibility and supporting distros, which cannot
know upfront the quirks of the HW they will be running on.
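
A minimal sketch of the GFP_LOCKABLE fallback mentioned above (both the
flag and the config symbol are placeholders, not existing kernel
symbols):

#ifdef CONFIG_PROTECTABLE_MEMORY		/* placeholder config symbol */
#define GFP_LOCKABLE	(GFP_KERNEL | ___GFP_LOCKABLE)
#else
#define GFP_LOCKABLE	GFP_KERNEL	/* no-op when the feature is off */
#endif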


Conclusions
---

* the solution that seems to be more likely to succeed is to remove the
1:1 mapping between optional zones and their respective bits.

* the bits previously assigned to the optional zones would become
available for mapping whatever zone a system integrator wants to support.


Cons:
There would be still a hard constraint on the maximum number of zones
available simultaneously, so one will have to choose which of the
optional zones to enable, and be ready to deal with own zone
disappearing (ex: fall back to normal memory and give up some/all
functionality)

Pros:
* No bit would go to waste: those who want to have their own custom zone
could make better use of the allocated-but-not-necessary-to-them bits.
* There would be a standard way for people to add non-standard zones.
* It doesn't alter the hot paths that are critical to efficient memory
handling.

So 

Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-10 Thread Igor Stoppa
On 10/05/17 11:05, Michal Hocko wrote:
> On Fri 05-05-17 13:42:27, Igor Stoppa wrote:

[...]

>> ... in the case I have in mind, I have various, heterogeneous chunks of
>> data, coming from various subsystems, not necessarily page aligned.
>> And, even if they were page aligned, most likely they would be far
>> smaller than a page, even a 4k page.
> 
> This aspect of various sizes makes the SLAB allocator not optimal
> because it operates on caches (pools of pages) which manage objects of
> the same size.

Indeed, that's why I wasn't too excited about caches, but probably I was
not able to explain sufficiently well why in the RFC :-/

> You could use the maximum size of all objects and waste
> some memory but you would have to know this max in advance which would
> make this approach less practical. You could create more caches of
> course but that still requires to know those sizes in advance.

Yes, and even generic per-architecture or even per board profiling
wouldn't necessarily do much good: taking SE Linux as example, one could
have two identical boards with almost identical binaries installed, only
differing in the rules/policy.
That difference alone could easily lead to very different size
requirements for the sealable page pool.

> So it smells like a dedicated allocator which operates on a pool of
> pages might be a better option in the end.

ok

> This depends on what you
> expect from the allocator. NUMA awareness? Very effective hotpath? Very
> good fragmentation avoidance? CPU cache awareness? Special alignment
> requirements? Reasonable free()? Etc...

>From the perspective of selling the feature to as many subsystems as
possible, I'd say that as primary requirement, it shouldn't affect
runtime performance.
I suppose (but it's just my guess) that trading a milliseconds-scale
boot-time slowdown for additional integrity is acceptable in the vast
majority of cases.

The only alignment requirements that I can think of, are coming from the
MMU capability of dealing only with physical pages, when it comes to
write-protect them.

> To me it seems that this being an initialization mostly thingy a simple
> allocator which manages a pool of pages (one set of sealed and one for
> allocations) 

Shouldn't also the set of pages used for keeping track of the others be
sealed? Once one is ro, also the other should not change.

> and which only appends new objects as they fit to unsealed
> pages would be sufficient for starter.

Any "free" that might happen during the initialization transient, would
actually result in an untracked gap, right?

What about the size of the pool of pages?
No predefined size, instead request a new page, when the memory
remaining from the page currently in use is not enough to fit the latest
allocation request?

There are also two aspect we discussed earlier:

- livepatch: how to deal with it? Identify the page it wants to modify
and temporarily un-protect it?

- modules: unloading and reloading modules will eventually lead to
permanently lost pages, in increasing number.
Loading/unloading repeatedly the same module is probably not so common,
with a major exception being USB, where almost anything can show up.
And disappear.
This seems like a major showstopper for the linear allocator you propose.

My reasoning in pursuing the kmalloc approach was that it is already
equipped with mechanisms for dealing with these sort of cases, where
memory can be fragmented.
I also wouldn't risk introducing bugs with my homebrew allocator ...

The initial thought was that there could be a master toggle to
seal/unseal all the memory affected.

But you were not too excited, iirc :-D
Alternatively, kmalloc could be enhanced to unseal only the pages it
wants to modify.

I don't think much can be done for data that is placed together, in the
same page with something that needs to be altered.
But what is outside of that page could still enjoy the protection from
the seal.

--
igor


Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-10 Thread Igor Stoppa
On 10/05/17 14:43, Michal Hocko wrote:
> On Wed 10-05-17 11:57:42, Igor Stoppa wrote:
>> On 10/05/17 11:05, Michal Hocko wrote:
> [...]
>>> To me it seems that this being an initialization mostly thingy a simple
>>> allocator which manages a pool of pages (one set of sealed and one for
>>> allocations) 
>>
>> Shouldn't also the set of pages used for keeping track of the others be
>> sealed? Once one is ro, also the other should not change.
> 
> Heh, that really depends how much consistency and robustness you want to
> achieve. It is really hard to defend against targeted attacks against
> the allocator metadata when a code is running in the kernel.

Taking the trouble to implement the sealing, then anything that doesn't
have a justification for staying R/W is fair game for sealing, IMHO.

>>> and which only appends new objects as they fit to unsealed
>>> pages would be sufficient for starter.
>>
>> Any "free" that might happen during the initialization transient, would
>> actually result in an untracked gap, right?
> 
> yes. And once the whole page is free it would get unsealed and returned
> to the (page) allocator.

Which means that there must be some way to track the freeing.
I intentionally omitted it, because I wasn't sure it would still be
compatible with the idea of a simple linear allocator.

> This approach would inevitably lead to internal
> fragmentation but reducing that would require a pool which is shared for
> objects with the common life cycle which is quite hard with requirements
> you have (you would have to convey the allocation context to all users
> somehow).

What if the users were unaware of most of the context and would only use
some flag, say GFP_SEAL?
Shouldn't the allocator be the only one aware of the context?
Context being the actual set of pages used.

Other idea: for each logical group of objects having the same lifecycle,
define a pool, then do linear allocation within the pool for the
respective logical group.

Still some way would be needed to track the utilization of each page,
but it would ensure that when a logical group is discarded, all its
related pages are freed.

>> What about the size of the pool of pages?
> 
> I wouldn't see that as a big deal. New pages would be allocated as
> needed.

ok

[...]

>> - modules: unloading and reloading modules will eventually lead to
>> permanently lost pages, in increasing number.
> 
> Each module should free all objects that were allocated on its behalf
> and that should result in pages being freed as well

Only if the objects are enforced to be contiguous and the start is at
the beginning of a page, which seems to go in the direction of having a
memory pool for each module.

>> Loading/unloading repeatedly the same module is probably not so common,
>> with a major exception being USB, where almost anything can show up.
>> And disappear.
>> This seems like a major showstopper for the linear allocator you propose.
> 
> I am not sure I understand. If such a module kept allocations behind it
> would be a memory leak no matter what.

What I had in mind is that, with a global linear allocator _without_
support for returning "freed" pages, memory consumption would keep
increasing.

But even if the module correctly frees its allocations and they are
tracked correctly, it's still possible that some page doesn't get
returned, unless the module started placing its data at the beginning of
a brand new page and nothing but that module used it.

So it really looks like we are discussing a per-module (linear) allocator.

Probably that's what you meant all the time and I just realized it now ...

>> My reasoning in pursuing the kmalloc approach was that it is already
>> equipped with mechanisms for dealing with these sorts of cases, where
>> memory can be fragmented.
> 
> Yeah, but kmalloc is optimized for a completely different usecase. You
> can reuse same pages again and again while you clearly cannot do the
> same once you seal a page and make it read only.

No, but during the allocation transient, I could.

Cons: less protection for what is already in the page.
Pros: tighter packing.

> Well unless you want to
> open time windows when the page stops being RO or use a different
> mapping for the allocator.

Yes, I was proposing to temporarily make the specific page RW.

> But try to consider how many features of the slab allocator you are
> actually going to need wrt. to tweaks it would have to implement to
> support this new use case. Maybe duplicating general purpose caches and
> creating specialized explicitly is a viable path. I haven't tried
> it.
> 
>> I also wouldn't risk introducing bugs with my homebrew allocator ...
>>

Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-09 Thread Igor Stoppa
On 08/05/17 18:25, Laura Abbott wrote:
> On 05/05/2017 03:42 AM, Igor Stoppa wrote:
>> On 04/05/17 19:49, Laura Abbott wrote:

[...]

> PAGE_SIZE is still 4K/16K/64K but the underlying page table mappings
> may use larger mappings (2MB, 32M, 512M, etc.). The ARM architecture
> has a break-before-make requirement which requires old mappings be
> fully torn down and invalidated to avoid TLB conflicts. This is nearly
> impossible to do correctly on live page tables so the current policy
> is to not break down larger mappings.

ok, but if a system integrator chooses to have the mapping set to 512M
(let's consider a case that is definitely unoptimized), this granularity
will apply to anything that needs to be marked as R/O and is allocated
through the linear mapping.

If the issue inherently sits with linear allocation, then maybe the
correct approach is to confirm whether linear allocation is really
needed in the first place.
Maybe not, at least for the type of data I have in mind.

However, that would require changes in the users of the interface,
rather than the interface itself.

I don't see this change as a problem, in general, but OTOH I do not know
yet if there are legitimate reasons to use kmalloc for any
post-init-read-only data.

Of course, if you have any proposal that would solve this problem with
large pages, I'd be interested to know.

Also, for me to better understand the magnitude of the problem, do you
know if there is any specific scenario where larger mappings would be
used/recommended?

The major reason for using larger mappings that I can think of is to
improve the efficiency of the TLB when under pressure; however, I
wouldn't be able to judge how much this can affect the overall
performance on big iron or in a small device.

Do you know if there is any similar case that has to deal with locking
down large pages?
Doesn't the kernel text potentially have a similar risk of leaving
large amounts of memory unused?
Is rodata only virtual?

> I'd rather see this designed as being mandatory from the start and then
> provide a mechanism to turn it off if necessary. The uptake and
> coverage from opt-in features tends to be very low based on past experience.

I have nothing against such wish, actually I'd love it, but I'm not sure
I have the answer for it.

Yet, a partial solution is better than nothing, if there is no truly
flexible one.

--
igor




[PATCH 2/3] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-06-26 Thread Igor Stoppa
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Commit 3dfc9b02864b19f4 ("LSM: Initialize security_hook_heads upon
registration.") treats "struct security_hook_heads" as an implicit array
of "struct list_head" so that we can eliminate code for static
initialization. Although we haven't encountered compilers which do not
treat sizeof(security_hook_heads) != sizeof(struct list_head) *
(sizeof(security_hook_heads) / sizeof(struct list_head)), Casey does not
like the assumption that a structure of N elements can be assumed to be
the same as an array of N elements.

Now that Kees found that randstruct complains about such casting

  security/security.c: In function 'security_init':
  security/security.c:59:20: note: found mismatched op0 struct pointer
types: 'struct list_head' and 'struct security_hook_heads'

struct list_head *list = (struct list_head *) &security_hook_heads;

and Christoph thinks that we should fix it rather than make randstruct
whitelist it, this patch fixes it.

It would be possible to revert commit 3dfc9b02864b19f4, but this patch
converts security_hook_heads into an explicit array of struct list_head
by introducing an enum, due to reasons explained below.

Igor proposed a sealable memory allocator, and the LSM hooks
("struct security_hook_heads security_hook_heads" and
"struct security_hook_list ...[]") will benefit from that allocator via
protection using set_memory_ro()/set_memory_rw(), and that allocator
will remove CONFIG_SECURITY_WRITABLE_HOOKS config option. Thus, we will
likely be moving to that direction.

This means that these structures will be allocated at run time using
that allocator, and therefore the address of these structures will be
determined at run time rather than compile time.

But currently, LSM_HOOK_INIT() macro depends on the address of
security_hook_heads being known at compile time. If we use an enum
so that LSM_HOOK_INIT() macro does not need to know absolute address of
security_hook_heads, it will help us to use that allocator for LSM hooks.

As a result of introducing an enum, security_hook_heads becomes a local
variable. In order to pass 80 columns check by scripts/checkpatch.pl ,
rename security_hook_heads to hook_heads.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Rebased-by: Igor Stoppa <igor.sto...@huawei.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: Paul Moore <p...@paul-moore.com>
Cc: Stephen Smalley <s...@tycho.nsa.gov>
Cc: Casey Schaufler <ca...@schaufler-ca.com>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Igor Stoppa <igor.sto...@huawei.com>
Cc: Christoph Hellwig <h...@infradead.org>
---
 include/linux/lsm_hooks.h | 420 +++---
 security/security.c   |  31 ++--
 2 files changed, 227 insertions(+), 224 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 3cc9d77..32f30fa 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1694,225 +1694,226 @@ union security_list_options {
 #endif /* CONFIG_AUDIT */
 };
 
-struct security_hook_heads {
-   struct list_head binder_set_context_mgr;
-   struct list_head binder_transaction;
-   struct list_head binder_transfer_binder;
-   struct list_head binder_transfer_file;
-   struct list_head ptrace_access_check;
-   struct list_head ptrace_traceme;
-   struct list_head capget;
-   struct list_head capset;
-   struct list_head capable;
-   struct list_head quotactl;
-   struct list_head quota_on;
-   struct list_head syslog;
-   struct list_head settime;
-   struct list_head vm_enough_memory;
-   struct list_head bprm_set_creds;
-   struct list_head bprm_check_security;
-   struct list_head bprm_secureexec;
-   struct list_head bprm_committing_creds;
-   struct list_head bprm_committed_creds;
-   struct list_head sb_alloc_security;
-   struct list_head sb_free_security;
-   struct list_head sb_copy_data;
-   struct list_head sb_remount;
-   struct list_head sb_kern_mount;
-   struct list_head sb_show_options;
-   struct list_head sb_statfs;
-   struct list_head sb_mount;
-   struct list_head sb_umount;
-   struct list_head sb_pivotroot;
-   struct list_head sb_set_mnt_opts;
-   struct list_head sb_clone_mnt_opts;
-   struct list_head sb_parse_opts_str;
-   struct list_head dentry_init_security;
-   struct list_head dentry_create_files_as;
+enum security_hook_index {
+   LSM_binder_set_context_mgr,
+   LSM_binder_transaction,
+   LSM_binder_transfer_binder,
+   LSM_binder_transfer_file,
+   LSM_ptrace_access_check,
+   LSM_ptrace_traceme,
+   LSM_capget,
+   LSM_capset,
+   LSM_capable,
+   LSM_quotactl,
+   LSM_quota_on,
+   LSM_syslog,
+   LSM_settime,
+   

[PATCH 3/3] Make LSM Writable Hooks a command line option

2017-06-26 Thread Igor Stoppa
From: Igor Stoppa <igor.sto...@gmail.com>

This patch shows how it is possible to take advantage of pmalloc:
instead of using the build-time option __lsm_ro_after_init, to decide if
it is possible to keep the hooks modifiable, now this becomes a
boot-time decision, based on the kernel command line.

This patch relies on:

"Convert security_hook_heads into explicit array of struct list_head"
Author: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

to break free from the static constraint imposed by the previous
hardening model, based on __ro_after_init.

The default value is disabled, unless SE Linux debugging is turned on.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 security/security.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index 44c47b6..c7b4670 100644
--- a/security/security.c
+++ b/security/security.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define MAX_LSM_EVM_XATTR  2
@@ -34,10 +35,19 @@
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX  10
 
-static struct list_head hook_heads[LSM_MAX_HOOK_INDEX]
-   __lsm_ro_after_init;
 static ATOMIC_NOTIFIER_HEAD(lsm_notifier_chain);
 
+static int dynamic_lsm = IS_ENABLED(CONFIG_SECURITY_SELINUX_DISABLE);
+
+static __init int set_dynamic_lsm(char *str)
+{
+   get_option(&str, &dynamic_lsm);
+   return 0;
+}
+early_param("dynamic_lsm", set_dynamic_lsm);
+
+static struct list_head *hook_heads;
+static struct gen_pool *sec_pool;
 char *lsm_names;
 /* Boot-time LSM user choice */
 static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =
@@ -62,6 +72,11 @@ int __init security_init(void)
 {
enum security_hook_index i;
 
+   sec_pool = pmalloc_create_pool("security", PMALLOC_DEFAULT_ALLOC_ORDER);
+   BUG_ON(!sec_pool);
+   hook_heads = pmalloc(sec_pool,
+sizeof(struct list_head) * LSM_MAX_HOOK_INDEX);
+   BUG_ON(!hook_heads);
for (i = 0; i < LSM_MAX_HOOK_INDEX; i++)
INIT_LIST_HEAD(&hook_heads[i]);
pr_info("Security Framework initialized\n");
@@ -77,7 +92,8 @@ int __init security_init(void)
 * Load all the remaining security modules.
 */
do_security_initcalls();
-
+   if (!dynamic_lsm)
+   pmalloc_protect_pool(sec_pool);
return 0;
 }
 
-- 
2.9.3



[PATCH 1/3] Protectable memory support

2017-06-26 Thread Igor Stoppa
From: Igor Stoppa <igor.sto...@gmail.com>

The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated ones.

Dynamic allocation does not currently provide any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 arch/Kconfig   |   1 +
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h| 111 +
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 lib/Kconfig|   1 +
 lib/genalloc.c |   4 +-
 mm/Makefile|   1 +
 mm/pmalloc.c   | 346 +
 mm/usercopy.c  |  24 +--
 10 files changed, 482 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 6c00e5b..9d16b51 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -228,6 +228,7 @@ config GENERIC_IDLE_POLL_SETUP
 
 # Select if arch has all set_memory_ro/rw/x/nx() functions in asm/cacheflush.h
 config ARCH_HAS_SET_MEMORY
+   select GENERIC_ALLOCATOR
bool
 
 # Select if arch init_task initializer is different to init/init_task.c
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..acc0723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -81,6 +81,7 @@ enum pageflags {
PG_active,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
PG_slab,
+   PG_pmalloc,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -274,6 +275,7 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, 
active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
+__PAGEFLAG(Pmalloc, pmalloc, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
 
 /* Xen */
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..0d65f83
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,111 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/**
+ * pmalloc_create_pool - create a new protectable memory pool -
+ * @name: the name of the pool, must be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ *   from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Returns a pointer to the new pool upon success, otherwise NULL.
+ */
+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);
+
+
+/**
+ * pmalloc_get_pool - get a pool handler, from its name
+ * @name: the name of the pool sought after.
+ *
+ * Returns a pointer to the pool upon success, otherwise NULL.
+ */
+struct gen_pool *pmalloc_get_pool(const char *name);
+
+
+
+/**
+ * pmalloc - allocate protectable memory from a pool
+ * @pool: handle to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Allocates memory from an unprotected pool. If the pool doesn't have
+ * enough memor

[PATCH v7 0/3] ro protection for dynamic data

2017-06-26 Thread Igor Stoppa
Hi,
please consider for inclusion.

This patch introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a pool is made R/O, all the memory
that is part of it will become R/O.

An R/O pool can be destroyed to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

An example is provided, showing how to turn into a boot-time option the
writable state of the security hooks.
Prior to this patch, it was a compile-time option.
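
For reviewers who want a quick picture of the intended call flow, here is
a minimal usage sketch based on the signatures in this series; my_pool,
my_cfg and the example module functions are placeholder names, and error
handling is omitted:

#include <linux/genalloc.h>	/* struct gen_pool */
#include <linux/pmalloc.h>

struct my_cfg {
	int threshold;
};

static struct gen_pool *my_pool;
static struct my_cfg *cfg;

static int __init my_module_init(void)
{
	my_pool = pmalloc_create_pool("my_pool", PMALLOC_DEFAULT_ALLOC_ORDER);
	cfg = pmalloc(my_pool, sizeof(*cfg));

	cfg->threshold = 42;		/* initialization phase: still writable */

	pmalloc_protect_pool(my_pool);	/* from here on, *cfg is read-only */
	return 0;
}

static void my_module_cleanup(void)
{
	/* the only way back: release the whole pool */
	pmalloc_destroy_pool(my_pool);
}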

This is made possible, thanks to Tetsuo Handa's rework of the hooks
structure (included in the patchset).

Changes since the v6 version:
- complete rewrite, to use the genalloc library
- added sysfs interface for tracking of active pools

The only question still open is whether there should be a possibility of
unprotecting a memory pool in cases other than destruction.

The only cases found for this topic are:
- protecting the LSM header structure between creation and insertion of a
  security module that was not built as part of the kernel
  (but the module can protect the headers after it has loaded)

- unloading SELinux from RedHat, if the system has booted, but no policy
  has been loaded yet - this feature is going away, according to Casey.


Igor Stoppa (2):
  Protectable memory support
  Make LSM Writable Hooks a command line option

Tetsuo Handa (1):
  LSM: Convert security_hook_heads into explicit array of struct
list_head

 arch/Kconfig   |   1 +
 include/linux/lsm_hooks.h  | 420 -
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h| 111 +++
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 lib/Kconfig|   1 +
 lib/genalloc.c |   4 +-
 mm/Makefile|   1 +
 mm/pmalloc.c   | 346 +
 mm/usercopy.c  |  24 ++-
 security/security.c|  49 +++--
 12 files changed, 726 insertions(+), 236 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

-- 
2.9.3



[PATCH 2/3] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-06-27 Thread Igor Stoppa
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Commit 3dfc9b02864b19f4 ("LSM: Initialize security_hook_heads upon
registration.") treats "struct security_hook_heads" as an implicit array
of "struct list_head" so that we can eliminate code for static
initialization. Although we haven't encountered compilers which do not
treat sizeof(security_hook_heads) != sizeof(struct list_head) *
(sizeof(security_hook_heads) / sizeof(struct list_head)), Casey does not
like the assumption that a structure of N elements can be assumed to be
the same as an array of N elements.

Now that Kees found that randstruct complains about such casting

  security/security.c: In function 'security_init':
  security/security.c:59:20: note: found mismatched op0 struct pointer
types: 'struct list_head' and 'struct security_hook_heads'

struct list_head *list = (struct list_head *) &security_hook_heads;

and Christoph thinks that we should fix it rather than make randstruct
whitelist it, this patch fixes it.

It would be possible to revert commit 3dfc9b02864b19f4, but this patch
converts security_hook_heads into an explicit array of struct list_head
by introducing an enum, due to reasons explained below.

Igor proposed a sealable memory allocator, and the LSM hooks
("struct security_hook_heads security_hook_heads" and
"struct security_hook_list ...[]") will benefit from that allocator via
protection using set_memory_ro()/set_memory_rw(), and that allocator
will remove CONFIG_SECURITY_WRITABLE_HOOKS config option. Thus, we will
likely be moving to that direction.

This means that these structures will be allocated at run time using
that allocator, and therefore the address of these structures will be
determined at run time rather than compile time.

But currently, LSM_HOOK_INIT() macro depends on the address of
security_hook_heads being known at compile time. If we use an enum
so that LSM_HOOK_INIT() macro does not need to know absolute address of
security_hook_heads, it will help us to use that allocator for LSM hooks.

As a result of introducing an enum, security_hook_heads becomes a local
variable. In order to pass 80 columns check by scripts/checkpatch.pl ,
rename security_hook_heads to hook_heads.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Rebased-by: Igor Stoppa <igor.sto...@huawei.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: Paul Moore <p...@paul-moore.com>
Cc: Stephen Smalley <s...@tycho.nsa.gov>
Cc: Casey Schaufler <ca...@schaufler-ca.com>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Igor Stoppa <igor.sto...@huawei.com>
Cc: Christoph Hellwig <h...@infradead.org>
---
 include/linux/lsm_hooks.h | 420 +++---
 security/security.c   |  31 ++--
 2 files changed, 227 insertions(+), 224 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 3cc9d77..32f30fa 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1694,225 +1694,226 @@ union security_list_options {
 #endif /* CONFIG_AUDIT */
 };
 
-struct security_hook_heads {
-   struct list_head binder_set_context_mgr;
-   struct list_head binder_transaction;
-   struct list_head binder_transfer_binder;
-   struct list_head binder_transfer_file;
-   struct list_head ptrace_access_check;
-   struct list_head ptrace_traceme;
-   struct list_head capget;
-   struct list_head capset;
-   struct list_head capable;
-   struct list_head quotactl;
-   struct list_head quota_on;
-   struct list_head syslog;
-   struct list_head settime;
-   struct list_head vm_enough_memory;
-   struct list_head bprm_set_creds;
-   struct list_head bprm_check_security;
-   struct list_head bprm_secureexec;
-   struct list_head bprm_committing_creds;
-   struct list_head bprm_committed_creds;
-   struct list_head sb_alloc_security;
-   struct list_head sb_free_security;
-   struct list_head sb_copy_data;
-   struct list_head sb_remount;
-   struct list_head sb_kern_mount;
-   struct list_head sb_show_options;
-   struct list_head sb_statfs;
-   struct list_head sb_mount;
-   struct list_head sb_umount;
-   struct list_head sb_pivotroot;
-   struct list_head sb_set_mnt_opts;
-   struct list_head sb_clone_mnt_opts;
-   struct list_head sb_parse_opts_str;
-   struct list_head dentry_init_security;
-   struct list_head dentry_create_files_as;
+enum security_hook_index {
+   LSM_binder_set_context_mgr,
+   LSM_binder_transaction,
+   LSM_binder_transfer_binder,
+   LSM_binder_transfer_file,
+   LSM_ptrace_access_check,
+   LSM_ptrace_traceme,
+   LSM_capget,
+   LSM_capset,
+   LSM_capable,
+   LSM_quotactl,
+   LSM_quota_on,
+   LSM_syslog,
+   LSM_settime,
+   

[PATCH 3/3] Make LSM Writable Hooks a command line option

2017-06-27 Thread Igor Stoppa
From: Igor Stoppa <igor.sto...@gmail.com>

This patch shows how it is possible to take advantage of pmalloc:
instead of using the build-time option __lsm_ro_after_init, to decide if
it is possible to keep the hooks modifiable, now this becomes a
boot-time decision, based on the kernel command line.

This patch relies on:

"Convert security_hook_heads into explicit array of struct list_head"
Author: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

to break free from the static constraint imposed by the previous
hardening model, based on __ro_after_init.

The default value is disabled, unless SE Linux debugging is turned on.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 security/security.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index 44c47b6..c7b4670 100644
--- a/security/security.c
+++ b/security/security.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define MAX_LSM_EVM_XATTR  2
@@ -34,10 +35,19 @@
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX  10
 
-static struct list_head hook_heads[LSM_MAX_HOOK_INDEX]
-   __lsm_ro_after_init;
 static ATOMIC_NOTIFIER_HEAD(lsm_notifier_chain);
 
+static int dynamic_lsm = IS_ENABLED(CONFIG_SECURITY_SELINUX_DISABLE);
+
+static __init int set_dynamic_lsm(char *str)
+{
+   get_option(&str, &dynamic_lsm);
+   return 0;
+}
+early_param("dynamic_lsm", set_dynamic_lsm);
+
+static struct list_head *hook_heads;
+static struct gen_pool *sec_pool;
 char *lsm_names;
 /* Boot-time LSM user choice */
 static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =
@@ -62,6 +72,11 @@ int __init security_init(void)
 {
enum security_hook_index i;
 
+   sec_pool = pmalloc_create_pool("security", PMALLOC_DEFAULT_ALLOC_ORDER);
+   BUG_ON(!sec_pool);
+   hook_heads = pmalloc(sec_pool,
+sizeof(struct list_head) * LSM_MAX_HOOK_INDEX);
+   BUG_ON(!hook_heads);
for (i = 0; i < LSM_MAX_HOOK_INDEX; i++)
INIT_LIST_HEAD(&hook_heads[i]);
pr_info("Security Framework initialized\n");
@@ -77,7 +92,8 @@ int __init security_init(void)
 * Load all the remaining security modules.
 */
do_security_initcalls();
-
+   if (!dynamic_lsm)
+   pmalloc_protect_pool(sec_pool);
return 0;
 }
 
-- 
2.9.3



[PATCH 1/1] Sealable memory support

2017-05-19 Thread Igor Stoppa
Dynamically allocated variables can be made read only,
after they have been initialized, provided that they reside in memory
pages devoid of any RW data.

The implementation supplies means to create independent pools of memory,
which can be individually created, sealed/unsealed and destroyed.

A global pool is made available for those kernel modules that do not
need to manage an independent pool.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 mm/Makefile  |   2 +-
 mm/smalloc.c | 200 +++
 mm/smalloc.h |  61 ++
 3 files changed, 262 insertions(+), 1 deletion(-)
 create mode 100644 mm/smalloc.c
 create mode 100644 mm/smalloc.h

diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..737c42a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@ obj-y := filemap.o mempool.o oom_kill.o \
   mm_init.o mmu_context.o percpu.o slab_common.o \
   compaction.o vmacache.o swap_slots.o \
   interval_tree.o list_lru.o workingset.o \
-  debug.o $(mmu-y)
+  debug.o smalloc.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/smalloc.c b/mm/smalloc.c
new file mode 100644
index 000..fa04cc5
--- /dev/null
+++ b/mm/smalloc.c
@@ -0,0 +1,200 @@
+/*
+ * smalloc.c: Sealable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+
+#include 
+#include 
+#include "smalloc.h"
+
+#define page_roundup(size) (((size) + !(size) - 1 + PAGE_SIZE) & PAGE_MASK)
+
+#define pages_nr(size) (page_roundup(size) / PAGE_SIZE)
+
+static struct smalloc_pool *global_pool;
+
+struct smalloc_node *__smalloc_create_node(unsigned long words)
+{
+   struct smalloc_node *node;
+   unsigned long size;
+
+   /* Calculate the size to ask from vmalloc, page aligned. */
+   size = page_roundup(NODE_HEADER_SIZE + words * sizeof(align_t));
+   node = vmalloc(size);
+   if (!node) {
+   pr_err("No memory for allocating smalloc node.");
+   return NULL;
+   }
+   /* Initialize the node.*/
+   INIT_LIST_HEAD(&node->list);
+   node->free = node->data;
+   node->available_words = (size - NODE_HEADER_SIZE) / sizeof(align_t);
+   return node;
+}
+
+static __always_inline
+void *node_alloc(struct smalloc_node *node, unsigned long words)
+{
+   register align_t *old_free = node->free;
+
+   node->available_words -= words;
+   node->free += words;
+   return old_free;
+}
+
+void *smalloc(unsigned long size, struct smalloc_pool *pool)
+{
+   struct list_head *pos;
+   struct smalloc_node *node;
+   void *ptr;
+   unsigned long words;
+
+   /* If no pool specified, use the global one. */
+   if (!pool)
+   pool = global_pool;
+
+   mutex_lock(&pool->lock);
+
+   /* If the pool is sealed, then return NULL. */
+   if (pool->seal == SMALLOC_SEALED) {
+   mutex_unlock(&pool->lock);
+   return NULL;
+   }
+
+   /* Calculate minimum number of words required. */
+   words = (size + sizeof(align_t) - 1) / sizeof(align_t);
+
+   /* Look for slot that is large enough, in the existing pool.*/
+   list_for_each(pos, &pool->list) {
+   node = list_entry(pos, struct smalloc_node, list);
+   if (node->available_words >= words) {
+   ptr = node_alloc(node, words);
+   mutex_unlock(&pool->lock);
+   return ptr;
+   }
+   }
+
+   /* No slot found, get a new chunk of virtual memory. */
+   node = __smalloc_create_node(words);
+   if (!node) {
+   mutex_unlock(&pool->lock);
+   return NULL;
+   }
+
+   list_add(&node->list, &pool->list);
+   ptr = node_alloc(node, words);
+   mutex_unlock(&pool->lock);
+   return ptr;
+}
+
+static __always_inline
+unsigned long get_node_size(struct smalloc_node *node)
+{
+   if (!node)
+   return 0;
+   return page_roundup((((void *)node->free) - (void *)node) +
+   node->available_words * sizeof(align_t));
+}
+
+static __always_inline
+unsigned long get_node_pages_nr(struct smalloc_node *node)
+{
+   return pages_nr(get_node_size(node));
+}
+void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool)
+{
+   struct list_head *pos;
+   struct smalloc_node *node;
+
+   if (!pool)
+   pool = global_pool;
+   mutex_lock(>l

[RFC v3]mm: ro protection for data allocated dynamically

2017-05-19 Thread Igor Stoppa
Not all the data allocated dynamically needs to be altered frequently.
In some cases, it might be written just once, at initialization.

This RFC has the goal of improving memory integrity, by explicitly
making said data write-protected.

A reference implementation is provided.

During the previous 2 rounds, some concerns/questions were raised.
This iteration should address most of them, if not all.

Basic idea behind the implementation: on systems with MMU, the MMU
supports associating various types of attribute to memory pages.

One of them is being read-only.
The MMU will cause an exception upon attempts to alter a read-only page.
This mechanism is already in use for protecting: kernel text and
constant data.
Relatively recently, it has become possible to have also statically
allocated data to become read-only, with the __ro_after_init annotation.

However nothing is done for variables allocated dynamically.

The catch for re-using the same mechanism is that soon-to-be read-only
variables must be grouped in dedicated memory pages, without any rw data
falling in the same range.

This can be achieved with a dedicated allocator.

The proposed implementation allows creating memory pools.
Each pool can be treated independently from the others, allowing
fine-grained control over what data can be overwritten.

A pool is a kernel linked list, where the head contains a mutex used for
accessing the list, and the elements are nodes, providing the memory
actually used.

When a pool receives an allocation request for which it doesn't have
enough memory already available, it obtains a set of contiguous virtual
pages (node) that is large enough to cover the request being processed.
Such memory is likely to be significantly larger than what was required.
The slack is used for fulfilling further allocation requests, provided
that they fit the space available.

The pool ends up being a list of nodes, where each node contains a
request that, at the time it was received, could not be satisfied by
using the existing nodes, plus other requests that happened to fit in the
slack. Each node is thus treated as an individual linear pool.

When it's time to seal/unseal a pool, each element (node) of the list is
visited and the range of pages it comprises is passed to set_memory_ro/rw.
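
For clarity, that walk amounts to something like the following - a
simplified sketch, not the literal smalloc_seal_set() from the patch,
but reusing its pool/node structures and the get_node_pages_nr() helper;
SMALLOC_UNSEALED stands for whatever the counterpart of SMALLOC_SEALED
is called in the actual enum:

static void pool_seal_set(struct smalloc_pool *pool, bool make_ro)
{
	struct smalloc_node *node;

	mutex_lock(&pool->lock);
	list_for_each_entry(node, &pool->list, list) {
		/* each node starts at a page boundary (vmalloc'ed) */
		unsigned long addr = (unsigned long)node;
		int pages = get_node_pages_nr(node);

		if (make_ro)
			set_memory_ro(addr, pages);
		else
			set_memory_rw(addr, pages);
	}
	pool->seal = make_ro ? SMALLOC_SEALED : SMALLOC_UNSEALED;
	mutex_unlock(&pool->lock);
}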

Freeing memory is supported at pool level: if for some reason one or more
memory requests must be discarded at some point, they are simply ignored.
Upon pool tear down, the nodes are removed one by one and the
corresponding memory ranges are freed for good with vfree.

This approach avoids the extra complexity of tracking individual
allocations, yet it allows claiming back pages when they are not needed
anymore (i.e. module unloading).

The same design also supports isolation between different kernel modules:
each module can allocate one or more pools, to obtain the desired level of
granularity when managing portions of its data that need different handling.

The price for this flexibility is that some more slack is produced.
The exact amount depends on the sizes of the allocations performed and
the order in which they arrive.

Modules that do not want/need all of this flexibility can use the default
global pool provided by the allocator.

This pool is intended to provide consistency with __ro_after_init and
therefore would be sealed at the same time.

Some observations/questions:

* the backend of the memory allocation is done by using vmalloc.
  Is there any better way? bpf uses module_alloc, but that does not seem
  to be exactly its purpose.

* because of the vmalloc backend, this is not suitable for cases where
  physically contiguous memory regions are really needed; however, the
  type of data that would use this interface is unlikely to require
  interaction with HW devices that could raise such a need.

* the allocator supports defining a preferred alignment (currently set
  to 8 bytes, using uint64_t) - is it useful/desirable?
  If yes, is it the correct granularity (global)?

* to get the size of the padded header of a node, the current code uses
  __align(align_t) and it seems to work, but is it correct?

* examples of uses for this new allocator:
  - LSM Hooks
  - policy database of SE Linux (several different structure types)

Igor Stoppa (1):
  Sealable memory support

 mm/Makefile  |   2 +-
 mm/smalloc.c | 200 +++
 mm/smalloc.h |  61 ++
 3 files changed, 262 insertions(+), 1 deletion(-)
 create mode 100644 mm/smalloc.c
 create mode 100644 mm/smalloc.h

-- 
2.9.3



Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-19 Thread Igor Stoppa
Hello,

On 10/05/17 18:45, Dave Hansen wrote:
> On 05/10/2017 08:19 AM, Igor Stoppa wrote:
>> So I'd like to play a little what-if scenario:
>> what if I was to support exclusively virtual memory and convert to it
>> everything that might need sealing?
> 
> Because of the issues related to fracturing large pages, you might have
> had to go this route eventually anyway.  Changing the kernel linear map
> isn't nice.
> 
> FWIW, you could test this scheme by just converting all the users to
> vmalloc() and seeing what breaks.  They'd all end up rounding up all
> their allocations to PAGE_SIZE, but that'd be fine for testing.

Apologies for the long hiatus; it took me some time to figure out
a solution that could somehow address all the comments I got up to this
point.

It's here [1], I preferred to start one new thread, since the proposal
has in practice changed significantly, even if in spirit it's still the
same.

It should also take care of the potential waste of space you mentioned
wrt the round up to PAGE_SIZE.

> Could you point out 5 or 10 places in the kernel that you want to convert?

Right now I can only repeat what I said initially:
- the linked list used to implement LSM hooks
- SE linux structures used to implement the policy DB, it should be
  about 5 data types

Next week, I'll address the 2 cases I listed, then I'll look for more,
but I think it should not be difficult to find customers for this.

BTW, I forgot to mention that I tested the code against both SLAB and
SLUB and it seems to work fine.

So far I've used QEMU x86-64 as test environment.

--
igor


[1] https://marc.info/?l=linux-mm=149519044015956=2


Re: [kernel-hardening] [PATCH 1/1] Sealable memory support

2017-05-22 Thread Igor Stoppa
On 20/05/17 11:51, Greg KH wrote:
> On Fri, May 19, 2017 at 01:38:11PM +0300, Igor Stoppa wrote:
>> Dynamically allocated variables can be made read only,

[...]

> This is really nice, do you have a follow-on patch showing how any of
> the kernel can be changed to use this new subsystem?

Yes, actually I started from the need to turn some data structures in
both the LSM hooks and SE Linux into R/O.

> Without that, it
> might be hard to get this approved (we don't like adding new apis
> without users.)

Yes, I just wanted to give an early preview of the current
implementation, since it is significantly different from what I
initially proposed. So I was looking for early feedback.

Right now, I'm fixing it up and adding some more structured debugging.

Then I'll re-submit it together with the LSM Hooks example.

---
thanks, igor


Re: [PATCH] LSM: Make security_hook_heads a local variable.

2017-05-22 Thread Igor Stoppa
On 22/05/17 18:09, Casey Schaufler wrote:
> On 5/22/2017 7:03 AM, Christoph Hellwig wrote:

[...]

>> But even with those we can still chain
>> them together with a list with external linkage.
> 
> I gave up that approach in 2012. Too many unnecessary calls to
> null functions, and massive function vectors with a tiny number
> of non-null entries. From a data structure standpoint, it was
> just wrong. The list scheme is exactly right for the task at
> hand.

I understand this as a green light for me to continue with the plan of
using the LSM hooks as an example of making dynamically allocated data
read-only, also using Tetsuo's patch (thanks, btw).

Is that correct?

---
thanks, igor


Re: [PATCH 1/1] Sealable memory support

2017-05-24 Thread Igor Stoppa


On 23/05/17 23:11, Kees Cook wrote:
> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.sto...@huawei.com> wrote:

[...]

> I would want hardened usercopy support as a requirement for using
> smalloc(). Without it, we're regressing the over-read protection that
> already exists for slab objects, if kernel code switched from slab to
> smalloc. It should be very similar to the existing slab checks. "Is
> this a smalloc object? Have we read beyond the end of a given object?"
> etc. The metadata is all there, except for an efficient way to mark a
> page as a smalloc page, but I think that just requires a new Page$Name
> bit test, as done for slab.

ok

[...]

> I meant this:
> 
> CPU 1 CPU 2
> create
> alloc
> write
> seal
> ...
> unseal
> write
> write
> seal
> 
> The CPU 2 write would be, for example, an attacker using a
> vulnerability to attempt to write to memory in the sealed area. All it
> would need to do to succeed would be to trigger an action in the
> kernel that would do a "legitimate" write (which requires the unseal),
> and race it. Unsealing should be CPU-local, if the API is going to
> support this kind of access.

I see.
If the CPU1 were to forcibly halt anything that can race with it, then
it would be sure that there was no interference.
A reactive approach could be, instead, to re-validate the content after
the sealing, assuming that it is possible.

[...]

> I am more concerned about _any_ unseal after initial seal. And even
> then, it'd be nice to keep things CPU-local. My concerns are related
> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
> which is kind of like this, but focused on the .data section, not
> dynamic memory. It has similar concerns about CPU-locality.
> Additionally, even writing to memory and then making it read-only
> later runs risks (see threads about BPF JIT races vs making things
> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
> doesn't change the risk this series is fixing: races with attacker
> writes during assignment but before read-only marking).

If you are talking about an attacker, rather than protection against
accidental overwrites, how can hashing be enough?
Couldn't the attacker compromise that too?

> So, while smalloc would hugely reduce the window an attacker has
> available to change data contents, this API doesn't eliminate it. (To
> eliminate it, there would need to be a CPU-local page permission view
> that let only the current CPU to the page, and then restore it to
> read-only to match the global read-only view.)

That or, if one is ready to take the hit, freeze every other possible
attack vector. But I'm not sure this could be justifiable.

[...]

> Ah! In that case, sure. This isn't what the proposed API provided,
> though, so let's adjust it to only perform the unseal at destroy time.
> That makes it much saner, IMO. "Write once" dynamic allocations, or
> "read-only after seal". woalloc? :P yay naming

For now I'm still using smalloc.
Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
Other options might require some re-formatting.

[...]

> I think a shared global pool would need to be eliminated for true
> write-once semantics.

ok

[...]

>> I'd rather not add extra locking to something that doesn't need it:
>> Allocate - write - seal - read, read, read, ... - unseal - destroy.
> 
> Yup, I would be totally fine with this. It still has a race between
> allocate and seal, but it's a huge improvement over the existing state
> of the world where all dynamic memory is writable. :)

Great!


[...]

> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
> could just drop the runtime disabling of SELinux, we'd be fine.

I am not sure I understand this point.
If the kernel is properly configured, the master toggle variable
disappears, right?
Or do you mean the disabling through modifications of the linked list of
the hooks?

[...]


> Hm, I just meant add a char[] to the metadata and pass it in during
> create(). Then it's possible to report which smalloc cache is being
> examined during hardened usercopy checks.

Ok, that is not a big deal.
wrt this, I have spent some time writing a debug module, which currently
dumps into a debugfs entry a bunch of info about the various pools.
I could split it across multiple entries, using the label to generate
their names.

[...]

> It seems like smalloc pools could also be refcounted?

I am not sure what you mean.
What do you want to count?
Number of pools? Nodes per pool? Allocations per node?

And what for?

At least in the case of tearing down a pool, when a module is unloaded,
nobody needs to free anything that was allocated with smalloc.
Th

Re: [PATCH 1/1] Sealable memory support

2017-05-23 Thread Igor Stoppa
On 23/05/17 00:38, Kees Cook wrote:
> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa <igor.sto...@huawei.com> wrote:

[...]

> For the first bit of bikeshedding, should this really be called
> seal/unseal? My mind is probably just broken from having read TPM
> documentation, but this isn't really "sealing" as I'd understand it
> (it's not tied to a credential, for example). It's "only" rw/ro.
> Perhaps "protect/unprotect" or just simply "readonly/writable", and
> call the base function "romalloc"?

I was not aware of the specific meaning of "seal" in this context.
The term was implicitly proposed by Michal Hocko, while discussing the
mechanism, and I liked it more than what I was using initially:
"lockable".

tbh I like the sound of "smalloc" better than "romalloc"

But this is really the least of my worries :-P

> This is fundamentally a heap allocator, with linked lists, etc. I'd
> like to see as much attention as possible given to hardening it
> against attacks, especially adding redzoning around the metadata at
> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.

My initial goal was to provide something that is useful without
affecting performance.

You seem to be pushing for a more extreme approach.
While I have nothing against it and I actually agree that it can be
useful, I would not make it mandatory.

More on this later.

> And as
> part of that, I'd like hardened usercopy to grow knowledge of these
> allocations so we can bounds-check objects. Right now, mm/usercopy.c
> just looks at PageSlab(page) to decide if it should do slab checks. I
> think adding a check for this type of object would be very important
> there.

I am not familiar with this and I need to study it; however, I still
think that if there is a significant trade-off in terms of performance
vs resilience, it should be optional, for those who want it.

Maybe there could be a master toggle for these options, if it makes
sense to group them logically. How does this sound?

> The ro/rw granularity here is the _entire_ pool, not a specific
> allocation (or page containing the allocation). I'm concerned that
> makes this very open to race conditions where, especially in the
> global pool, one thread can be trying to write to ro data in a pool
> and another has made the pool writable.

I have the impression we are thinking of something different.
Close, but different enough.

First of all, how can using a mutex create races?
Do you mean with respect of other resources that might be held by
competing users of the pool?

That, imho, is a locking problem that cannot be solved here.
You can try to mitigate it by reducing the chances it will happen, but
basically you are trying to make do with a user of the API that is not
implementing locking correctly.
I'd say that it's better to fix the user.

If you meant something else, I really didn't get it :-)

More about the frequency of access: you seem to expect very frequent
seal/unseal - or lock/unlock - operations, while I don't.

What I envision as the primary case for unlocking is the tear down of the
entire pool, for example preceding the unloading of the module.

Think about __ro_after_init : once it is r/o, it stays r/o.

Which also means that there would possibly be a busy transient when
allocations are made. That is true.
But only for shared pools. If a module uses one pool, I would expect the
initialization of the module to be mostly sequential.
So, no chance of races. Do you agree?

Furthermore, if a module _really_ wants to do parallel allocation, then
maybe it's simpler and cleaner to have one pool per "thread" or whatever
mechanism is used.


The global pool is mostly there to complete the offering of
__ro_after_init. If one wants to use it, it's available.
It saves the trouble of having one's own pool, but it means competing with
the rest of the kernel for locking.

If it seems a bad idea, it can be easily removed.

[...]

>> +#include "smalloc.h"
> 
> Shouldn't this just be  ?

yes, I have fixed it in the next revision

[...]

>> +   /* If no pool specified, use the global one. */
>> +   if (!pool)
>> +   pool = global_pool;
> 
> It should be impossible, but this should check for global_pool == NULL too, 
> IMO.

I'm curious: why do you think it could happen?
I'd rather throw an exception, if I cannot initialize global_pool.

[...]

>> +   if (pool->seal == seal) {
>> +   mutex_unlock(>lock);
>> +   return;
> 
> I actually think this should be a BUG condition, since this means a
> mismatched seal/unseal happened. The pool should never be left
> writable at rest, a user should create a pool, write to it, seal. 

On this I agree.

> Any
> updates should unseal, write, seal. To a

[PATCH 1/4] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-06-07 Thread Igor Stoppa
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Commit 3dfc9b02864b19f4 ("LSM: Initialize security_hook_heads upon
registration.") treats "struct security_hook_heads" as an implicit array
of "struct list_head" so that we can eliminate code for static
initialization. Although we haven't encountered compilers which do not
treat sizeof(security_hook_heads) != sizeof(struct list_head) *
(sizeof(security_hook_heads) / sizeof(struct list_head)), Casey does not
like the assumption that a structure of N elements can be assumed to be
the same as an array of N elements.

Now that Kees found that randstruct complains about such casting

  security/security.c: In function 'security_init':
  security/security.c:59:20: note: found mismatched op0 struct pointer types: 
'struct list_head' and 'struct security_hook_heads'

struct list_head *list = (struct list_head *) &security_hook_heads;

and Christoph thinks that we should fix it rather than make randstruct
whitelist it, this patch fixes it.

It would be possible to revert commit 3dfc9b02864b19f4, but this patch
converts security_hook_heads into an explicit array of struct list_head
by introducing an enum, due to reasons explained below.

Igor proposed a sealable memory allocator, and the LSM hooks
("struct security_hook_heads security_hook_heads" and
"struct security_hook_list ...[]") will benefit from that allocator via
protection using set_memory_ro()/set_memory_rw(), and that allocator
will remove CONFIG_SECURITY_WRITABLE_HOOKS config option. Thus, we will
likely be moving to that direction.

This means that these structures will be allocated at run time using
that allocator, and therefore the address of these structures will be
determined at run time rather than compile time.

But currently, LSM_HOOK_INIT() macro depends on the address of
security_hook_heads being known at compile time. If we use an enum
so that LSM_HOOK_INIT() macro does not need to know absolute address of
security_hook_heads, it will help us to use that allocator for LSM hooks.

As a result of introducing an enum, security_hook_heads becomes a local
variable. In order to pass 80 columns check by scripts/checkpatch.pl ,
rename security_hook_heads to hook_heads.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Cc: Kees Cook <keesc...@chromium.org>
Cc: Paul Moore <p...@paul-moore.com>
Cc: Stephen Smalley <s...@tycho.nsa.gov>
Cc: Casey Schaufler <ca...@schaufler-ca.com>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Igor Stoppa <igor.sto...@huawei.com>
Cc: Christoph Hellwig <h...@infradead.org>
---
 include/linux/lsm_hooks.h | 412 +++---
 security/security.c   |  31 ++--
 2 files changed, 223 insertions(+), 220 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 080f34e..ac22be3 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1663,219 +1663,220 @@ union security_list_options {
 #endif /* CONFIG_AUDIT */
 };
 
-struct security_hook_heads {
-   struct list_head binder_set_context_mgr;
-   struct list_head binder_transaction;
-   struct list_head binder_transfer_binder;
-   struct list_head binder_transfer_file;
-   struct list_head ptrace_access_check;
-   struct list_head ptrace_traceme;
-   struct list_head capget;
-   struct list_head capset;
-   struct list_head capable;
-   struct list_head quotactl;
-   struct list_head quota_on;
-   struct list_head syslog;
-   struct list_head settime;
-   struct list_head vm_enough_memory;
-   struct list_head bprm_set_creds;
-   struct list_head bprm_check_security;
-   struct list_head bprm_secureexec;
-   struct list_head bprm_committing_creds;
-   struct list_head bprm_committed_creds;
-   struct list_head sb_alloc_security;
-   struct list_head sb_free_security;
-   struct list_head sb_copy_data;
-   struct list_head sb_remount;
-   struct list_head sb_kern_mount;
-   struct list_head sb_show_options;
-   struct list_head sb_statfs;
-   struct list_head sb_mount;
-   struct list_head sb_umount;
-   struct list_head sb_pivotroot;
-   struct list_head sb_set_mnt_opts;
-   struct list_head sb_clone_mnt_opts;
-   struct list_head sb_parse_opts_str;
-   struct list_head dentry_init_security;
-   struct list_head dentry_create_files_as;
+enum security_hook_index {
+   LSM_binder_set_context_mgr,
+   LSM_binder_transaction,
+   LSM_binder_transfer_binder,
+   LSM_binder_transfer_file,
+   LSM_ptrace_access_check,
+   LSM_ptrace_traceme,
+   LSM_capget,
+   LSM_capset,
+   LSM_capable,
+   LSM_quotactl,
+   LSM_quota_on,
+   LSM_syslog,
+   LSM_settime,
+   LSM_vm_enough_memory,
+   LSM_bprm_set_creds,
+   LSM_bprm_check_se

[PATCH 2/4] Protectable Memory Allocator

2017-06-07 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain only data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, however this is not fit too well the case of dynamically
allocated ones.

Dynamic allocation does not currently provide means for grouping
variables in memory pages that would contain exclusively data that can
be made read only.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the memory requested (over various iterations) is initialized,
the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h|  20 
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 mm/Makefile|   1 +
 mm/pmalloc.c   | 226 +
 mm/usercopy.c  |  24 +++--
 7 files changed, 267 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..acc0723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -81,6 +81,7 @@ enum pageflags {
PG_active,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
PG_slab,
+   PG_pmalloc,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -274,6 +275,7 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, 
active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
+__PAGEFLAG(Pmalloc, pmalloc, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
 
 /* Xen */
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..83d3557
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,20 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+
+struct pmalloc_pool *pmalloc_create_pool(const char *name);
+void *pmalloc(unsigned long size, struct pmalloc_pool *pool);
+int pmalloc_protect_pool(struct pmalloc_pool *pool);
+int pmalloc_destroy_pool(struct pmalloc_pool *pool);
+#endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94..41d1587 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -91,6 +91,7 @@
{1UL << PG_lru, "lru"   },  \
{1UL << PG_active,  "active"},  \
{1UL << PG_slab,"slab"  },  \
+   {1UL << PG_pmalloc, "pmalloc"   },  \
{1UL << PG_owner_priv_1,"owner_priv_1"  },  \
{1UL << PG_arch_1,  "arch_1"},  \
{1UL << PG_reserved,"reserved"  },  \
diff --git a/init/main.c b/init/main.c
index f866510..7850887 100644
--- a/init/main.c
+++ b/init/main.c
@@ -485,6 +485,7 @@ static void __init mm_init(void)
ioremap_huge_init();
 }
 
+extern int __init pmalloc_init(void);
 asmlinkage __visible void __init start_kernel(void)
 {
char *command_line;
@@ -653,6 +654,7 @@ asmlinkage __visible void __init start_kernel(void)
proc_caches_init();
buffer_init();
key_init();
+   pmalloc_init();
security_init();
dbg_late_init();
vfs_caches_init();
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..b47dcf8 100644
--- a/mm/Makefile
+

[PATCH 4/4] Make LSM Writable Hooks a command line option

2017-06-07 Thread Igor Stoppa
This patch shows how it is possible to take advantage of pmalloc:
instead of the build-time option __lsm_ro_after_init deciding whether
the hooks can stay modifiable, this now becomes a boot-time decision,
based on the kernel command line.

This patch relies on:

"Convert security_hook_heads into explicit array of struct list_head"
Author: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

to break free from the static constraint imposed by the previous
hardening model, based on __ro_after_init.

The default value is disabled, unless SE Linux debugging is turned on.
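
For example (based on the early_param handler below), booting with:

  dynamic_lsm=1

keeps the hook lists writable after init, while dynamic_lsm=0 - or
omitting the parameter on a kernel without CONFIG_SECURITY_SELINUX_DISABLE -
lets security_init() protect the pool once the built-in modules have
registered.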

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 security/security.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index c492f68..9b8b478 100644
--- a/security/security.c
+++ b/security/security.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/pmalloc.h>
 #include 
 
 #define MAX_LSM_EVM_XATTR  2
@@ -33,8 +34,17 @@
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX  10
 
-static struct list_head hook_heads[LSM_MAX_HOOK_INDEX]
-   __lsm_ro_after_init;
+static int dynamic_lsm = IS_ENABLED(CONFIG_SECURITY_SELINUX_DISABLE);
+
+static __init int set_dynamic_lsm(char *str)
+{
+   get_option(&str, &dynamic_lsm);
+   return 0;
+}
+early_param("dynamic_lsm", set_dynamic_lsm);
+
+static struct list_head *hook_heads;
+static struct pmalloc_pool *sec_pool;
 char *lsm_names;
 /* Boot-time LSM user choice */
 static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =
@@ -59,6 +69,11 @@ int __init security_init(void)
 {
enum security_hook_index i;
 
+   sec_pool = pmalloc_create_pool("security");
+   BUG_ON(!sec_pool);
+   hook_heads = pmalloc(sizeof(struct list_head) * LSM_MAX_HOOK_INDEX,
+sec_pool);
+   BUG_ON(!hook_heads);
for (i = 0; i < LSM_MAX_HOOK_INDEX; i++)
INIT_LIST_HEAD(&hook_heads[i]);
pr_info("Security Framework initialized\n");
@@ -74,7 +89,8 @@ int __init security_init(void)
 * Load all the remaining security modules.
 */
do_security_initcalls();
-
+   if (!dynamic_lsm)
+   pmalloc_protect_pool(sec_pool);
return 0;
 }
 
-- 
2.9.3



[PATCH v6 0/4] ro protection for dynamic data

2017-06-07 Thread Igor Stoppa
Hi,
please consider for inclusion.

This patchset introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a pool is made R/O, all the memory
that is part of it will become R/O.

A R/O pool can be destroyed to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications, after initialization.

An example is provided, showing how to turn the writable state of the
security hooks into a boot-time option.
Prior to this patch, it was a compile-time option.

This is made possible, thanks to Tetsuo Handa's rework of the hooks
structure (included in the patchset).

Changes since the v5 version:
- use unsigned long for  __PMALLOC_ALIGNED alignment
- fixed a regression where size in bytes was used instead of size in words
- tightened the update of the pools list during removal, by doing
  earlier the decrement of the atomic counter

The only question still open is whether there should be a possibility of
unprotecting a memory pool in cases other than destruction.

The only cases found so far are:
- protecting the LSM header structure between creation and insertion of a
  security module that was not built as part of the kernel
  (but the module can protect the headers after it has loaded)

- unloading SELinux from RedHat, if the system has booted, but no policy
  has been loaded yet - this feature is going away, according to Casey.


Igor Stoppa (3):
  Protectable Memory Allocator
  Protectable Memory Allocator - Debug interface
  Make LSM Writable Hooks a command line option

Tetsuo Handa (1):
  LSM: Convert security_hook_heads into explicit array of struct
list_head

 include/linux/lsm_hooks.h  | 412 -
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h|  20 ++
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 mm/Kconfig |  11 ++
 mm/Makefile|   1 +
 mm/pmalloc.c   | 339 +
 mm/usercopy.c  |  24 ++-
 security/security.c|  49 +++--
 10 files changed, 631 insertions(+), 230 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

-- 
2.9.3



[PATCH 3/4] Protectable Memory Allocator - Debug interface

2017-06-07 Thread Igor Stoppa
Debugfs interface: it creates the file

/sys/kernel/debug/pmalloc/pools

which exposes statistics about all the pools and memory nodes in use.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 mm/Kconfig   |  11 ++
 mm/pmalloc.c | 113 +++
 2 files changed, 124 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index beb7a45..dfbdc07 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -539,6 +539,17 @@ config CMA_AREAS
 
  If unsure, leave the default value "7".
 
+config PMALLOC_DEBUG
+bool "Protectable Memory Allocator debugging"
+depends on DEBUG_KERNEL
+default y
+help
+  Debugfs support for dumping information about memory pools.
+  It shows internal stats: free/used/total space, protection
+  status, data overhead, etc.
+
+  If unsure, say "y".
+
 config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
index 8050dea..09ce7f3 100644
--- a/mm/pmalloc.c
+++ b/mm/pmalloc.c
@@ -224,3 +224,116 @@ int __init pmalloc_init(void)
atomic_set(&pmalloc_data->pools_count, 0);
return 0;
 }
+
+#ifdef CONFIG_PMALLOC_DEBUG
+#include <linux/debugfs.h>
+static struct dentry *pmalloc_root;
+
+static void *__pmalloc_seq_start(struct seq_file *s, loff_t *pos)
+{
+   if (*pos)
+   return NULL;
+   return pos;
+}
+
+static void *__pmalloc_seq_next(struct seq_file *s, void *v, loff_t *pos)
+{
+   return NULL;
+}
+
+static void __pmalloc_seq_stop(struct seq_file *s, void *v)
+{
+}
+
+static __always_inline
+void __seq_printf_node(struct seq_file *s, struct pmalloc_node *node)
+{
+   unsigned long total_space, node_pages, end_of_node,
+ used_space, available_space;
+   int total_words, used_words, available_words;
+
+   used_words = atomic_read(&node->used_words);
+   total_words = node->total_words;
+   available_words = total_words - used_words;
+   used_space = used_words * WORD_SIZE;
+   total_space = total_words * WORD_SIZE;
+   available_space = total_space - used_space;
+   node_pages = (total_space + HEADER_SIZE) / PAGE_SIZE;
+   end_of_node = total_space + HEADER_SIZE + (unsigned long) node;
+   seq_printf(s, " - node:\t\t%pK\n", node);
+   seq_printf(s, "   - start of data ptr:\t%pK\n", node->data);
+   seq_printf(s, "   - end of node ptr:\t%pK\n", (void *)end_of_node);
+   seq_printf(s, "   - total words:\t%d\n", total_words);
+   seq_printf(s, "   - used words:\t%d\n", used_words);
+   seq_printf(s, "   - available words:\t%d\n", available_words);
+   seq_printf(s, "   - pages:\t\t%lu\n", node_pages);
+   seq_printf(s, "   - total space:\t%lu\n", total_space);
+   seq_printf(s, "   - used space:\t%lu\n", used_space);
+   seq_printf(s, "   - available space:\t%lu\n", available_space);
+}
+
+static __always_inline
+void __seq_printf_pool(struct seq_file *s, struct pmalloc_pool *pool)
+{
+   struct pmalloc_node *node;
+
+   seq_printf(s, "pool:\t\t\t%pK\n", pool);
+   seq_printf(s, " - name:\t\t%s\n", pool->name);
+   seq_printf(s, " - protected:\t\t%u\n", atomic_read(>protected));
+   seq_printf(s, " - nodes count:\t\t%u\n",
+  atomic_read(&pool->nodes_count));
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(node, &pool->nodes_list_head, nodes_list)
+   __seq_printf_node(s, node);
+   rcu_read_unlock();
+}
+
+static int __pmalloc_seq_show(struct seq_file *s, void *v)
+{
+   struct pmalloc_pool *pool;
+
+   seq_printf(s, "pools count:\t\t%u\n",
+  atomic_read(&pmalloc_data->pools_count));
+   seq_printf(s, "page size:\t\t%lu\n", PAGE_SIZE);
+   seq_printf(s, "word size:\t\t%lu\n", WORD_SIZE);
+   seq_printf(s, "node header size:\t%lu\n", HEADER_SIZE);
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(pool, &pmalloc_data->pools_list_head,
+pools_list)
+   __seq_printf_pool(s, pool);
+   rcu_read_unlock();
+   return 0;
+}
+
+static const struct seq_operations pmalloc_seq_ops = {
+   .start = __pmalloc_seq_start,
+   .next  = __pmalloc_seq_next,
+   .stop  = __pmalloc_seq_stop,
+   .show  = __pmalloc_seq_show,
+};
+
+static int __pmalloc_open(struct inode *inode, struct file *file)
+{
+   return seq_open(file, &pmalloc_seq_ops);
+}
+
+static const struct file_operations pmalloc_file_ops = {
+   .owner   = THIS_MODULE,
+   .open= __pmalloc_open,
+   .read= seq_read,
+   .llseek  = seq_lseek,
+   .release = seq_release
+};
+
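
For reference, the hunk is cut off before the code that registers the
debugfs entries; a sketch of what that registration typically looks
like (not necessarily the exact code in this patch):

  static int __init pmalloc_debugfs_init(void)
  {
          pmalloc_root = debugfs_create_dir("pmalloc", NULL);
          if (!pmalloc_root)
                  return -ENOMEM;
          if (!debugfs_create_file("pools", 0400, pmalloc_root,
                                   NULL, &pmalloc_file_ops))
                  return -ENOMEM;
          return 0;
  }
  late_initcall(pmalloc_debugfs_init);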

Re: [PATCH v2] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-05-31 Thread Igor Stoppa
On 30/05/17 13:32, James Morris wrote:

> This seems like pointless churn in security-critical code in anticipation 
> of features which are still in development and may not be adopted.
> 
> Is there a compelling reason to merge this now? (And I don't mean worrying 
> about non-existent compilers).

I propose to take this patch as part of those I will be submitting.
It took me some unplanned time to add support for hardened user copy,
but now it's done - at least to a point that I can test it without failures.

So I'm back on track to provide an example of the smalloc api and I can
also use Tetsuo's work (thanks again, btw).
This patch would be sandwiched between the smalloc ones and the LSM rework.

It can get merged when the rest (hopefully) is merged.

But I have a more prosaic question: since smalloc is affecting the
memory subsystem, can it still be merged through the security tree?

---
thanks, igor


Re: [kernel-hardening] [PATCH 3/5] Protectable Memory Allocator - Debug interface

2017-06-06 Thread Igor Stoppa


On 05/06/17 23:24, Jann Horn wrote:
> On Mon, Jun 5, 2017 at 9:22 PM, Igor Stoppa <igor.sto...@huawei.com> wrote:
>> Debugfs interface: it creates a file

[...]

> You should probably be using %pK to hide the kernel pointers.

ok, will do

---
igor


Re: [PATCH 4/5] Make LSM Writable Hooks a command line option

2017-06-06 Thread Igor Stoppa
On 05/06/17 23:50, Tetsuo Handa wrote:
> Casey Schaufler wrote:

[...]

>> I don't care for calling this "security debug". Making
>> the lists writable after init isn't about development,
>> it's about (Tetsuo's desire for) dynamic module loading.
>> I would prefer "dynamic_module_lists" our something else
>> more descriptive.
> 
> Maybe dynamic_lsm ?

ok, apologies for misunderstanding, I'll fix it.

I am not sure I understood what exactly the use case is:
-1) loading off-tree modules
-2) loading and unloading modules
-3) something else ?

I'm asking this because I now wonder if I should provide means for
protecting the heads later on (which still can make sense for case 1).

Or if it's expected that things will stay fluid and this dynamic loading
is matched by unloading, therefore the heads must stay writable (case 2)

[...]

>>> +   if (!sec_pool)
>>> +   goto error_pool;
>>
>> Excessive gotoing - return -ENOMEM instead.
> 
> But does it make sense to continue?
> hook_heads == NULL and we will oops as soon as
> call_void_hook() or call_int_hook() is called for the first time.

Shouldn't the caller check for result? -ENOMEM gives it a chance to do
so. I can replace the goto.

---
igor



Re: [PATCH 4/5] Make LSM Writable Hooks a command line option

2017-06-06 Thread Igor Stoppa
On 06/06/17 13:54, Tetsuo Handa wrote:

[...]

> "Loading modules which are not compiled as built-in" is correct.
> My use case is to allow users to use LSM modules as loadable kernel
> modules which distributors do not compile as built-in.

Ok, so I suppose someone should eventually lock down the header, after
the additional modules are loaded.

Who decides when enough is enough, meaning that all the needed modules
are loaded?
Should I provide an interface to user-space? A sysfs entry?

[...]

> Unloading LSM modules is dangerous. Only SELinux allows unloading
> at the risk of triggering an oops. If we insert delay while removing
> list elements, we can easily observe oops due to free function being
> called without corresponding allocation function.

Ok. But even in this case, the sysfs proposal would still work.
It would just stay unused.


--
igor


Re: [PATCH 4/5] Make LSM Writable Hooks a command line option

2017-06-06 Thread Igor Stoppa
On 06/06/17 17:36, Tetsuo Handa wrote:
> Igor Stoppa wrote:
>> For the case at hand, would it work if there was a non-API call that you
>> could use until the API is properly expanded?
> 
> Kernel command line switching (i.e. this patch) is fine for my use cases.
> 
> SELinux folks might want
> 
> -static int security_debug;
> +static int security_debug = IS_ENABLED(CONFIG_SECURITY_SELINUX_DISABLE);

ok, thanks, I will add this

> so that those who are using SELINUX=disabled in /etc/selinux/config won't
> get oops upon boot by default. If "unlock the pool" were available,
> SELINUX=enforcing users would be happy. Maybe two modes for rw/ro transition 
> helps.
> 
>   oneway rw -> ro transition mode: can't be made rw again by calling "unlock 
> the pool" API
>   twoway rw <-> ro transition mode: can be made rw again by calling "unlock 
> the pool" API

This was in the first cut of the API, but I was told that it would
require further rework to make it ok for upstream, so we agreed to do
first the lockdown/destroy-only part and then the rewrite.

Is there really a valid use case for unloading SE Linux?
Or any other security module.

--
igor


Re: [PATCH 2/4] Protectable Memory Allocator

2017-06-19 Thread Igor Stoppa
On 09/06/17 21:56, Laura Abbott wrote:
> On 06/07/2017 05:35 AM, Igor Stoppa wrote:

[...]

> The pool logic looks remarkably similar to genalloc (lib/genalloc.c).
> It's not a perfect 1-to-1 mapping but it's close enough to be worth
> a look.

Indeed. I have prepared a new incarnation of pmalloc, based on genalloc.
There are a couple of things that I would like to adjust in genalloc,
but I'll discuss this in the new submission.
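
As a rough sketch of that direction (names here are only illustrative;
the real rework will be in the new submission), each pool can own a
genalloc pool and feed it vmalloc'ed chunks:

  #include <linux/genalloc.h>
  #include <linux/vmalloc.h>

  static struct gen_pool *pool;

  static int sketch_grow_pool(size_t chunk_size)
  {
          void *chunk;

          if (!pool)
                  pool = gen_pool_create(ilog2(sizeof(unsigned long)),
                                         NUMA_NO_NODE);
          if (!pool)
                  return -ENOMEM;
          chunk = vmalloc(chunk_size);
          if (!chunk)
                  return -ENOMEM;
          /* hand the chunk to genalloc; allocations are carved from it */
          return gen_pool_add(pool, (unsigned long)chunk, chunk_size,
                              NUMA_NO_NODE);
  }

Individual allocations then come from gen_pool_alloc(pool, size), and
protection stays a matter of set_memory_ro() on the chunks.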

>> +
>> +const char msg[] = "Not a valid Pmalloc object.";
>> +const char *__pmalloc_check_object(const void *ptr, unsigned long n)
>> +{
>> +unsigned long p;
>> +
>> +p = (unsigned long)ptr;
>> +n = p + n - 1;
>> +for (; (PAGE_MASK & p) <= (PAGE_MASK & n); p += PAGE_SIZE) {
>> +if (is_vmalloc_addr((void *)p)) {
>> +struct page *page;
>> +
>> +page = vmalloc_to_page((void *)p);
>> +if (!(page && PagePmalloc(page)))
>> +return msg;
>> +}
> 
> Should this be an error if is_vmalloc_addr returns false?

Yes, if this function is called, at least the beginning of the range
*is* a vmalloc address and therefore the rest should be a vmalloc
address as well.

thanks, igor


Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-04 Thread Igor Stoppa
Hi,
I suspect this was accidentally a Reply-To instead of a Reply-All,
so I'm putting back the CCs that were dropped.

On 03/05/17 21:41, Dave Hansen wrote:
> On 05/03/2017 05:06 AM, Igor Stoppa wrote:
>> My starting point are the policy DB of SE Linux and the LSM Hooks, but
>> eventually I would like to extend the protection also to other
>> subsystems, in a way that can be merged into mainline.
> 
> Have you given any thought to just having a set of specialized slabs?

No, the idea of the RFC was to get this sort of comments about options I
might have missed :-)

> Today, for instance, we have a separate set of kmalloc() slabs for DMA:
> dma-kmalloc-{4096,2048,...}.  It should be quite possible to have
> another set for your post-init-read-only protected data.

I will definitely investigate it and report back, thanks.
But In the meanwhile I'd appreciate further clarifications.
Please see below ...

> This doesn't take care of vmalloc(), but I have the feeling that
> implementing this for vmalloc() isn't going to be horribly difficult.

ok

>> * The mechanism used for locking down the memory region is to program
>> the MMU to trap writes to said region. It is fairly efficient and
>> HW-backed, so it doesn't introduce any major overhead,
> 
> I'd take a bit of an issue with this statement.  It *will* fracture
> large pages unless you manage to pack all of these allocations entirely
> within a large page.  This is problematic because we use the largest
> size available, and that's 1GB on x86.

I am not sure I fully understand this part.
I am probably missing some point about the way kmalloc works.

I get the problem you describe, but I do not understand why it should
happen.
Going back for a moment to my original idea of the zone, as a physical
address range, why wouldn't it be possible to define it as one large page?

Btw, I do not expect to have much memory occupation, in terms of sheer
size, although there might be many small "variables" scattered across
the code. That's where I hope using kmalloc, instead of a custom made
allocator can make a difference, in terms of optimal occupation.

> IOW, if you scatter these things throughout the address space, you may
> end up fracturing/demoting enough large pages to cause major overhead
> refilling the TLB.

But why would I?
Or, better, what would cause it, unless I take special care?

Or, let me put it differently: my goal is to not fracture more pages
than needed.
It will probably require some profiling to figure out what is the
ballpark of the memory footprint.

I might have overlooked some aspect of this, but the overall goal
is to have a memory range (I won't call it zone, to avoid referring to a
specific implementation) which is as tightly packed as possible, stuffed
with all the data that is expected to become read-only.

> Note that this only applies for kmalloc() allocations, *not* vmalloc()
> since kmalloc() uses the kernel linear map and vmalloc() uses it own,
> separate mappings.

Yes.

---
thanks, igor


Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-05 Thread Igor Stoppa


On 04/05/17 17:01, Michal Hocko wrote:
> On Thu 04-05-17 16:37:55, Igor Stoppa wrote:

[...]

>> The disadvantage is that anything can happen, undetected, while the seal
>> is lifted.
> 
> Yes and I think this makes it basically pointless

ok, this goes a bit beyond what I had in mind initially, but I see your
point

[...]

> Just to make my proposal more clear. I suggest the following workflow
> 
> cache = kmem_cache_create(foo, object_size, ..., SLAB_SEAL);
>
> obj = kmem_cache_alloc(cache, gfp_mask);
> init_obj(obj)
> [more allocations]
> kmem_cache_seal(cache);

In case one doesn't want the feature, at which point would it be disabled?

* not creating the slab
* not sealing it
* something else?

> All slab pages belonging to the cache would get write protection. All
> new allocations from this cache would go to new slab pages. Later
> kmem_cache_seal will write protect only those new pages.

ok

> The main discomfort with this approach is that you have to create those
> caches in advance, obviously. We could help by creating some general
> purpose caches for common sizes but this sound like an overkill to me.
> The caller will know which objects will need the protection so the
> appropriate cache can be created on demand. But this reall depends on
> potential users...

Yes, I provided a more detailed answer in another branch of this thread.
Right now I can answer only for what I have already looked into: SE
Linux policy DB and LSM Hooks, and they do not seem very large.

I do not expect a large footprint, overall, although there might be some
exception.


--
igor



Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-05 Thread Igor Stoppa
On 04/05/17 19:49, Laura Abbott wrote:
> [adding kernel-hardening since I think there would be interest]

thank you, I overlooked this


> BPF takes the approach of calling set_memory_ro to mark regions as
> read only. I'm certainly over simplifying but it sounds like this
> is mostly a mechanism to have this happen mostly automatically.
> Can you provide any more details about tradeoffs of the two approaches?

I am not sure I understand the question ...
For what I can understand, the bpf is marking as read only something
that spans across various pages, which is fine.
The payload to be protected is already organized in such pages.

But in the case I have in mind, I have various, heterogeneous chunks of
data, coming from various subsystems, not necessarily page aligned.
And, even if they were page aligned, most likely they would be far
smaller than a page, even a 4k page.

The first problem I see, is how to compact them into pages, ensuring
that no rwdata manages to infiltrate the range.

The actual mechanism for marking pages as read only is not relevant at
this point, if I understand your question correctly, since set_memory_ro
is walking the pages it receives as parameter.

> arm and arm64 have the added complexity of using larger
> page sizes on the linear map so dynamic mapping/unmapping generally
> doesn't work. 

Do you mean that a page could be 16MB and therefore it would not be
possible to get a smaller chunk?

> arm64 supports DEBUG_PAGEALLOC by mapping with only
> pages but this is generally only wanted as a debug mechanism.
> I don't know if you've given this any thought at all.

Since the beginning I have thought about this feature as an opt-in
feature. I am aware that it can have drawbacks, but I think it would be
valuable as debugging tool even where it's not feasible to keep it
always-on.

OTOH on certain systems it can be sufficiently appealing to be kept on,
even if it eats up some more memory.

If this doesn't answer your question, could you please detail it more?

---
thanks, igor


Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-05 Thread Igor Stoppa
On 04/05/17 20:24, Dave Hansen wrote:
> On 05/04/2017 07:01 AM, Michal Hocko wrote:
>> Just to make my proposal more clear. I suggest the following workflow
>>
>> cache = kmem_cache_create(foo, object_size, ..., SLAB_SEAL);
>>
>> obj = kmem_cache_alloc(cache, gfp_mask);
>> init_obj(obj)
>> [more allocations]
>> kmem_cache_seal(cache);
>>
>> All slab pages belonging to the cache would get write protection. All
>> new allocations from this cache would go to new slab pages. Later
>> kmem_cache_seal will write protect only those new pages.
> 
> Igor, what sizes of objects are you after here, mostly?

Theoretically, anything, since I have not really looked in details into
all the various subsystems, however, taking a more pragmatical approach
and referring to SE Linux and LSM Hooks, which were my initial target,

For SE Linux, I'm taking as example the policy db [1]:
The sizes are mostly small-ish: from 4-6 bytes to 16-32, overall.
There are some exceptions: the main policydb structure is way larger,
but it's not supposed to be instantiated repeatedly.


For LSM Hooks, the sublists in that hydra which goes under the name of
struct security_hook_heads, which are of type struct security_hook_list,
so a handful of bytes for the generic element [2].


> I ask because slub, at least, doesn't work at all for objects
> > PAGE_SIZE.  It just punts those to the page allocator.  But, you
> _could_ still use vmalloc() for those.


I would be surprised to find many objects that are larger than PAGE_SIZE
and qualify for post-init-read-only protection, even if the page size
was only 4kB.

From that perspective, I'm more concerned about avoiding taking a lot of
pages and leaving them mostly unused.

[1] security/selinux/ss/policydb.h
[2] include/linux/lsm_hooks.h


Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-04 Thread Igor Stoppa
On 04/05/17 14:21, Michal Hocko wrote:
> On Wed 03-05-17 15:06:36, Igor Stoppa wrote:

[...]

>> * In most, if not all, the cases that could be enhanced, the code will
>> be calling kmalloc/vmalloc, indicating GFP_KERNEL as the desired type of
>> memory.
> 
> How do you tell that the seal is active?

The simpler way would be to define the seal as something that is applied
only after late init has concluded.

IOW, if the kernel has already started user-space, the seal is in place.

I do acknowledge that this conflicts with the current implementation of
SE Linux, but it might be possible to extend SE Linux to have a
predefined configuration file that is loaded during kernel init.

In general this is not acceptable, but OTOH IMA does it, so there could
be ground also for advocating similar (optional) behavior for SE Linux.

Should that not be possible, then yes, I should provide some way (ioctl,
sysfs/something else) that can be used to apply the seal.

In such case there should be also some helper function which allows to
confirm that the seal is absent/present.

> I have also asked about the
> life time of these objects in the previous email thread. Do you expect
> those objects get freed one by one or mostly at once? Is this supposed
> to be boot time only or such allocations might happen anytime?

Yes, you did. I didn't mean to ignore the question.
I thought this question would be answered by the current RFC :-(

Alright, here's one more attempt at explaining what I have in mind.
And I might be wrong, so that would explain why it's not clear.

Once the seal is in place, the objects are effectively read-only, so the
lifetime is basically the same as the kernel text.
Since I am after providing the same functionality of
post-init-read-only, but for dynamically allocated data, I would stick
to the same behavior: once the data is read-only, it stays so forever,
or till reset/poweroff, whichever comes first.

I wonder if you are thinking about loadable modules or maybe livepatch.
My proposal, in its current form, is only about what is done when the
kernel initialization is performed. So it would not take those cases
under its umbrella. Actually it might be incompatible with livepatch, if
any of the read-only data is supposed to be updated.

Since it's meant to improve the current level of integrity, I would
prefer to have a progressive approach and address modules/livepatch in a
later phase, if this is not seen as a show stopper.

[...]

> The most immediate suggestion would be to extend SLAB caches with a new
> sealing feature.

Yes, I got few hours ago the same advice also from Dave Hansen (thanks,
btw) [1].

I had just not considered the option.

> Roughly it would mean that once kmem_cache_seal() is
> called on a cache it would changed page tables to used slab pages to RO
> state. This would obviously need some fiddling to make those pages not
> usable for new allocations from sealed pages. It would also mean some
> changes to kfree path but I guess this is doable.

Ok, as it probably has already become evident, I have just started
peeking into the memory subsystem, so this is the sort of guidance I was
hoping I could receive =) - thank you

Question: I see that some pages can be moved around. Would this apply to
the slab-based solution, or can I assume that once I have certain
physical pages sealed, they will not be altered anymore?

>> * While I do not strictly need a new memory zone, memory zones are what
>> kmalloc understands at the moment: AFAIK, it is not possible to tell
>> kmalloc from which memory pool it should fish out the memory, other than
>> having a reference to a memory zone.
> 
> As I've said already. I think that a zone is a completely wrong
> approach. How would it help anyway. It is the allocator on top of the
> page allocator which has to do clever things to support sealing.


Ok, as long as there is a way forward that fits my needs and has the
possibility to be merged upstream, I'm fine with it.

I suppose zones are the first thing one meets when reading the code, so
they are probably the first target that comes to mind.
That's what happened to me.

I will probably come back with further questions, but I can then start
putting together some prototype of what you described.

I am fine with providing a generic solution, but I must make sure that
it works with slub. I suppose what you proposed will do it, right?

TBH, from what little I have been reading so far, I find it a bit confusing
that there are some header files referring separately to slab,
slub and slob, but then common code still refers to slab (slab.h, slab.c
and slab_common.c, for example).


[1] https://marc.info/?l=linux-kernel&m=149388596106305&w=2


---
thanks, igor


Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-04 Thread Igor Stoppa
On 04/05/17 16:11, Michal Hocko wrote:
> On Thu 04-05-17 15:14:10, Igor Stoppa wrote:

> I believe that this is a fundamental question. Sealing sounds useful
> for after-boot usecases as well and it would change the approach
> considerably. Coming up with an ad-hoc solution for the boot only way
> seems like a wrong way to me. And as you've said SELinux which is your
> target already does the thing after the early boot.

I didn't spend too many thoughts on this so far, because the zone-based
approach seemed almost doomed, so I wanted to wait for the evolution of
the discussion :-)

The main question here is granularity, I think.

At least, as first cut, the simpler approach would be to have a master
toggle: when some legitimate operation needs to happen, the seal is
lifted across the entire range, then it is put back in place, once the
operation has concluded.

Simplicity is the main advantage.

The disadvantage is that anything can happen, undetected, while the seal
is lifted.
OTOH the amount of code that could backfire should be fairly limited, so
it doesn't seem a huge issue to me.

The alternative would be to somehow know what a write will change and
make only the appropriate page(s) writable. But it seems overkill to me.
Especially because in some cases, with huge pages, everything would fit
anyway in one page.

One more option that comes to mind - but I do not know how realistic it
would be - is to have multiple slabs, to be used for different purposes.
Ex: one for the monolithic kernel and one for modules.
It wouldn't help for livepatch, though, as it can modify both, so both
would have to be unprotected.

But live-patching is potentially a far less frequent event than module
loading/unloading (thinking about USB gadgets, for example).

[...]

> Slab pages are not migrateable currently. Even if they start being
> migrateable it would be an opt-in because that requires pointers tracking
> to make sure they are updated properly.

ok

[...]

> I haven't researched that too deeply. In principle both SLAB and SLUB
> maintain slab pages in a similar way so I do not see any fundamental
> problems.


good, then I could proceed with the prototype, if there are no further
objections/questions and we agree that, implementation aside, there are
no obvious fundamental problems preventing the merge


---
thanks, igor



Re: RFC v2: post-init-read-only protection for data allocated dynamically

2017-05-05 Thread Igor Stoppa
On 04/05/17 17:30, Dave Hansen wrote:
> On 05/04/2017 01:17 AM, Igor Stoppa wrote:
>> Or, let me put it differently: my goal is to not fracture more pages
>> than needed.
>> It will probably require some profiling to figure out what is the
>> ballpark of the memory footprint.
> 
> This is easy to say, but hard to do.  What if someone loads a different
> set of LSMs, or uses a very different configuration?  How could this
> possibly work generally without vastly over-reserving in most cases?

I am probably making some implicit assumptions.
Let me try to make them explicit and let's see if they survive public
scrutiny, and btw while writing it, I see that it probably won't :-S

Observations


* The memory that might need sealing is less or equal to the total
  r/w memory - whatever that might be.

* In practice only a subset of the r/w memory will qualify for sealing.

* The over-reserving might be abysmal, in terms of percentage of
  actually used memory, but it might not affect too much the system, in
  absolute terms.

* On my machine (Ubuntu 16.10 64bit):

  ~$ dmesg |grep Memory
  [0.00] Memory: 32662956K/33474640K available (8848K kernel
  code, 1441K rwdata, 3828K rodata, 1552K init, 1296K bss, 811684K
  reserved, 0K cma-reserved)

* This is the memory at boot, I am not sure what would be the right way
to get the same info at runtime.


Speculations


* after loading enough modules, the rwdata is 2-3 times larger

* the amount of rwdata that can be converted to rodata is 50%;
  this is purely a working assumption I am making, as I have no
  measurement yet and needs to be revised.

* on a system like mine, it would mean 2-3 MB
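
  (Back-of-the-envelope, using the dmesg line above: 1441K of boot-time
  rwdata, grown ~3x by modules, is about 4.3 MB, and half of that is a
  bit over 2 MB.)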


Conclusions
---

* 2-3 MB with possibly 50% of utilization might be an acceptable
compromise for a distro - as user I probably wouldn't mind too much.

* if someone is not happy with the distro defaults, every major distro
provides means to reconfigure and rebuild its kernel (the expectation is
that the only distro users who are not happy are those who would
probably reconfigure the kernel anyway, like a data center)

* non distro-users, like mobile, embedded, IoT, etc would do
optimizations and tweaking also without this feature mandating it.

--

In my defense, I can only say that my idea for this feature was to make
it opt-in, where if one chooses to enable it, it is known upfront
what it will entail.
Now we are talking about distros, with the feature enabled by default.

TBH I am not sure there even is a truly generic solution, because we are
talking about dynamically allocated data, where the amount is not known
upfront (if it was, probably the data would be static).


I have the impression that it's a situation like:
- efficient memory occupation
- no need for profiling
- non fragmented pages

Choose 2 of them.


Of course, there might be a better way, but I haven't found it yet,
other than the usual way out: make it a command line option and let the
unhappy user modify the command line that the bootloader passes to the
kernel.

[...]

> I'm starting with the assumption that a new zone isn't feasible. :)

I really have no bias: I have a problem and I am trying to solve it.
I think the improvement could be useful also for others.

If the problem can be solved in a better way than what I proposed, it is
still good for me.

---
igor


Re: [PATCH v7 0/7] Introduce ZONE_CMA

2017-05-02 Thread Igor Stoppa
On 02/05/17 16:03, Michal Hocko wrote:

> I can imagine that we could make ZONE_CMA configurable in a way that
> only very well defined use cases would be supported so that we can save
> page flags space. But this alone sounds like a maintainability nightmare
> to me. Especially when I consider ZONE_DMA situation. There is simply
> not an easy way to find out whether my HW really needs DMA zone or
> not. Most probably not but it still is configured and hidden behind
> config ZONE_DMA
> bool "DMA memory allocation support" if EXPERT
> default y
> help
>   DMA memory allocation support allows devices with less than 32-bit
>   addressing to allocate within the first 16MB of address space.
>   Disable if no such devices will be used.
> 
>   If unsure, say Y.
> 
> Are we really ready to add another thing like that? How are distribution
> kernels going to handle that?

In practice there are 2 quite opposite scenarios:

- distros that try to cater to (almost) everyone and are constrained in
what they can leave out

- ad-hoc builds (like Android, but also IoT) where the HW is *very* well
known upfront, because it's probably even impossible to make any change
that doesn't involve a rework station.

So maybe the answer is to not have only EXPERT, but rather DISTRO/CUSTOM
with the implications these can bring.

A generic build would assume to be a DISTRO type, but something else, of
more embedded persuasion, could do otherwise.

ZONE_DMA / ZONE_DMA32 actually seem to be perfect candidates for being
replaced by something else, when unused, as I proposed on Friday:

http://marc.info/?l=linux-mm&m=149337033630993&w=2


It might still be that only some cases would be upstreamable, even after
these changes.

But at least some of those might be useful also for non-Android/ non-IoT
scenarios.


---
igor


Re: [PATCH 1/1] Sealable memory support

2017-05-31 Thread Igor Stoppa
On 28/05/17 21:23, Kees Cook wrote:
> On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa <igor.sto...@huawei.com> wrote:

[...]

>> If the CPU1 were to forcibly halt anything that can race with it, then
>> it would be sure that there was no interference.
> 
> Correct. This is actually what ARM does for doing kernel memory
> writing when poking stuff for kprobes, etc. It's rather dramatic,
> though. :)

ok

>> A reactive approach could be, instead, to re-validate the content after
>> the sealing, assuming that it is possible.
> 
> I would prefer to avoid this, as that allows an attacker to still have
> made the changes (which could even result in them then disabling the
> re-validation during the attack).

ok


[...]

>> If you are talking about an attacker, rather than protection against
>> accidental overwrites, how hashing can be enough?
>> Couldn't the attacker compromise that too?
> 
> In theory, yes, though the goal was to dedicate a register to the
> hash, which would make it hard/impossible for an attacker to reach.
> (The BPF JIT situation is just an example of this kind of race,
> though. I'm still in favor of reducing the write window to init-time
> from full run-time.)

It looks like some compromise is needed between reliability and
performance ...

[...]

> I would expect other people would NAK using "stop all other CPUs"
> approach. Better to have the CPU-local writes.

ok, that could be a feature for the follow-up?

[...]

> Yeah, I don't have any better idea for names. :)

still clinging to smalloc for now, how about pmalloc,
as "protected malloc"?


> We might be talking past each-other. Right now, the LSM is marked with
> __ro_after_init, which will make all the list structures, entries, etc
> read-only already. There is one case where this is not true, and
> that's for CONFIG_SECURITY_WRITABLE_HOOKS for
> CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of
> SELinux. 

On this I'll just shut up and provide the patch. It seems easier :-)

> Are you talking about the SELinux policy installed during
> early boot? Which things did you want to use smalloc() on?

The policy DB.
And of course the policy cache has to go, as in: it must be possible to
disable it completely, in favor of re-evaluating the policies, when needed.

I'm pretty sure that it is absolutely needed in certain cases, but in
others I think the re-evaluation is negligible in comparison to the
overall duration of the use cases affected.

So it should be ok to let the system integrator gauge speed (if any) vs
resilience, by enabling/disabling the SE Linux policy cache.


>>> It seems like smalloc pools could also be refcounted?
> 
> I meant things that point into an smalloc() pool could keep a refcount
> and when nothing was pointing into it, it could be destroyed. (i.e.
> using refcount garbage collection instead of explicit destruction.)

ok, but it depends on the level of paranoia that we are talking about.

Counting smalloc vs sfree is easy.

Verifying that the free invocations are actually aligned with previous
allocations is a bit more onerous.

Validating that the free is done from the same entity that invoked the
smalloc might be hard.

>> And what for?
> 
> It might be easier to reason about later if allocations get complex.
> It's certainly not required for the first version of this.

good :-)
more clarifications are needed

[...]

> I don't really have an opinion on this. It might be more readable with
> a named structure?

Eventually I got rid of the macro.
The structure is intentionally unnamed to stress the grouping, without
adding a layer of indirection. But maybe it could be removed. I need to
test it.

>> One more thing: how should I tie this allocator to the rest?
>> I have verified that is seems to work with both SLUB and SLAB.
>> Can I make it depend on either of them being enabled?
> 
> It seems totally unrelated. The only relationship I see would be
> interaction with hardened usercopy. In a perfect world, none of the
> smalloc pools would be allowed to be copied to/from userspace, which
> would make integration really easy: if smalloc_pool(ptr) return NOPE;
> :P

I went down the harder path and implemented the check.

>> Should it be optionally enabled?
>> What to default to, if it's not enabled? vmalloc?
> 
> I don't see any reason to make it optional.

great \o/

So now I'm back to the LSM use case.
It shouldn't take too long.

--
thanks, igor


Re: [PATCH 2/5] Protectable Memory Allocator

2017-06-06 Thread Igor Stoppa
On 06/06/17 09:25, Christoph Hellwig wrote:
> On Tue, Jun 06, 2017 at 01:44:32PM +0900, Tetsuo Handa wrote:

[..]

>> As far as I know, not all CONFIG_MMU=y architectures provide
>> set_memory_ro()/set_memory_rw(). You need to provide fallback for
>> architectures which do not provide set_memory_ro()/set_memory_rw()
>> or kernels built with CONFIG_MMU=n.
> 
> I think we'll just need to generalize CONFIG_STRICT_MODULE_RWX and/or
> ARCH_HAS_STRICT_MODULE_RWX so there is a symbol to key this off.

Would STRICT_KERNEL_RWX work? It's already present.
If both kernel text and rodata can be protected, so can pmalloc data.
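
To illustrate what that symbol buys: protecting a pool then boils down
to walking its nodes and write-protecting their pages, roughly like
this (sketch only, assuming each vmalloc'ed node starts on a page
boundary, as in the current layout):

  static void pool_protect_sketch(struct pmalloc_pool *pool)
  {
          struct pmalloc_node *node;

          hlist_for_each_entry(node, &pool->nodes_list_head, nodes_list)
                  set_memory_ro((unsigned long)node,
                                (node->total_words * WORD_SIZE +
                                 HEADER_SIZE) / PAGE_SIZE);
  }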

---
igor



Re: [PATCH 2/5] Protectable Memory Allocator

2017-06-06 Thread Igor Stoppa
Hi,
thanks a lot for the review. My answers are in-line below.
I have rearranged your comments because I wasn't sure how to reply to
them inlined.

On 06/06/17 07:44, Tetsuo Handa wrote:
> Igor Stoppa wrote:

[...]

> As far as I know, not all CONFIG_MMU=y architectures provide
> set_memory_ro()/set_memory_rw().

I'll follow up on this in the existing separate thread.

[...]

>> +struct pmalloc_node {
>> +struct hlist_node nodes_list;
>> +atomic_t used_words;
>> +unsigned int total_words;
>> +__PMALLOC_ALIGNED align_t data[];
>> +};
>
> Is this __PMALLOC_ALIGNED needed? Why not use "long" and "BITS_PER_LONG" ?

In an earlier version I actually asked the same question.
It is currently there because I just don't know enough about various
architectures. The idea of having "align_t" was that it could be tied
into what is the most desirable alignment for each architecture.
But I'm actually looking for advise on this.


>> +size = ((HEADER_SIZE - 1 + PAGE_SIZE) +
>> +WORD_SIZE * (unsigned long) words) & PAGE_MASK;
> 
>> +req_words = (((int)size) + WORD_SIZE - 1) / WORD_SIZE;
>
> Please use macros for round up/down.

ok

[...]


> + rcu_read_lock();
> + hlist_for_each_entry_rcu(node, &pool->nodes_list_head, nodes_list) {
> + starting_word = atomic_fetch_add(req_words, &node->used_words);
> + if (starting_word + req_words > node->total_words)
> + atomic_sub(req_words, &node->used_words);
> + else
> + goto found_node;
> + }
> + rcu_read_unlock();
> 
> You need to check for node != NULL before dereference it.

This was intentionally left out, on the ground that I'm using the kernel
macros for both populating and walking the list.
So, if I understood correctly, there shouldn't be a case where node is
NULL, right?
Unless it has been tampered/damaged. Is that what you mean?


> Also, why rcu_read_lock()/rcu_read_unlock() ? 
> I can't find corresponding synchronize_rcu() etc. in this patch.

oops. Thanks for spotting it.


> pmalloc() won't be hotpath. Enclosing whole using a mutex might be OK.
> If any reason to use rcu, rcu_read_unlock() is missing if came from "goto".

If there are no strong objections, I'd rather fix it and keep it as RCU.
Kees Cook was mentioning the possibility of implementing also
"write seldom" in a similar fashion.
In that case the path is likely to warm up.
It might be premature optimization, but I'd prefer to avoid knowingly
introduce performance issues.
Said this, I agree on the bug you spotted.


>> +const char *__pmalloc_check_object(const void *ptr, unsigned long n)
>> +{
>> +unsigned long p;
>> +
>> +p = (unsigned long)ptr;
>> +n += (unsigned long)ptr;
>> +for (; (PAGE_MASK & p) <= (PAGE_MASK & n); p += PAGE_SIZE) {
>> +if (is_vmalloc_addr((void *)p)) {
>> +struct page *page;
>> +
>> +page = vmalloc_to_page((void *)p);
>> +if (!(page && PagePmalloc(page)))
>> +return msg;
>> +}
>> +}
>> +return NULL;
>> +}
> 
> I feel that n is off-by-one if (ptr + n) % PAGE_SIZE == 0
> according to check_page_span().

Hmm,
let's say PAGE_SIZE is 0x0100 and PAGE MASK 0xFF00

Here are some cases (N = number of pages found):

 ptr      ptr + n   Pages   Test                    N

0x0005    0x00FF      1     0x0000 <= 0x00FF  true      1
0x0105    0x00FF            0x0100 <= 0x00FF  false     1

0x0005    0x0100      2     0x0000 <= 0x0100  true      1
0x0100    0x0100            0x0100 <= 0x0100  true      2
0x0200    0x0100            0x0200 <= 0x0100  false     2

0x0005    0x01FF      2     0x0000 <= 0x0100  true      1
0x0105    0x01FF            0x0100 <= 0x0100  true      2
0x0205    0x01FF            0x0200 <= 0x0100  false     2

It seems to work. If I am missing your point, could you please
use the same format of the example I made, to explain me?

I might be able to understand better.

>> +int __init pmalloc_init(void)
>> +{
>> +pmalloc_data = vmalloc(sizeof(struct pmalloc_data));
>> +if (!pmalloc_data)
>> +return -ENOMEM;
>> +INIT_HLIST_HEAD(&pmalloc_data->pools_list_head);
>> +mutex_init(&pmalloc_data->pools_list_mutex);
>> +atomic_set(&pmalloc_data->pools_count, 0);
>> +return 0;
>> +}
>> +EXPORT_SYMBOL(pmalloc_init);
>
> Why need to call pmalloc_init() from loadable kernel module?
> It has to be called very early stage of boot for only once.

Yes, this is a bug.
Actually I forgot to put in this patch the real 

Re: [PATCH 4/5] Make LSM Writable Hooks a command line option

2017-06-06 Thread Igor Stoppa


On 06/06/17 14:42, Tetsuo Handa wrote:
> Igor Stoppa wrote:
>> Who decides when enough is enough, meaning that all the needed modules
>> are loaded?
>> Should I provide an interface to user-space? A sysfs entry?
> 
> No such interface is needed. Just an API for applying set_memory_rw()
> and set_memory_ro() on LSM hooks is enough.
> 
> security_add_hooks() can call set_memory_rw() before adding hooks and
> call set_memory_ro() after adding hooks. Ditto for security_delete_hooks()
> for SELinux's unregistration.


I think this should be considered part of the 2nd phase "write seldom",
as we agreed with Kees Cook.

Right now the goal was to provide the basic API for:
- create pool
- get memory from pool
- lock the pool
- destroy the pool

And, behind the scene, verify that a memory range falls into Pmalloc pages.


Then would come the "write seldom" part.

The reason for this is that a proper implementation of write seldom
should, imho, make writable only those pages that really need to be
modified. Possibly also add some verification on the call stack about
who is requesting the unlocking.

Therefore I would feel more comfortable in splitting the work into 2 part.

For the case at hand, would it work if there was a non-API call that you
could use until the API is properly expanded?

--
igor


Re: [PATCH 2/5] Protectable Memory Allocator

2017-06-06 Thread Igor Stoppa

On 06/06/17 15:08, Tetsuo Handa wrote:
> Igor Stoppa wrote:
>>>> +struct pmalloc_node {
>>>> +  struct hlist_node nodes_list;
>>>> +  atomic_t used_words;
>>>> +  unsigned int total_words;
>>>> +  __PMALLOC_ALIGNED align_t data[];
>>>> +};
>>>
>>> Is this __PMALLOC_ALIGNED needed? Why not use "long" and "BITS_PER_LONG" ?
>>
>> In an earlier version I actually asked the same question.
>> It is currently there because I just don't know enough about various
>> architectures. The idea of having "align_t" was that it could be tied
>> into what is the most desirable alignment for each architecture.
>> But I'm actually looking for advise on this.
> 
> I think that let the compiler use natural alignment is OK.

On a 64 bit machine the preferred alignment might be either 32 or 64,
depending on the application. How can the compiler choose?


>>> You need to check for node != NULL before dereference it.
>>
>> So, if I understood correctly, there shouldn't be a case where node is
>> NULL, right?
>> Unless it has been tampered/damaged. Is that what you mean?
> 
> I meant to say
> 
> + node = __pmalloc_create_node(req_words);
> // this location.
> + starting_word = atomic_fetch_add(req_words, &node->used_words);

argh, yes


>>>> +const char *__pmalloc_check_object(const void *ptr, unsigned long n)
>>>> +{
>>>> +  unsigned long p;
>>>> +
>>>> +  p = (unsigned long)ptr;
>>>> +  n += (unsigned long)ptr;
>>>> +  for (; (PAGE_MASK & p) <= (PAGE_MASK & n); p += PAGE_SIZE) {
>>>> +  if (is_vmalloc_addr((void *)p)) {
>>>> +  struct page *page;
>>>> +
>>>> +  page = vmalloc_to_page((void *)p);
>>>> +  if (!(page && PagePmalloc(page)))
>>>> +  return msg;
>>>> +  }
>>>> +  }
>>>> +  return NULL;
>>>> +}
>>>
>>> I feel that n is off-by-one if (ptr + n) % PAGE_SIZE == 0
>>> according to check_page_span().
>>
>> It seems to work. If I am missing your point, could you please
>> use the same format of the example I made, to explain me?
> 
> If ptr == NULL and n == PAGE_SIZE so that (ptr + n) % PAGE_SIZE == 0,
> this loop will access two pages (one page containing p == 0 and another
> page containing p == PAGE_SIZE) when this loop should access only one
> page containing p == 0. When checking n bytes, it's range is 0 to n - 1.

oh, so:

p = (unsigned long) ptr;
n = p + n - 1;
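
With that fix applied, the helper (this is how it appears in the later
2/4 posting) reads:

  const char *__pmalloc_check_object(const void *ptr, unsigned long n)
  {
          unsigned long p;

          p = (unsigned long)ptr;
          n = p + n - 1;   /* last byte of the range, not one past it */
          for (; (PAGE_MASK & p) <= (PAGE_MASK & n); p += PAGE_SIZE) {
                  if (is_vmalloc_addr((void *)p)) {
                          struct page *page;

                          page = vmalloc_to_page((void *)p);
                          if (!(page && PagePmalloc(page)))
                                  return msg;
                  }
          }
          return NULL;
  }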


--
igor


[PATCH 2/5] Protectable Memory Allocator

2017-06-05 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, this works efficiently only when said pages contain only data
that does not need to be modified.

This can work well for statically allocated variables, however it does
not fit too well the case of dynamically allocated ones.

Dynamic allocation does not provide, currently, means for grouping
variables in memory pages that would contain exclusively data that can
be made read only.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the memory requested (over various iterations) is initialized,
the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h|  20 
 include/trace/events/mmflags.h |   1 +
 mm/Makefile|   2 +-
 mm/pmalloc.c   | 227 +
 mm/usercopy.c  |  24 +++--
 6 files changed, 266 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..acc0723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -81,6 +81,7 @@ enum pageflags {
PG_active,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
PG_slab,
+   PG_pmalloc,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -274,6 +275,7 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, 
active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
+__PAGEFLAG(Pmalloc, pmalloc, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
 
 /* Xen */
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..83d3557
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,20 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+
+struct pmalloc_pool *pmalloc_create_pool(const char *name);
+void *pmalloc(unsigned long size, struct pmalloc_pool *pool);
+int pmalloc_protect_pool(struct pmalloc_pool *pool);
+int pmalloc_destroy_pool(struct pmalloc_pool *pool);
+#endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94..41d1587 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -91,6 +91,7 @@
{1UL << PG_lru, "lru"   },  \
{1UL << PG_active,  "active"},  \
{1UL << PG_slab,"slab"  },  \
+   {1UL << PG_pmalloc, "pmalloc"   },  \
{1UL << PG_owner_priv_1,"owner_priv_1"  },  \
{1UL << PG_arch_1,  "arch_1"},  \
{1UL << PG_reserved,"reserved"  },  \
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..79dd99c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -25,7 +25,7 @@ mmu-y := nommu.o
 mmu-$(CONFIG_MMU)  := gup.o highmem.o memory.o mincore.o \
   mlock.o mmap.o mprotect.o mremap.o msync.o \
   page_vma_mapped.o pagewalk.o pgtable-generic.o \
-  rmap.o vmalloc.o
+  rmap.o vmalloc.o pmalloc.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
new file mode 100644
index 000..c73d60c
--- /dev/null
+++ b/mm/pmalloc.c
@@ -0,0 +1,227 @@
+/*
+ * pmalloc.c: Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. 

[PATCH 1/5] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-06-05 Thread Igor Stoppa
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Commit 3dfc9b02864b19f4 ("LSM: Initialize security_hook_heads upon
registration.") treats "struct security_hook_heads" as an implicit array
of "struct list_head" so that we can eliminate code for static
initialization. Although we haven't encountered compilers which do not
treat sizeof(security_hook_heads) != sizeof(struct list_head) *
(sizeof(security_hook_heads) / sizeof(struct list_head)), Casey does not
like the assumption that a structure of N elements can be assumed to be
the same as an array of N elements.

Now that Kees found that randstruct complains about such casting

  security/security.c: In function 'security_init':
  security/security.c:59:20: note: found mismatched op0 struct pointer types: 
'struct list_head' and 'struct security_hook_heads'

struct list_head *list = (struct list_head *) &security_hook_heads;

and Christoph thinks that we should fix it rather than make randstruct
whitelist it, this patch fixes it.

It would be possible to revert commit 3dfc9b02864b19f4, but this patch
converts security_hook_heads into an explicit array of struct list_head
by introducing an enum, due to reasons explained below.

Igor proposed a sealable memory allocator, and the LSM hooks
("struct security_hook_heads security_hook_heads" and
"struct security_hook_list ...[]") will benefit from that allocator via
protection using set_memory_ro()/set_memory_rw(), and that allocator
will remove CONFIG_SECURITY_WRITABLE_HOOKS config option. Thus, we will
likely be moving to that direction.

This means that these structures will be allocated at run time using
that allocator, and therefore the address of these structures will be
determined at run time rather than compile time.

But currently, LSM_HOOK_INIT() macro depends on the address of
security_hook_heads being known at compile time. If we use an enum
so that LSM_HOOK_INIT() macro does not need to know absolute address of
security_hook_heads, it will help us to use that allocator for LSM hooks.
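
Roughly speaking, the initializer could then record an index instead of
an absolute address; a sketch of the idea (the field names here are
illustrative only, not necessarily the ones the patch below uses):

  #define LSM_HOOK_INIT(HEAD, HOOK) \
          { .head_index = LSM_##HEAD, .hook = { .HEAD = HOOK } }

Registration code can later resolve LSM_##HEAD against whichever array
of list heads is handed out at run time.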

As a result of introducing an enum, security_hook_heads becomes a local
variable. In order to pass 80 columns check by scripts/checkpatch.pl ,
rename security_hook_heads to hook_heads.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Cc: Kees Cook <keesc...@chromium.org>
Cc: Paul Moore <p...@paul-moore.com>
Cc: Stephen Smalley <s...@tycho.nsa.gov>
Cc: Casey Schaufler <ca...@schaufler-ca.com>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Igor Stoppa <igor.sto...@huawei.com>
Cc: Christoph Hellwig <h...@infradead.org>
---
 include/linux/lsm_hooks.h | 412 +++---
 security/security.c   |  31 ++--
 2 files changed, 223 insertions(+), 220 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 080f34e..ac22be3 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1663,219 +1663,220 @@ union security_list_options {
 #endif /* CONFIG_AUDIT */
 };
 
-struct security_hook_heads {
-   struct list_head binder_set_context_mgr;
-   struct list_head binder_transaction;
-   struct list_head binder_transfer_binder;
-   struct list_head binder_transfer_file;
-   struct list_head ptrace_access_check;
-   struct list_head ptrace_traceme;
-   struct list_head capget;
-   struct list_head capset;
-   struct list_head capable;
-   struct list_head quotactl;
-   struct list_head quota_on;
-   struct list_head syslog;
-   struct list_head settime;
-   struct list_head vm_enough_memory;
-   struct list_head bprm_set_creds;
-   struct list_head bprm_check_security;
-   struct list_head bprm_secureexec;
-   struct list_head bprm_committing_creds;
-   struct list_head bprm_committed_creds;
-   struct list_head sb_alloc_security;
-   struct list_head sb_free_security;
-   struct list_head sb_copy_data;
-   struct list_head sb_remount;
-   struct list_head sb_kern_mount;
-   struct list_head sb_show_options;
-   struct list_head sb_statfs;
-   struct list_head sb_mount;
-   struct list_head sb_umount;
-   struct list_head sb_pivotroot;
-   struct list_head sb_set_mnt_opts;
-   struct list_head sb_clone_mnt_opts;
-   struct list_head sb_parse_opts_str;
-   struct list_head dentry_init_security;
-   struct list_head dentry_create_files_as;
+enum security_hook_index {
+   LSM_binder_set_context_mgr,
+   LSM_binder_transaction,
+   LSM_binder_transfer_binder,
+   LSM_binder_transfer_file,
+   LSM_ptrace_access_check,
+   LSM_ptrace_traceme,
+   LSM_capget,
+   LSM_capset,
+   LSM_capable,
+   LSM_quotactl,
+   LSM_quota_on,
+   LSM_syslog,
+   LSM_settime,
+   LSM_vm_enough_memory,
+   LSM_bprm_set_creds,
+   LSM_bprm_check_se

[no subject]

2017-06-05 Thread Igor Stoppa
Subject: [RFC v4 PATCH 0/5] NOT FOR MERGE - ro protection for dynamic data

This patchset introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a pool is made R/O, all the memory
that is part of it will become R/O.

A R/O pool can be destroyed to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional and this feature is meant for data that doesn't need
further modifications, after initialization.

An example is provided, showing how to turn into a boot-time option the
writable state of the security hooks.
Prior to this patch, it was a compile-time option.

This is made possible, thanks to Tetsuo Handa's rework of the hooks
structure (included in the patchset).

Notes:

* I have performed some preliminary tests on qemu x86_64 and the changes
  seem to hold, but more extensive testing is required.

* I'll be AFK for about a week, so I preferred to share this version, even
  if not thoroughly tested, in the hope of getting preliminary comments, but
  it is rough around the edges.

Igor Stoppa (4):
  Protectable Memory Allocator
  Protectable Memory Allocator - Debug interface
  Make LSM Writable Hooks a command line option
  NOT FOR MERGE - Protectable Memory Allocator test

Tetsuo Handa (1):
  LSM: Convert security_hook_heads into explicit array of struct
list_head

 include/linux/lsm_hooks.h  | 412 -
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h|  20 ++
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 mm/Kconfig |  11 ++
 mm/Makefile|   4 +-
 mm/pmalloc.c   | 340 ++
 mm/pmalloc_test.c  | 172 +
 mm/usercopy.c  |  24 ++-
 security/security.c|  58 --
 11 files changed, 814 insertions(+), 232 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/pmalloc_test.c

-- 
2.9.3



[PATCH 4/5] Make LSM Writable Hooks a command line option

2017-06-05 Thread Igor Stoppa
This patch shows how it is possible to take advantage of pmalloc:
instead of using the build-time option __lsm_ro_after_init to decide
whether the hooks can stay modifiable, this now becomes a boot-time
decision, based on the kernel command line.

This patch relies on:

"Convert security_hook_heads into explicit array of struct list_head"
Author: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

to break free from the static constraint imposed by the previous
hardening model, based on __ro_after_init.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 init/main.c |  2 ++
 security/security.c | 29 ++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/init/main.c b/init/main.c
index f866510..7850887 100644
--- a/init/main.c
+++ b/init/main.c
@@ -485,6 +485,7 @@ static void __init mm_init(void)
ioremap_huge_init();
 }
 
+extern int __init pmalloc_init(void);
 asmlinkage __visible void __init start_kernel(void)
 {
char *command_line;
@@ -653,6 +654,7 @@ asmlinkage __visible void __init start_kernel(void)
proc_caches_init();
buffer_init();
key_init();
+   pmalloc_init();
security_init();
dbg_late_init();
vfs_caches_init();
diff --git a/security/security.c b/security/security.c
index c492f68..4285545 100644
--- a/security/security.c
+++ b/security/security.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/pmalloc.h>
 #include 
 
 #define MAX_LSM_EVM_XATTR  2
@@ -33,8 +34,17 @@
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX  10
 
-static struct list_head hook_heads[LSM_MAX_HOOK_INDEX]
-   __lsm_ro_after_init;
+static int security_debug;
+
+static __init int set_security_debug(char *str)
+{
+   get_option(&str, &security_debug);
+   return 0;
+}
+early_param("security_debug", set_security_debug);
+
+static struct list_head *hook_heads;
+static struct pmalloc_pool *sec_pool;
 char *lsm_names;
 /* Boot-time LSM user choice */
 static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =
@@ -59,6 +69,13 @@ int __init security_init(void)
 {
enum security_hook_index i;
 
+   sec_pool = pmalloc_create_pool("security");
+   if (!sec_pool)
+   goto error_pool;
+   hook_heads = pmalloc(sizeof(struct list_head) * LSM_MAX_HOOK_INDEX,
+sec_pool);
+   if (!hook_heads)
+   goto error_heads;
for (i = 0; i < LSM_MAX_HOOK_INDEX; i++)
INIT_LIST_HEAD(&hook_heads[i]);
pr_info("Security Framework initialized\n");
@@ -74,8 +91,14 @@ int __init security_init(void)
 * Load all the remaining security modules.
 */
do_security_initcalls();
-
+   if (!security_debug)
+   pmalloc_protect_pool(sec_pool);
return 0;
+
+error_heads:
+   pmalloc_destroy_pool(sec_pool);
+error_pool:
+   return -ENOMEM;
 }
 
 /* Save user chosen LSM */
-- 
2.9.3



[PATCH 3/5] Protectable Memory Allocator - Debug interface

2017-06-05 Thread Igor Stoppa
Debugfs interface: it creates a file

/sys/kernel/debug/pmalloc/pools

which exposes statistics about all the pools and memory nodes in use.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 mm/Kconfig   |  11 ++
 mm/pmalloc.c | 113 +++
 2 files changed, 124 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index beb7a45..dfbdc07 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -539,6 +539,17 @@ config CMA_AREAS
 
  If unsure, leave the default value "7".
 
+config PMALLOC_DEBUG
+bool "Protectable Memory Allocator debugging"
+depends on DEBUG_KERNEL
+default y
+help
+  Debugfs support for dumping information about memory pools.
+  It shows internal stats: free/used/total space, protection
+  status, data overhead, etc.
+
+  If unsure, say "y".
+
 config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
index c73d60c..6dd6bbe 100644
--- a/mm/pmalloc.c
+++ b/mm/pmalloc.c
@@ -225,3 +225,116 @@ int __init pmalloc_init(void)
return 0;
 }
 EXPORT_SYMBOL(pmalloc_init);
+
+#ifdef CONFIG_PMALLOC_DEBUG
+#include 
+static struct dentry *pmalloc_root;
+
+static void *__pmalloc_seq_start(struct seq_file *s, loff_t *pos)
+{
+   if (*pos)
+   return NULL;
+   return pos;
+}
+
+static void *__pmalloc_seq_next(struct seq_file *s, void *v, loff_t *pos)
+{
+   return NULL;
+}
+
+static void __pmalloc_seq_stop(struct seq_file *s, void *v)
+{
+}
+
+static __always_inline
+void __seq_printf_node(struct seq_file *s, struct pmalloc_node *node)
+{
+   unsigned long total_space, node_pages, end_of_node,
+ used_space, available_space;
+   int total_words, used_words, available_words;
+
+   used_words = atomic_read(&node->used_words);
+   total_words = node->total_words;
+   available_words = total_words - used_words;
+   used_space = used_words * WORD_SIZE;
+   total_space = total_words * WORD_SIZE;
+   available_space = total_space - used_space;
+   node_pages = (total_space + HEADER_SIZE) / PAGE_SIZE;
+   end_of_node = total_space + HEADER_SIZE + (unsigned long) node;
+   seq_printf(s, " - node:\t\t%p\n", node);
+   seq_printf(s, "   - start of data ptr:\t%p\n", node->data);
+   seq_printf(s, "   - end of node ptr:\t%p\n", (void *)end_of_node);
+   seq_printf(s, "   - total words:\t%d\n", total_words);
+   seq_printf(s, "   - used words:\t%d\n", used_words);
+   seq_printf(s, "   - available words:\t%d\n", available_words);
+   seq_printf(s, "   - pages:\t\t%lu\n", node_pages);
+   seq_printf(s, "   - total space:\t%lu\n", total_space);
+   seq_printf(s, "   - used space:\t%lu\n", used_space);
+   seq_printf(s, "   - available space:\t%lu\n", available_space);
+}
+
+static __always_inline
+void __seq_printf_pool(struct seq_file *s, struct pmalloc_pool *pool)
+{
+   struct pmalloc_node *node;
+
+   seq_printf(s, "pool:\t\t\t%p\n", pool);
+   seq_printf(s, " - name:\t\t%s\n", pool->name);
+   seq_printf(s, " - protected:\t\t%u\n", pool->protected);
+   seq_printf(s, " - nodes count:\t\t%u\n",
+  atomic_read(&pool->nodes_count));
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(node, &pool->nodes_list_head, nodes_list)
+   __seq_printf_node(s, node);
+   rcu_read_unlock();
+}
+
+static int __pmalloc_seq_show(struct seq_file *s, void *v)
+{
+   struct pmalloc_pool *pool;
+
+   seq_printf(s, "pools count:\t\t%u\n",
+  atomic_read(&pmalloc_data->pools_count));
+   seq_printf(s, "page size:\t\t%lu\n", PAGE_SIZE);
+   seq_printf(s, "word size:\t\t%lu\n", WORD_SIZE);
+   seq_printf(s, "node header size:\t%lu\n", HEADER_SIZE);
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(pool, &pmalloc_data->pools_list_head,
+pools_list)
+   __seq_printf_pool(s, pool);
+   rcu_read_unlock();
+   return 0;
+}
+
+static const struct seq_operations pmalloc_seq_ops = {
+   .start = __pmalloc_seq_start,
+   .next  = __pmalloc_seq_next,
+   .stop  = __pmalloc_seq_stop,
+   .show  = __pmalloc_seq_show,
+};
+
+static int __pmalloc_open(struct inode *inode, struct file *file)
+{
+   return seq_open(file, &pmalloc_seq_ops);
+}
+
+static const struct file_operations pmalloc_file_ops = {
+   .owner   = THIS_MODULE,
+   .open= __pmalloc_open,
+   .read= seq_read,
+   .llseek  = seq_lseek,
+   .release = seq_release
+};
+
+
+static int __init 

[RFC v5 PATCH 0/4] NOT FOR MERGE - ro protection for dynamic data

2017-06-06 Thread Igor Stoppa
This patchset introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a pool is made R/O, all the memory
that is part of it will become R/O.

A R/O pool can be destroyed to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications, after initialization.

An example is provided, showing how to turn into a boot-time option the
writable state of the security hooks.
Prior to this patch, it was a compile-time option.

This is made possible, thanks to Tetsuo Handa's rework of the hooks
structure (included in the patchset).

Since the previous version, I have applied fixes for all the issues
discovered that had a clear resolution:

- %p -> %pK
- make the feature depend on ARCH_HAS_SET_MEMORY
- fix the range of the page scanning for hardened user copy
- fixed pointer checking for NULL dereferencing


And a couple of issues I found myself:
- return NULL in case someone asks memory from a locked pool
- turn the "protected" flag into atomic type


Still open (at least I didn't get the impression there was a closure):
- need for specific __PMALLOC_ALIGNED ?
- is it really needed to unprotect a pool?
  can't it wait for the implementation of write-seldom?


Igor Stoppa (3):
  Protectable Memory Allocator
  Protectable Memory Allocator - Debug interface
  Make LSM Writable Hooks a command line option

Tetsuo Handa (1):
  LSM: Convert security_hook_heads into explicit array of struct
list_head

 include/linux/lsm_hooks.h  | 412 -
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h|  20 ++
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 mm/Kconfig |  11 ++
 mm/Makefile|   1 +
 mm/pmalloc.c   | 340 ++
 mm/usercopy.c  |  24 ++-
 security/security.c|  49 +++--
 10 files changed, 632 insertions(+), 230 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

-- 
2.9.3



[PATCH 1/4] LSM: Convert security_hook_heads into explicit array of struct list_head

2017-06-06 Thread Igor Stoppa
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

Commit 3dfc9b02864b19f4 ("LSM: Initialize security_hook_heads upon
registration.") treats "struct security_hook_heads" as an implicit array
of "struct list_head" so that we can eliminate code for static
initialization. Although we haven't encountered compilers for which
sizeof(security_hook_heads) != sizeof(struct list_head) *
(sizeof(security_hook_heads) / sizeof(struct list_head)), Casey does not
like the assumption that a structure of N elements is laid out the same
as an array of N elements.

Now that Kees found that randstruct complains about such casting

  security/security.c: In function 'security_init':
  security/security.c:59:20: note: found mismatched op0 struct pointer types: 
'struct list_head' and 'struct security_hook_heads'

struct list_head *list = (struct list_head *) &security_hook_heads;

and Christoph thinks that we should fix it rather than make randstruct
whitelist it, this patch fixes it.

It would be possible to revert commit 3dfc9b02864b19f4, but this patch
converts security_hook_heads into an explicit array of struct list_head
by introducing an enum, due to reasons explained below.

Igor proposed a sealable memory allocator, and the LSM hooks
("struct security_hook_heads security_hook_heads" and
"struct security_hook_list ...[]") will benefit from that allocator via
protection using set_memory_ro()/set_memory_rw(), and that allocator
will remove CONFIG_SECURITY_WRITABLE_HOOKS config option. Thus, we will
likely be moving to that direction.

This means that these structures will be allocated at run time using
that allocator, and therefore the address of these structures will be
determined at run time rather than compile time.

But currently, LSM_HOOK_INIT() macro depends on the address of
security_hook_heads being known at compile time. If we use an enum
so that LSM_HOOK_INIT() macro does not need to know absolute address of
security_hook_heads, it will help us to use that allocator for LSM hooks.

As a result of introducing an enum, security_hook_heads becomes a local
variable. In order to pass 80 columns check by scripts/checkpatch.pl ,
rename security_hook_heads to hook_heads.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Cc: Kees Cook <keesc...@chromium.org>
Cc: Paul Moore <p...@paul-moore.com>
Cc: Stephen Smalley <s...@tycho.nsa.gov>
Cc: Casey Schaufler <ca...@schaufler-ca.com>
Cc: James Morris <james.l.mor...@oracle.com>
Cc: Igor Stoppa <igor.sto...@huawei.com>
Cc: Christoph Hellwig <h...@infradead.org>
---
 include/linux/lsm_hooks.h | 412 +++---
 security/security.c   |  31 ++--
 2 files changed, 223 insertions(+), 220 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 080f34e..ac22be3 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1663,219 +1663,220 @@ union security_list_options {
 #endif /* CONFIG_AUDIT */
 };
 
-struct security_hook_heads {
-   struct list_head binder_set_context_mgr;
-   struct list_head binder_transaction;
-   struct list_head binder_transfer_binder;
-   struct list_head binder_transfer_file;
-   struct list_head ptrace_access_check;
-   struct list_head ptrace_traceme;
-   struct list_head capget;
-   struct list_head capset;
-   struct list_head capable;
-   struct list_head quotactl;
-   struct list_head quota_on;
-   struct list_head syslog;
-   struct list_head settime;
-   struct list_head vm_enough_memory;
-   struct list_head bprm_set_creds;
-   struct list_head bprm_check_security;
-   struct list_head bprm_secureexec;
-   struct list_head bprm_committing_creds;
-   struct list_head bprm_committed_creds;
-   struct list_head sb_alloc_security;
-   struct list_head sb_free_security;
-   struct list_head sb_copy_data;
-   struct list_head sb_remount;
-   struct list_head sb_kern_mount;
-   struct list_head sb_show_options;
-   struct list_head sb_statfs;
-   struct list_head sb_mount;
-   struct list_head sb_umount;
-   struct list_head sb_pivotroot;
-   struct list_head sb_set_mnt_opts;
-   struct list_head sb_clone_mnt_opts;
-   struct list_head sb_parse_opts_str;
-   struct list_head dentry_init_security;
-   struct list_head dentry_create_files_as;
+enum security_hook_index {
+   LSM_binder_set_context_mgr,
+   LSM_binder_transaction,
+   LSM_binder_transfer_binder,
+   LSM_binder_transfer_file,
+   LSM_ptrace_access_check,
+   LSM_ptrace_traceme,
+   LSM_capget,
+   LSM_capset,
+   LSM_capable,
+   LSM_quotactl,
+   LSM_quota_on,
+   LSM_syslog,
+   LSM_settime,
+   LSM_vm_enough_memory,
+   LSM_bprm_set_creds,
+   LSM_bprm_check_se

[PATCH 2/4] Protectable Memory Allocator

2017-06-06 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain only data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section; however, this does not fit the case of dynamically allocated
ones very well.

Dynamic allocation does not currently provide a means of grouping
variables in memory pages that contain exclusively data that can be
made read-only.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the memory requested (over various iterations) is initialized,
the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.
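
As a rough illustration, a client module might use the interface along
these lines (a minimal sketch based on the declarations in pmalloc.h
below; the pool name, struct my_config and the init/exit functions are
invented for the example):

  #include <linux/module.h>
  #include <linux/pmalloc.h>

  struct my_config {              /* example payload, not part of the API */
          int threshold;
          char tag[16];
  };

  static struct pmalloc_pool *pool;
  static struct my_config *cfg;

  static int __init my_init(void)
  {
          pool = pmalloc_create_pool("my_module");
          if (!pool)
                  return -ENOMEM;
          cfg = pmalloc(sizeof(*cfg), pool);
          if (!cfg) {
                  pmalloc_destroy_pool(pool);
                  return -ENOMEM;
          }
          cfg->threshold = 42;            /* initialize while still writable */
          pmalloc_protect_pool(pool);     /* from here on, *cfg is R/O */
          return 0;
  }

  static void __exit my_exit(void)
  {
          /* no references to cfg may survive past this point */
          pmalloc_destroy_pool(pool);
  }

  module_init(my_init);
  module_exit(my_exit);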

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 include/linux/page-flags.h |   2 +
 include/linux/pmalloc.h|  20 
 include/trace/events/mmflags.h |   1 +
 init/main.c|   2 +
 mm/Makefile|   1 +
 mm/pmalloc.c   | 227 +
 mm/usercopy.c  |  24 +++--
 7 files changed, 268 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..acc0723 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -81,6 +81,7 @@ enum pageflags {
PG_active,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
PG_slab,
+   PG_pmalloc,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -274,6 +275,7 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, 
active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
+__PAGEFLAG(Pmalloc, pmalloc, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
 
 /* Xen */
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..83d3557
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,20 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.sto...@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+
+struct pmalloc_pool *pmalloc_create_pool(const char *name);
+void *pmalloc(unsigned long size, struct pmalloc_pool *pool);
+int pmalloc_protect_pool(struct pmalloc_pool *pool);
+int pmalloc_destroy_pool(struct pmalloc_pool *pool);
+#endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94..41d1587 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -91,6 +91,7 @@
{1UL << PG_lru, "lru"   },  \
{1UL << PG_active,  "active"},  \
{1UL << PG_slab,"slab"  },  \
+   {1UL << PG_pmalloc, "pmalloc"   },  \
{1UL << PG_owner_priv_1,"owner_priv_1"  },  \
{1UL << PG_arch_1,  "arch_1"},  \
{1UL << PG_reserved,"reserved"  },  \
diff --git a/init/main.c b/init/main.c
index f866510..7850887 100644
--- a/init/main.c
+++ b/init/main.c
@@ -485,6 +485,7 @@ static void __init mm_init(void)
ioremap_huge_init();
 }
 
+extern int __init pmalloc_init(void);
 asmlinkage __visible void __init start_kernel(void)
 {
char *command_line;
@@ -653,6 +654,7 @@ asmlinkage __visible void __init start_kernel(void)
proc_caches_init();
buffer_init();
key_init();
+   pmalloc_init();
security_init();
dbg_late_init();
vfs_caches_init();
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..b47dcf8 100644
--- a/mm/Makefile
+

[PATCH 3/4] Protectable Memory Allocator - Debug interface

2017-06-06 Thread Igor Stoppa
Debugfs interface: it creates the file

/sys/kernel/debug/pmalloc/pools

which exposes statistics about all the pools and memory nodes in use.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 mm/Kconfig   |  11 ++
 mm/pmalloc.c | 113 +++
 2 files changed, 124 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index beb7a45..dfbdc07 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -539,6 +539,17 @@ config CMA_AREAS
 
  If unsure, leave the default value "7".
 
+config PMALLOC_DEBUG
+bool "Protectable Memory Allocator debugging"
+depends on DEBUG_KERNEL
+default y
+help
+  Debugfs support for dumping information about memory pools.
+  It shows internal stats: free/used/total space, protection
+  status, data overhead, etc.
+
+  If unsure, say "y".
+
 config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
index 4ca1e4a..636169c 100644
--- a/mm/pmalloc.c
+++ b/mm/pmalloc.c
@@ -225,3 +225,116 @@ int __init pmalloc_init(void)
atomic_set(&pmalloc_data->pools_count, 0);
return 0;
 }
+
+#ifdef CONFIG_PMALLOC_DEBUG
+#include 
+static struct dentry *pmalloc_root;
+
+static void *__pmalloc_seq_start(struct seq_file *s, loff_t *pos)
+{
+   if (*pos)
+   return NULL;
+   return pos;
+}
+
+static void *__pmalloc_seq_next(struct seq_file *s, void *v, loff_t *pos)
+{
+   return NULL;
+}
+
+static void __pmalloc_seq_stop(struct seq_file *s, void *v)
+{
+}
+
+static __always_inline
+void __seq_printf_node(struct seq_file *s, struct pmalloc_node *node)
+{
+   unsigned long total_space, node_pages, end_of_node,
+ used_space, available_space;
+   int total_words, used_words, available_words;
+
+   used_words = atomic_read(&node->used_words);
+   total_words = node->total_words;
+   available_words = total_words - used_words;
+   used_space = used_words * WORD_SIZE;
+   total_space = total_words * WORD_SIZE;
+   available_space = total_space - used_space;
+   node_pages = (total_space + HEADER_SIZE) / PAGE_SIZE;
+   end_of_node = total_space + HEADER_SIZE + (unsigned long) node;
+   seq_printf(s, " - node:\t\t%pK\n", node);
+   seq_printf(s, "   - start of data ptr:\t%pK\n", node->data);
+   seq_printf(s, "   - end of node ptr:\t%pK\n", (void *)end_of_node);
+   seq_printf(s, "   - total words:\t%d\n", total_words);
+   seq_printf(s, "   - used words:\t%d\n", used_words);
+   seq_printf(s, "   - available words:\t%d\n", available_words);
+   seq_printf(s, "   - pages:\t\t%lu\n", node_pages);
+   seq_printf(s, "   - total space:\t%lu\n", total_space);
+   seq_printf(s, "   - used space:\t%lu\n", used_space);
+   seq_printf(s, "   - available space:\t%lu\n", available_space);
+}
+
+static __always_inline
+void __seq_printf_pool(struct seq_file *s, struct pmalloc_pool *pool)
+{
+   struct pmalloc_node *node;
+
+   seq_printf(s, "pool:\t\t\t%pK\n", pool);
+   seq_printf(s, " - name:\t\t%s\n", pool->name);
+   seq_printf(s, " - protected:\t\t%u\n", atomic_read(>protected));
+   seq_printf(s, " - nodes count:\t\t%u\n",
+  atomic_read(>nodes_count));
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(node, &pool->nodes_list_head, nodes_list)
+   __seq_printf_node(s, node);
+   rcu_read_unlock();
+}
+
+static int __pmalloc_seq_show(struct seq_file *s, void *v)
+{
+   struct pmalloc_pool *pool;
+
+   seq_printf(s, "pools count:\t\t%u\n",
+  atomic_read(&pmalloc_data->pools_count));
+   seq_printf(s, "page size:\t\t%lu\n", PAGE_SIZE);
+   seq_printf(s, "word size:\t\t%lu\n", WORD_SIZE);
+   seq_printf(s, "node header size:\t%lu\n", HEADER_SIZE);
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(pool, &pmalloc_data->pools_list_head,
+pools_list)
+   __seq_printf_pool(s, pool);
+   rcu_read_unlock();
+   return 0;
+}
+
+static const struct seq_operations pmalloc_seq_ops = {
+   .start = __pmalloc_seq_start,
+   .next  = __pmalloc_seq_next,
+   .stop  = __pmalloc_seq_stop,
+   .show  = __pmalloc_seq_show,
+};
+
+static int __pmalloc_open(struct inode *inode, struct file *file)
+{
+   return seq_open(file, &pmalloc_seq_ops);
+}
+
+static const struct file_operations pmalloc_file_ops = {
+   .owner   = THIS_MODULE,
+   .open= __pmalloc_open,
+   .read= seq_read,
+   .llseek  = seq_lseek,
+   .release = seq_release
+};
+

[PATCH 4/4] Make LSM Writable Hooks a command line option

2017-06-06 Thread Igor Stoppa
This patch shows how it is possible to take advantage of pmalloc:
instead of using the build-time option __lsm_ro_after_init to decide
whether the hooks can stay modifiable, this now becomes a boot-time
decision, based on the kernel command line.

This patch relies on:

"Convert security_hook_heads into explicit array of struct list_head"
Author: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>

to break free from the static constraint imposed by the previous
hardening model, based on __ro_after_init.

The default value is disabled, unless SE Linux debugging is turned on.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
CC: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 security/security.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index c492f68..9b8b478 100644
--- a/security/security.c
+++ b/security/security.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/pmalloc.h>
 #include 
 
 #define MAX_LSM_EVM_XATTR  2
@@ -33,8 +34,17 @@
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX  10
 
-static struct list_head hook_heads[LSM_MAX_HOOK_INDEX]
-   __lsm_ro_after_init;
+static int dynamic_lsm = IS_ENABLED(CONFIG_SECURITY_SELINUX_DISABLE);
+
+static __init int set_dynamic_lsm(char *str)
+{
+   get_option(&str, &dynamic_lsm);
+   return 0;
+}
+early_param("dynamic_lsm", set_dynamic_lsm);
+
+static struct list_head *hook_heads;
+static struct pmalloc_pool *sec_pool;
 char *lsm_names;
 /* Boot-time LSM user choice */
 static __initdata char chosen_lsm[SECURITY_NAME_MAX + 1] =
@@ -59,6 +69,11 @@ int __init security_init(void)
 {
enum security_hook_index i;
 
+   sec_pool = pmalloc_create_pool("security");
+   BUG_ON(!sec_pool);
+   hook_heads = pmalloc(sizeof(struct list_head) * LSM_MAX_HOOK_INDEX,
+sec_pool);
+   BUG_ON(!hook_heads);
for (i = 0; i < LSM_MAX_HOOK_INDEX; i++)
INIT_LIST_HEAD(&hook_heads[i]);
pr_info("Security Framework initialized\n");
@@ -74,7 +89,8 @@ int __init security_init(void)
 * Load all the remaining security modules.
 */
do_security_initcalls();
-
+   if (!dynamic_lsm)
+   pmalloc_protect_pool(sec_pool);
return 0;
 }
 
-- 
2.9.3



Re: [PATCH 3/3] Make LSM Writable Hooks a command line option

2017-06-28 Thread Igor Stoppa
Resending my reply: I mistakenly used the wrong mail account yesterday
and my reply didn't get to the ml.

On 27/06/17 20:51, Christoph Hellwig wrote:
> On Tue, Jun 27, 2017 at 08:33:23PM +0300, Igor Stoppa wrote:

[...]

>> The default value is disabled, unless SE Linux debugging is turned on.
> 
> Can we please just force it to be read-only?

I'm sorry, I'm not quite sure I understand your comment.

I'm trying to replicate the behavior of __lsm_ro_after_init:

line 1967 @ [1]   - Did I get it wrong?

thanks, igor



[1]
https://kernel.googlesource.com/pub/scm/linux/kernel/git/jmorris/linux-security/+/5965453d5e3fb425e6f9d6b4fec403bda3f33107/include/linux/lsm_hooks.h



[PATCH 1/1] Add parentheses to macro parameters. For trivial

2017-11-22 Thread Igor Stoppa
kernel.h: Some macros are not wrapping their parameters with parentheses.
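
The class of problem being avoided is the usual one with unparenthesized
macro parameters; a generic illustration (not one of the macros touched
here):

  #define SCALE(x)    (x * 4)     /* parameter not parenthesized   */
  #define SCALE_OK(x) ((x) * 4)   /* parameter parenthesized       */

  int a = SCALE(1 + 1);           /* expands to (1 + 1 * 4) == 5   */
  int b = SCALE_OK(1 + 1);        /* expands to ((1 + 1) * 4) == 8 */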

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
Cc: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Cc: Guenter Roeck <li...@roeck-us.net>
Cc: Javi Merino <javi.mer...@arm.com>
---
 include/linux/kernel.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4b484ab..4061f46 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -112,7 +112,7 @@
 /* The `const' in roundup() prevents gcc-3.3 from calling __divdi3 */
 #define roundup(x, y) (\
 {  \
-   const typeof(y) __y = y;\
+   const typeof(y) __y = (y);  \
(((x) + (__y - 1)) / __y) * __y;\
 }  \
 )
@@ -131,8 +131,8 @@
  */
 #define DIV_ROUND_CLOSEST(x, divisor)( \
 {  \
-   typeof(x) __x = x;  \
-   typeof(divisor) __d = divisor;  \
+   typeof(x) __x = (x);\
+   typeof(divisor) __d = (divisor);\
(((typeof(x))-1) > 0 || \
 ((typeof(divisor))-1) > 0 ||   \
 (((__x) > 0) == ((__d) > 0))) ?\
@@ -146,7 +146,7 @@
  */
 #define DIV_ROUND_CLOSEST_ULL(x, divisor)( \
 {  \
-   typeof(divisor) __d = divisor;  \
+   typeof(divisor) __d = (divisor);\
unsigned long long _tmp = (x) + (__d) / 2;  \
do_div(_tmp, __d);  \
_tmp;   \
-- 
2.9.3



[PATCH 0/1] Trivial: Add parentheses to parameters in macros

2017-11-22 Thread Igor Stoppa
Some parameters are used in macros without being surrounded by parentheses.

Igor Stoppa (1):
  Add parentheses to macro parameters. For trivial

 include/linux/kernel.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

-- 
2.9.3



[PATCH 1/1] genalloc: track beginning of allocations

2017-12-18 Thread Igor Stoppa
The genalloc library is only capable of tracking if a certain unit of
allocation is in use or not.

It is not capable of discerning where the memory associated with an
allocation request begins and where it ends.

The reason is that units of allocations are tracked by using a bitmap,
where each bit represents that the unit is either allocated (1) or
available (0).

The user of the API must keep track of how much space was requested, if
it ever needs to be freed.

This can cause errors to go undetected.
Ex:
* Only a subset of the memory provided to an allocation request is freed
* The memory from a subsequent allocation is freed
* The memory being freed doesn't start at the beginning of an
  allocation.

The bitmap is used because it allows lockless read/write access,
where this is supported by hw through cmpxchg.
Similarly, it is possible to scan the bitmap for a sufficiently long
sequence of zeros, to identify zones available for allocation.

--

This patch doubles the space reserved in the bitmap for each allocation.
By using 2 bits per allocation, it is possible to encode also the
information of where the allocation starts:
(msb to the left, lsb to the right, in the following "dictionary")

11: first allocation unit in the allocation
10: any subsequent allocation unit (if any) in the allocation
00: available allocation unit
01: invalid

Ex, with the same notation as above - MSb...LSb:

 ...000010111100000010101011   <-- Read in this direction.
    \__/\__/\/\____/\______/
      |   | |    |      \___ 4 used allocation units
      |   | |    \__________ 3 empty allocation units
      |   | \_______________ 1 used allocation unit
      |   \_________________ 2 used allocation units
      \_____________________ 2 empty allocation units

Because of the encoding, the previous lockless operations are still
possible. The only caveat is to change the parameter of the zero-finding
function which establishes the alignment at which to perform the test
for first zero.
The original value of the parameter is 0, meaning that an allocation can
start at any point in the bitmap, while the new value is 1, meaning that
allocations can start only at even places (bit 0, bit 2, etc.)
The number of zeroes to look for, must therefore be doubled.

When it's time to free the memory associated to an allocation request,
it's a matter of checking if the corresponding allocation unit is really
the beginning of an allocation (both bits are set to 1).
Looking for the ending can also be performed locklessly.
It's sufficient to identify the first mapped allocation unit
that is represented either as free (00) or busy (11).
Even if the allocation status should change in the meanwhile, it doesn't
matter, since it can only transition between free (00) and
first-allocated (11).
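
As an illustration of that lookup, a simplified, self-contained sketch
(plain array of 2-bit values instead of the packed bitmap, no locking,
names invented for the example):

  enum { ENTRY_FREE = 0x0, ENTRY_BODY = 0x2, ENTRY_HEAD = 0x3 };

  /* map[i] holds the 2-bit state of allocation unit i */
  static unsigned long units_in_alloc(const unsigned char *map,
                                      unsigned long start,
                                      unsigned long total_units)
  {
          unsigned long i;

          if (map[start] != ENTRY_HEAD)
                  return 0;       /* not the beginning of an allocation */
          for (i = start + 1; i < total_units; i++)
                  if (map[i] != ENTRY_BODY)
                          break;  /* next head (11) or free (00) ends it */
          return i - start;
  }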

The parameter indicating to the *_free() function the size of the space
that should be freed is not currently removed, to facilitate the
transition, but it is verified, whenever it is not zero.
If it is set to zero, then the free function will autonomously decide the
size to be freed, by scanning the bitmap.
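
In terms of the existing genalloc API, that would allow a caller to do
something like the following (sketch; the size-0 behaviour is what this
patch introduces, not what mainline currently does):

  unsigned long addr = gen_pool_alloc(pool, 713);

  /* ... use the memory ... */

  /* With this patch, a size of 0 lets genalloc work out the length
   * itself from the bitmap; a non-zero size is still accepted, but
   * it is verified against the recorded allocation.
   */
  gen_pool_free(pool, addr, 0);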

About the implementation: the patch introduces the concept of "bitmap
entry", which has a 1:1 mapping with allocation units, while the code
being patched has a 1:1 mapping between allocation units and bits.

This means that, now, the bitmap can be extended (by following powers of
2), to track also other properties of the allocations, if ever needed.

Signed-off-by: Igor Stoppa <igor.sto...@huawei.com>
---
 include/linux/genalloc.h |   3 +-
 lib/genalloc.c   | 417 ---
 2 files changed, 289 insertions(+), 131 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 6dfec4d..a8fdabf 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 
 struct device;
 struct device_node;
@@ -75,7 +76,7 @@ struct gen_pool_chunk {
phys_addr_t phys_addr;  /* physical starting address of memory 
chunk */
unsigned long start_addr;   /* start address of memory chunk */
unsigned long end_addr; /* end address of memory chunk 
(inclusive) */
-   unsigned long bits[0];  /* bitmap for allocating memory chunk */
+   unsigned long entries[0];   /* bitmap for allocating memory chunk */
 };
 
 /*
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 144fe6b..13bc8cf 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -36,114 +36,221 @@
 #include 
 #include 
 
+#define ENTRY_ORDER 1UL
+#define ENTRY_MASK ((1UL << ((ENTRY_ORDER) + 1UL)) - 1UL)
+#define ENTRY_HEAD ENTRY_MASK
+#define ENTRY_UNUSED 0UL
+#define BITS_PER_ENTRY (1U << ENTRY_ORDER)
+#define BITS_DIV_ENTRIES(x) ((x) >> ENTRY_ORDER)
+#define ENTRIES_TO_BITS(x) ((x) << ENTRY_ORDER)
+#define BITS_DIV_LONGS(x) ((x) / BITS_PER_LONG)
+#define ENTRIES_DIV_LON

[RFC PATCH 0/1] genalloc: track beginning of allocations

2017-12-18 Thread Igor Stoppa
genalloc could be improved to know how to separate the memory used by
adjacent allocations.

This patch was generated as part of the effort of introducing in the
kernel an allocator for protectable memory (pmalloc).

However, it seems that the patch could have a value of its own.
It can:
- verify that the freeing of memory is consistent with previous allocations
- relieve the user of the API from tracking the size of each allocation
- enable use cases where generic code can free memory allocations received
  through a pointer (provided that the reference pool is known)

Details about the implementation are provided in the comment for the patch.

I mentioned this idea a few months ago, as part of the pmalloc discussion,
but then I did not have time to follow-up immediately, as I had hoped.

This is an implementation of what I had in mind.
It seems to withstand several simple test cases I put together, but it
definitely would need thorough review.


I hope I have added all the relevant people as reviewers.
If I missed someone, please add them to the recipients.


Igor Stoppa (1):
  genalloc: track beginning of allocations

 include/linux/genalloc.h |   3 +-
 lib/genalloc.c   | 417 ---
 2 files changed, 289 insertions(+), 131 deletions(-)

-- 
2.9.3



Re: [RFC 0/3] Safe, dynamically (un)loadable LSMs

2017-12-01 Thread Igor Stoppa


On 30/11/17 04:28, Casey Schaufler wrote:
> On 11/26/2017 2:15 PM, Sargun Dhillon wrote:
>> This patchset introduces safe dynamic LSM support. It does this via
>> SRCU-protected security hooks. It also EXPORT_SYMBOL_GPLs the symbols
>> required to perform runtime loading, and unloading. The patchset is
>> meant to introduce as little overhead as possible when not used.
>> Additionally, the functionality is disabled by default.
> 
> Can you explain the value in being able to unload a security module?
> I can see having a throttle on an active module, but what do you gain
> by being able to unload it? Can it possibly be worth the inevitable
> memory leaks and almost certain dangling pointers? The restrictions on
> a security module that can work safely in this environment are significant.
> I don't see any point in unloading a module that could work with those
> restrictions. The overhead of making it unloadable is likely to exceed
> the overhead of running it.

There is also another aspect: a readily available "unload" functionality
means that if an attacker is capable of modifying some function pointer,
the unload capability provides an easier way to get rid of the module,
compared to having to poke into memory (like it typically happens when
disabling SELinux, where the attacker flips one of the various state
variables that control how strict it is).

Unload capability might be useful during development, but I am not
really sure it would be a good idea in a production system.

--
igor


Re: [RFC: Coding Style] Best way to split a long function declaration with modifiers

2018-05-12 Thread Igor Stoppa

On 12/05/18 18:41, Joe Perches wrote:


I personally like more the former, not to mention that it uses also one
line less, but it seems less common in the sources.
The coding style references do not seem to say anything explicit about
which style to prefer.


thank you, I could provide a patch to the docs for this case, if it's 
not considered too much of a corner case.


--
igor


[RFC: Coding Style] Best way to split a long function declaration with modifiers

2018-05-12 Thread Igor Stoppa

Hi,
I have been wondering if it's ok to break a long (function declaration) 
line in the following way:


static __always_inline
struct foo_bar *__get_foo_bar(type1 parm1, type2 parm2, type3 parm3)


instead of:

static __always_inline struct foo_bar *__get_foo_bar(type1 parm1,
 type2 parm2,
 type3 parm3)


I personally like more the former, not to mention that it uses also one 
line less, but it seems less common in the sources.
The coding style references do not seem to say anything explicit about 
which style to prefer.


And not all the code in the kernel is of the same quality, so finding an 
example doesn't automatically mean that it's a good practice to follow :-)


--
thanks, igor

