Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> On Thu, 16 Mar 2017, Till Smejkal wrote:
> > On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> > > Why do we need yet another mechanism to represent something which looks
> > > like a file instead of simply using existing mechanisms and extend them?
> > 
> > You are right. I also recognized during the discussion with Andy, Chris,
> > Matthew, Luck, Rich and the others that there are already other
> > techniques in the Linux kernel that can achieve the same functionality
> > when combined. As I said also to the others, I will drop the VAS segments
> > for future versions. The first class virtual address space feature was
> > the more interesting part of the patchset anyway.
> 
> While you are at it, could you please drop this 'first class' marketing as
> well? It has zero technical value, really.

Yes, of course. I am sorry for the trouble that I have already caused.

Thanks
Till



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> Why do we need yet another mechanism to represent something which looks
> like a file instead of simply using existing mechanisms and extend them?

You are right. I also recognized during the discussion with Andy, Chris,
Matthew, Luck, Rich and the others that there are already other techniques
in the Linux kernel that can achieve the same functionality when combined.
As I said also to the others, I will drop the VAS segments for future
versions. The first class virtual address space feature was the more
interesting part of the patchset anyway.

Thanks
Till



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Wed, 15 Mar 2017, Luck, Tony wrote:
> On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote:
> > I don't agree here. VAS segments are basically in-memory files that are
> > handled by the kernel directly without using a file system. Hence, if an
> > application uses a VAS segment to store data, the same rules apply as if
> > it used a file. Everything that it saves in the VAS segment might be
> > accessible by other applications. An application using VAS segments
> > should be aware of this fact. In addition, the resources that are
> > represented by a VAS segment are not leaked. As I said, VAS segments are
> > much like files. Hence, if you don't want to use them any more, delete
> > them. But as with files, the kernel will not delete them for you
> > (although something like this can be added).
> 
> So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)?
> 
> Apart from VAS having better names, instead of silly "key_t key" ones.

Unfortunately, I have to admit that the VAS segments don't differ from shm*
a lot. The implementation is different, but the functionality that you can
achieve with it is very similar. I am sorry. We should have looked more
closely at the whole functionality that is provided by the shmem subsystem
before working on VAS segments.
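
For reference, a minimal sketch of the existing System V interface that
already provides this functionality (error handling omitted; the key and
the size are illustrative):

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <string.h>

    int main(void)
    {
        /* Create (or look up) a 1 MiB segment under a well-known key. */
        int id = shmget(ftok("/tmp", 'V'), 1 << 20, IPC_CREAT | 0600);

        /* Attach it; unrelated tasks using the same key see the same
         * memory. */
        char *p = shmat(id, NULL, 0);
        strcpy(p, "shared between unrelated tasks");

        shmdt(p);                       /* detach; the segment persists */
        /* shmctl(id, IPC_RMID, NULL);     ... until explicitly removed */
        return 0;
    }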

However, VAS segments are not the key part of this patch set. The more
interesting functionality in our opinion is the introduction of first class
virtual address spaces and what they can be used for. VAS segments were
just another logical step for us (from first class virtual address spaces
to first class virtual address space segments), but since their
functionality can be achieved with various other already existing features
of the Linux kernel, I will probably drop them in future versions of the
patchset.

Thanks
Till



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > One advantage of VAS segments is that they can be globally queried by
> > user programs, which means that VAS segments can be shared by
> > applications that do not necessarily have to be related. If I am not
> > mistaken, MAP_SHARED of pure in-memory data will only work if the tasks
> > that share the memory region are related (i.e. have a common parent
> > that initialized the shared mapping). Otherwise, the shared mapping has
> > to be backed by a file.
> 
> What's wrong with memfd_create()?
> 
> > VAS segments on the other hand allow sharing of pure in-memory data by
> > arbitrary, not necessarily related, tasks without the need of a file.
> > This becomes especially interesting if one combines VAS segments with
> > non-volatile memory, since one can keep data structures in the NVM and
> > still be able to share them between multiple tasks.
> 
> What's wrong with regular mmap?

I never wanted to say that there is something wrong with regular mmap. We
just figured that with VAS segments you could remove the need to mmap your
shared data and instead keep everything purely in memory.

Unfortunately, I am not at full speed with memfds. Is my understanding
correct that if the last user of such a file descriptor closes it, the
corresponding memory is freed? Accordingly, memfd cannot be used to keep
data in memory while no program is currently using it, can it? To be able
to do this, you again need some representation of the data in a file. Yes,
you can use a tmpfs to keep the file content in memory as well, or some DAX
filesystem to keep the file content in NVM, but this always requires that
such filesystems are mounted in the system that the application is
currently running on. VAS segments on the other hand would provide the
means to achieve the same without the need for any mounted filesystem.
However, I agree that this is just a small advantage compared to what can
already be achieved with the existing functionality provided by the Linux
kernel. I probably need to revisit the whole idea of first class virtual
address space segments before continuing with this patchset. Thank you very
much for the great feedback.
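
For comparison, a minimal sketch of the memfd route (the glibc wrapper for
memfd_create() only appeared in glibc 2.27; older systems need
syscall(SYS_memfd_create, ...)):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = memfd_create("segment", MFD_CLOEXEC);

        ftruncate(fd, 1 << 20);

        void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        /* The fd can be handed to unrelated tasks over a unix socket or
         * /proc/<pid>/fd -- but, as discussed above, the memory is gone
         * once the last fd is closed. */
        (void)p;
        return 0;
    }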

> >> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> >> >> and not make it look magically different depending on which process
> >> >> maps it?  If you need a trampoline (which you do, of course), just
> >> >> write a trampoline in regular user code and map it manually.
> >> >
> >> > Did I understand you correctly that you are proposing that the
> >> > switching thread should make sure by itself that its code, stack, …
> >> > memory regions are properly set up in the new AS before/after
> >> > switching into it? I think this would make using first class virtual
> >> > address spaces much more difficult for user applications, to the
> >> > extent that I am not even sure if they can be used at all. At the
> >> > moment, switching into a VAS is a very simple operation for an
> >> > application because the kernel will just simply do the right thing.
> >>
> >> Yes.  I think that having the same mm_struct look different from
> >> different tasks is problematic.  Getting it right in the arch code is
> >> going to be nasty.  The heuristics of what to share are also tough --
> >> why would text + data + stack or whatever you're doing be adequate?
> >> What if you're in a thread?  What if two tasks have their stacks in
> >> the same place?
> >
> > The different ASes that a task can now have when it uses first class
> > virtual address spaces are not realized in the kernel by using only one
> > mm_struct per task that just looks different, but by using multiple
> > mm_structs - one for each AS that the task can execute in. When a task
> > attaches a first class virtual address space to itself to be able to
> > use another AS, the kernel adds a temporary mm_struct to this task that
> > contains the mappings of the first class virtual address space and the
> > ones shared with the task's original AS. If a thread now wants to
> > switch into this attached first class virtual address space, the kernel
> > only changes the 'mm' and 'active_mm' pointers in the task_struct of
> > the thread to the temporary mm_struct and performs the corresponding
> > mm_switch operation. The original mm_struct of the thread will not be
> > changed.
> >
> > Accordingly, I do not magically make mm_structs look different
> > depending on the task that uses them, but create temporary mm_structs
> > that only contain mappings to the same memory regions.
> 
> This sounds complicated and fragile.  What happens if a heuristically
> shared region coincides with a region in the "first class address
> space" being selected?

If such a conflict happens, the task cannot use the first class address
space and the corresponding system call will return an error.

Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-16 Thread Till Smejkal
On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal
> <till.smej...@googlemail.com> wrote:
> > On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> >> > One advantage of VAS segments is that they can be globally queried
> >> > by user programs, which means that VAS segments can be shared by
> >> > applications that do not necessarily have to be related. If I am not
> >> > mistaken, MAP_SHARED of pure in-memory data will only work if the
> >> > tasks that share the memory region are related (i.e. have a common
> >> > parent that initialized the shared mapping). Otherwise, the shared
> >> > mapping has to be backed by a file.
> >>
> >> What's wrong with memfd_create()?
> >>
> >> > VAS segments on the other hand allow sharing of pure in-memory data
> >> > by arbitrary, not necessarily related, tasks without the need of a
> >> > file. This becomes especially interesting if one combines VAS
> >> > segments with non-volatile memory, since one can keep data
> >> > structures in the NVM and still be able to share them between
> >> > multiple tasks.
> >>
> >> What's wrong with regular mmap?
> >
> > I never wanted to say that there is something wrong with regular mmap.
> > We just figured that with VAS segments you could remove the need to
> > mmap your shared data and instead keep everything purely in memory.
> 
> memfd does that.

Yes, that's right. Thanks for giving me the pointer to this. I should have
researched more carefully before starting to work on VAS segments.

> > VAS segments on the other hand would provide the means to achieve the
> > same without the need for any mounted filesystem. However, I agree that
> > this is just a small advantage compared to what can already be achieved
> > with the existing functionality provided by the Linux kernel.
> 
> I see this "small advantage" as "resource leak and security problem".

I don't agree here. VAS segments are basically in-memory files that are
handled by the kernel directly without using a file system. Hence, if an
application uses a VAS segment to store data, the same rules apply as if it
used a file. Everything that it saves in the VAS segment might be
accessible by other applications. An application using VAS segments should
be aware of this fact. In addition, the resources that are represented by a
VAS segment are not leaked. As I said, VAS segments are much like files.
Hence, if you don't want to use them any more, delete them. But as with
files, the kernel will not delete them for you (although something like
this can be added).

> >> This sounds complicated and fragile.  What happens if a heuristically
> >> shared region coincides with a region in the "first class address
> >> space" being selected?
> >
> > If such a conflict happens, the task cannot use the first class address
> > space and the corresponding system call will return an error. However,
> > with the currently available virtual address space size that programs
> > can use, such conflicts are probably rare.
> 
> A bug that hits 1% of the time is often worse than one that hits 100%
> of the time because debugging it is miserable.

I don't agree that this is a bug at all. If there is a conflict in the
memory layout of the ASes, the application simply cannot use this first
class virtual address space. Every application that wants to use first
class virtual address spaces should check for error return values and
handle them.

This situation is similar to mapping a file at some special address in
memory because the file contains pointer-based data structures and the
application wants to use them, but the kernel cannot map the file at this
particular position in the application's AS because there is already a
different, conflicting mapping. If an application wants to do such things,
it should also handle all the errors that can occur.
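
A sketch of that analogous file case (MAP_FIXED would silently replace an
existing mapping, so the address is passed as a hint and the result is
checked instead, mirroring the error handling a failed VAS attach needs):

    #include <sys/mman.h>
    #include <stddef.h>

    void *map_at(int fd, size_t len, void *want)
    {
        void *p = mmap(want, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        if (p == MAP_FAILED)
            return NULL;
        if (p != want) {        /* a conflicting mapping was in the way */
            munmap(p, len);     /* the caller must handle this, just    */
            return NULL;        /* like an error from a VAS attach      */
        }
        return p;
    }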

Till



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-14 Thread Till Smejkal
On Tue, 14 Mar 2017, Chris Metcalf wrote:
> On 3/14/2017 12:12 PM, Till Smejkal wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
> > > <till.smej...@googlemail.com> wrote:
> > > > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > > > This sounds rather complicated.  Getting TLB flushing right seems
> > > > > tricky.  Why not just map the same thing into multiple mms?
> > > > This is exactly what happens at the end. The memory region that is
> > > > described by the VAS segment will be mapped in the ASes that use
> > > > the segment.
> > > So why is this kernel feature better than just doing MAP_SHARED
> > > manually in userspace?
> > One advantage of VAS segments is that they can be globally queried by
> > user programs, which means that VAS segments can be shared by
> > applications that do not necessarily have to be related. If I am not
> > mistaken, MAP_SHARED of pure in-memory data will only work if the tasks
> > that share the memory region are related (i.e. have a common parent
> > that initialized the shared mapping). Otherwise, the shared mapping has
> > to be backed by a file.
> 
> True, but why is this bad?  The shared mapping will be memory resident
> regardless, even if backed by a file (unless swapped out under heavy
> memory pressure, but arguably that's a feature anyway).  More importantly,
> having a file name is a simple and consistent way of identifying such
> shared memory segments.
> 
> With a little work, you can also arrange to map such files into memory
> at a fixed address in all participating processes, thus making internal
> pointers work correctly.

I don't want to say that the interface provided by MAP_SHARED is bad. I am
only arguing that VAS segments and the interface that they provide have an
advantage over the existing ones in my opinion. However, Matthew Wilcox
also suggested in an earlier mail that VAS segments could be exported to
user space via a special purpose filesystem. This would enable users of VAS
segments to also just use some special files to set up the shared memory
regions. But since the VAS segment itself already knows where it has to be
mapped in the virtual address space of the process, establishing the shared
memory region would be very easy for the user.

> > VAS segments on the other hand allow sharing of pure in-memory data by
> > arbitrary, not necessarily related, tasks without the need of a file.
> > This becomes especially interesting if one combines VAS segments with
> > non-volatile memory, since one can keep data structures in the NVM and
> > still be able to share them between multiple tasks.
> 
> I am not fully up to speed on NV/pmem stuff, but isn't that exactly what
> the DAX mode is supposed to allow you to do?  If so, isn't sharing a
> mapped file on a DAX filesystem on top of pmem equivalent to what
> you're proposing?

If I read the documentation on DAX filesystems correctly, it is indeed
possible to use them to create files that live purely in NVM. I wasn't
fully aware of this feature. Thanks for the pointer.
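
A sketch of the DAX route Chris describes (the mount point and file name
are made up; this assumes a filesystem mounted with -o dax on a pmem
device):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/pmem/shared.dat", O_RDWR | O_CREAT, 0600);

        ftruncate(fd, 1 << 20);

        /* Loads and stores go directly to the NVM, bypassing the page
         * cache; unrelated processes can open and map the same file. */
        void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        (void)p;
        return 0;
    }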

However, the main contribution of this patchset is actually the idea of
first class virtual address spaces and that they can be used to allow
processes to have multiple different views on the system's main memory. For
us, VAS segments were another logical step in the same direction (from
first class virtual address spaces to first class virtual address space
segments). However, if there is already functionality in the Linux kernel
to achieve the exact same behavior, there is no real need to add VAS
segments. I will continue thinking about them and either find a different
situation where the currently available interface is not sufficient/too
complicated or drop VAS segments from future versions of the patch set.

Till



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-14 Thread Till Smejkal
On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
> <till.smej...@googlemail.com> wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> >> This sounds rather complicated.  Getting TLB flushing right seems
> >> tricky.  Why not just map the same thing into multiple mms?
> >
> > This is exactly what happens at the end. The memory region that is
> > described by the VAS segment will be mapped in the ASes that use the
> > segment.
> 
> So why is this kernel feature better than just doing MAP_SHARED
> manually in userspace?

One advantage of VAS segments is that they can be globally queried by user
programs, which means that VAS segments can be shared by applications that
do not necessarily have to be related. If I am not mistaken, MAP_SHARED of
pure in-memory data will only work if the tasks that share the memory
region are related (i.e. have a common parent that initialized the shared
mapping). Otherwise, the shared mapping has to be backed by a file. VAS
segments on the other hand allow sharing of pure in-memory data by
arbitrary, not necessarily related, tasks without the need of a file. This
becomes especially interesting if one combines VAS segments with
non-volatile memory, since one can keep data structures in the NVM and
still be able to share them between multiple tasks.

> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> >> and not make it look magically different depending on which process
> >> maps it?  If you need a trampoline (which you do, of course), just
> >> write a trampoline in regular user code and map it manually.
> >
> > Did I understand you correctly that you are proposing that the
> > switching thread should make sure by itself that its code, stack, …
> > memory regions are properly set up in the new AS before/after switching
> > into it? I think this would make using first class virtual address
> > spaces much more difficult for user applications, to the extent that I
> > am not even sure if they can be used at all. At the moment, switching
> > into a VAS is a very simple operation for an application because the
> > kernel will just simply do the right thing.
> 
> Yes.  I think that having the same mm_struct look different from
> different tasks is problematic.  Getting it right in the arch code is
> going to be nasty.  The heuristics of what to share are also tough --
> why would text + data + stack or whatever you're doing be adequate?
> What if you're in a thread?  What if two tasks have their stacks in
> the same place?

The different ASes that a task can now have when it uses first class
virtual address spaces are not realized in the kernel by using only one
mm_struct per task that just looks different, but by using multiple
mm_structs - one for each AS that the task can execute in. When a task
attaches a first class virtual address space to itself to be able to use
another AS, the kernel adds a temporary mm_struct to this task that
contains the mappings of the first class virtual address space and the ones
shared with the task's original AS. If a thread now wants to switch into
this attached first class virtual address space, the kernel only changes
the 'mm' and 'active_mm' pointers in the task_struct of the thread to the
temporary mm_struct and performs the corresponding mm_switch operation. The
original mm_struct of the thread will not be changed.

Accordingly, I do not magically make mm_structs look different depending on
the task that uses them, but create temporary mm_structs that only contain
mappings to the same memory regions.
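
To illustrate, a rough sketch of that switch step (this is not the
patchset's actual code; reference counting and error handling are left
out):

    #include <linux/sched.h>
    #include <asm/mmu_context.h>

    /* Redirect the calling thread to the temporary mm_struct that was
     * built at attach time; the thread's original mm survives in the VAS
     * bookkeeping and is not modified. */
    static void vas_switch_sketch(struct task_struct *tsk,
                                  struct mm_struct *temp_mm)
    {
        struct mm_struct *old_active = tsk->active_mm;

        task_lock(tsk);
        tsk->mm = temp_mm;
        tsk->active_mm = temp_mm;
        task_unlock(tsk);

        /* Arch-level page-table and TLB context switch. */
        switch_mm(old_active, temp_mm, tsk);
    }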

I agree that finding a good heuristic of what to share is difficult. At the
moment, all memory regions that are available in the task's original AS
will also be available when a thread switches into an attached first class
virtual address space (i.e. they are shared). That means that a VAS can
mainly be used to extend the AS of a task in the current state of the
implementation. The reason why I implemented the sharing in this way is
that I didn't want to break shared libraries. If I only shared
code+heap+stack, shared libraries would not work anymore after switching
into a VAS.

> I could imagine something like a sigaltstack() mode that lets you set
> a signal up to also switch mm could be useful.

This is a very interesting idea. I will keep it in mind for future use
cases of multiple virtual address spaces per task.

Thanks
Till


Re: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'

2017-03-14 Thread Till Smejkal
On Tue, 14 Mar 2017, David Laight wrote:
> From: Linuxppc-dev Till Smejkal
> > Sent: 13 March 2017 22:14
> > The only way until now to create a new memory map was via the exported
> > function 'mm_alloc'. Unfortunately, this function not only allocates a new
> > memory map, but also completely initializes it. However, with the
> > introduction of first class virtual address spaces, some initialization
> > steps done in 'mm_alloc' are not applicable to the memory maps needed for
> > this feature and hence would lead to errors in the kernel code.
> > 
> > Instead of introducing a new function that can allocate and initialize
> > memory maps for first class virtual address spaces and potentially
> > duplicate some code, I decided to split the mm_alloc function as well as
> > the 'mm_init' function that it uses.
> > 
> > Now there are four functions exported instead of only one. The new
> > 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
> > want to have the old behavior of mm_alloc one can use the newly introduced
> > function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
> > also fully initializes it.
> ...
> 
> That looks like bugs waiting to happen.
> You need unchanged code to fail to compile.

Thank you for this hint. I can give the new mm_alloc function a different
name so that code that uses the *old* mm_alloc function will fail to
compile. I just reused the old name when I wrote the code, because mm_alloc
was only used in very few locations in the kernel (2 times in the whole
kernel source), which made identifying and changing them very easy. I also
don't think that there will be many users in the kernel for mm_alloc in the
future, because it is a relatively low level data structure. But if it is
better to use a different name for the new function, I am very happy to
change this.

Till



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-14 Thread Till Smejkal
On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal
> <till.smej...@googlemail.com> wrote:
> > This patchset extends the kernel memory management subsystem with a new
> > type of address spaces (called VAS) which can be created and destroyed
> > independently of processes by a user in the system. During its lifetime
> > such a VAS can be attached to processes by the user which allows a process
> > to have multiple address spaces and thereby multiple, potentially
> > different, views on the system's main memory. During its execution the
> > threads belonging to the process are able to switch freely between the
> > different attached VAS and the process' original AS enabling them to
> > utilize the different available views on the memory.
> 
> Sounds like the old SKAS feature for UML.

I haven't heard of this feature before, but after briefly looking at the
description on the UML website, it actually has some similarities with what
I am proposing. But as far as I can see, this was not merged into the
mainline kernel, was it? In addition, I think that first class virtual
address spaces go even one step further by allowing ASes to live
independently of processes.

> > In addition to the concept of first class virtual address spaces, this
> > patchset introduces yet another feature called VAS segments. VAS segments
> > are memory regions which have a fixed size and position in the virtual
> > address space and can be shared between multiple first class virtual
> > address spaces. Such shareable memory regions are especially useful for
> > in-memory pointer-based data structures or other pure in-memory data.
> 
> This sounds rather complicated.  Getting TLB flushing right seems
> tricky.  Why not just map the same thing into multiple mms?

This is exactly what happens at the end. The memory region that is
described by the VAS segment will be mapped in the ASes that use the
segment.

> >
> >         |  VAS    |  processes  |
> > --------+---------+-------------+
> > switch  |  468ns  |  1944ns     |
> 
> The solution here is IMO to fix the scheduler.

IMHO it will be very difficult for the scheduler code to reach the same
switching time as the pure VAS switch, because switching between VASes does
not involve saving any registers or FPU state and does not require
selecting the next runnable task. A VAS switch is basically a system call
that just changes the AS of the current thread, which makes it a very
lightweight operation.

> Also, FWIW, I have patches (that need a little work) that will make
> switch_mm() wy faster on x86.

These patches will also improve the speed of the VAS switch operation. We
are also using the switch_mm function in the background to perform the
actual hardware switch between the two ASes. The main reason why the VAS
switch is faster than the task switch is that it just has to do fewer
things.

> > At the current state of the development, first class virtual address spaces
> > have one limitation, that we haven't been able to solve so far. The feature
> > allows, that different threads of the same process can execute in different
> > AS at the same time. This is possible, because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
> 
> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> and not make it look magically different depending on which process
> maps it?  If you need a trampoline (which you do, of course), just
> write a trampoline in regular user code and map it manually.

Did I understand you correctly that you are proposing that the switching
thread should make sure by itself that its code, stack, … memory regions
are properly set up in the new AS before/after switching into it? I think
this would make using first class virtual address spaces much more
difficult for user applications, to the extent that I am not even sure if
they can be used at all. At the moment, switching into a VAS is a very
simple operation for an application because the kernel will just simply do
the right thing.
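
A hypothetical sketch of that workflow from user space (the vas_* calls
stand in for syscall(__NR_vas_create, ...) and friends; the signatures
below are assumptions for illustration, not the patchset's actual ABI):

    int vid = vas_create("scratch", 0600);  /* create a named VAS       */

    vas_attach(getpid(), vid, O_RDWR);      /* attach it to this task   */
    vas_switch(vid);                        /* run in the attached VAS  */

    /* ... work on the alternate view of memory; code, stack, etc. were
     * replicated by the kernel, as described above ... */

    vas_switch(0);                          /* back to the original AS  */
    vas_detach(getpid(), vid);
    vas_delete(vid);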

Till


Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
Hi Vineet,

On Mon, 13 Mar 2017, Vineet Gupta wrote:
> I've not looked at the patches closely (or read the references paper fully 
> yet),
> but at first glance it seems on ARC architecture, we can can potentially
> use/leverage this mechanism to implement the shared TLB entries. Before anyone
> shouts these are not same as the IA64/x86 protection keys which allow TLB 
> entries
> with different protection bits across processes etc. These TLB entries are
> actually *shared* by processes.
> 
> Conceptually there's shared address spaces, independent of processes. e.g. 
> ldso
> code is shared address space #1, libc (code) #2  System can support a 
> limited
> number of shared addr spaces (say 64, enough for typical embedded sys).
> 
> While Normal TLB entries are tagged with ASID (Addr space ID) to keep them 
> unique
> across processes, Shared TLB entries are tagged with Shared address space ID.
> 
> A process MMU context consists of ASID (a single number) and a SASID bitmap 
> (to
> allow "subscription" to multiple Shared spaces. The subscriptions are set up 
> bu
> userspace ld.so which knows about the libs process wants to map.
> 
> The restriction ofcourse is that the spaces are mapped at *same* vaddr is all
> participating processes. I know this goes against whole security, address 
> space
> randomization - but it gives much better real time performance. Why does each
> process need to take a MMU exception for libc code...
> 
> So long story short - it seems there can be multiple uses of this 
> infrastructure !

During the development of this code, we also looked at shared TLB entries,
but the other way around. We wanted to use them to prevent flushing of TLB
entries of shared memory regions when switching between multiple ASes.
Unfortunately, we never finished this part of the code.

However, we also investigated a different use-case for first class virtual
address spaces that is related to what you propose, if I didn't
misunderstand something. The idea is to move shared libraries into their
own first class virtual address space and only load some small trampoline
code in the application's AS. This trampoline code performs the VAS switch
into the library's AS and executes the requested function there. If we
combine this architecture with tagged TLB entries to prevent TLB flushes
during the switch operation, it can also reach an acceptable performance. A
side effect of moving the shared library into its own AS is that it cannot
be used in ROP attacks, because it is not accessible in the application's
AS.
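
A hypothetical sketch of such a trampoline (lib_vid, the vas_switch()
wrapper, and the real_write symbol lookup are all assumptions for
illustration):

    /* The application AS contains only this stub; the library's code and
     * data live solely in the library VAS, out of reach of ROP gadget
     * scans performed in the application AS. */
    static long write_trampoline(int fd, const void *buf, size_t n)
    {
        long ret;

        vas_switch(lib_vid);            /* enter the library's VAS    */
        ret = real_write(fd, buf, n);   /* resolved inside that VAS   */
        vas_switch(0);                  /* back to the application AS */

        return ret;
    }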

Till



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 10:39 AM, Till Smejkal wrote:
> > > Is this an indication that full virtual address spaces are useless?
> > > It would seem like if you only use virtual address segments then you
> > > avoid all of the problems with executing code, active stacks, and
> > > brk.
> >
> > What do you mean with *virtual address segments*? The nice part of
> > first class virtual address spaces is that one can share/reuse
> > collections of address space segments easily.
>
> What do *I* mean?  You introduced the term, didn't you?
> Rereading your original I see you called them "VAS segments".

Oh, I am sorry. I thought that you were referring to some other feature
that I didn't know about.

> Anyway, whatever they are called, it would seem that these segments do not
> require any of the syncing mechanisms that are causing you problems.

Yes, VAS segments provide a possibility to share memory regions between
multiple address spaces without the need to synchronize heap, stack, etc.
Unfortunately, the VAS segment feature by itself, without the whole concept
of first class virtual address spaces, is not as powerful. With some
additional work it can probably be reproduced with the existing shmem
functionality.

The first class virtual address space feature on the other hand provides a
real benefit for applications in our opinion, namely that an application
can switch between different views on its memory, which enables various
interesting programming paradigms as mentioned in the cover letter.

Till



[RFC PATCH 13/13] fs/proc: Add procfs support for first class virtual address spaces

2017-03-13 Thread Till Smejkal
Add new files and directories to the procfs file system that contain
various information about the first class virtual address spaces attached
to the processes in the system.

To the procfs directory of each process in the system (/proc/$PID) an
additional directory with the name 'vas' is added that contains information
about all the VAS that are attached to this process. In this directory one
can find for each attached VAS a special folder with a file with some
status information about the attached VAS, a file with the current memory
map of the attached VAS, and a link to the sysfs folder of the underlying
VAS.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 fs/proc/base.c | 528 +
 fs/proc/inode.c|   1 +
 fs/proc/internal.h |   1 +
 mm/Kconfig |   9 +
 4 files changed, 539 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 87c9a9aacda3..e60c13dd087c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -45,6 +45,9 @@
  *
  *  Paul Mundt <paul.mu...@nokia.com>:
  *  Overall revision about smaps.
+ *
+ *  Till Smejkal <till.smej...@gmail.com>:
+ *  Add entries for first class virtual address spaces.
  */
 
 #include 
@@ -87,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef CONFIG_HARDWALL
 #include 
 #endif
@@ -2841,6 +2845,527 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
return err;
 }
 
+#ifdef CONFIG_VAS_PROCFS
+
+/**
+ * Get a string representation of the access type to a VAS.
+ **/
+#define vas_access_type_str(type) ((type) & MAY_WRITE ?	\
+	((type) & MAY_READ ? "rw" : "wo") : "ro")
+
+static int att_vas_show_status(struct seq_file *sf, void *unused)
+{
+   struct inode *inode = sf->private;
+   struct proc_inode *pi = PROC_I(inode);
+   struct task_struct *tsk;
+   struct vas_context *vas_ctx;
+   struct att_vas *avas;
+   int vid = pi->vas_id;
+
+   tsk = get_proc_task(inode);
+   if (!tsk)
+   return -ENOENT;
+
+   vas_ctx = tsk->vas_ctx;
+
+   vas_context_lock(vas_ctx);
+
+   list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+   if (vid == avas->vas->id)
+   goto good_att_vas;
+   }
+
+   vas_context_unlock(vas_ctx);
+   put_task_struct(tsk);
+
+   return -ENOENT;
+
+good_att_vas:
+   seq_printf(sf,
+  "pid:  %d\n"
+  "vid:  %d\n"
+  "type: %s\n",
+  avas->tsk->pid, avas->vas->id,
+  vas_access_type_str(avas->type));
+
+   vas_context_unlock(vas_ctx);
+   put_task_struct(tsk);
+
+   return 0;
+}
+
+static int att_vas_show_status_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, att_vas_show_status, inode);
+}
+
+static const struct file_operations att_vas_show_status_fops = {
+   .open   = att_vas_show_status_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int att_vas_show_mappings(struct seq_file *sf, void *unused)
+{
+   struct inode *inode = sf->private;
+   struct proc_inode *pi = PROC_I(inode);
+   struct task_struct *tsk;
+   struct vas_context *vas_ctx;
+   struct att_vas *avas;
+   struct mm_struct *mm;
+   struct vm_area_struct *vma;
+   int vid = pi->vas_id;
+
+   tsk = get_proc_task(inode);
+   if (!tsk)
+   return -ENOENT;
+
+   vas_ctx = tsk->vas_ctx;
+
+   vas_context_lock(vas_ctx);
+
+   list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+   if (avas->vas->id == vid)
+   goto good_att_vas;
+   }
+
+   vas_context_unlock(vas_ctx);
+   put_task_struct(tsk);
+
+   return -ENOENT;
+
+good_att_vas:
+   mm = avas->mm;
+
+   down_read(&mm->mmap_sem);
+
+   if (!mm->mmap) {
+   seq_puts(sf, "EMPTY\n");
+   goto out_unlock;
+   }
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   vm_flags_t flags = vma->vm_flags;
+   struct file *file = vma->vm_file;
+   unsigned long long pgoff = 0;
+
+   if (file)
+   pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+
+   seq_printf(sf, "%08lx-%08lx %c%c%c%c [%c:%c] %08llx",
+  vma->vm_start, vma->vm_end,
+  flags & VM_READ ? 'r' : '-',
+  flags & VM_WRITE ? 'w' : '-',
+  flags & VM_EXEC ? 'x' : '-',
+  flags & VM_MAYSHARE ? 's' : 'p',
+ 

[RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
Introduce a different type of address spaces which are first class citizens
in the OS. That means that the kernel now handles two types of AS, those
which are closely coupled with a process and those which aren't. While the
former ones are created and destroyed together with the process by the
kernel and are the default type of AS in the Linux kernel, the latter ones
have to be managed explicitly by the user and are the newly introduced
type.

Accordingly, a first class AS (also called VAS == virtual address space)
can exist in the OS independently from any process. A user has to
explicitly create and destroy them in the system. Processes and VAS can be
combined by attaching a previously created VAS to a process which basically
adds an additional AS to the process that the process' threads are able to
execute in. Hence, VAS allow a process to have different views onto the
main memory of the system (its original AS and the attached VAS) between
which its threads can switch arbitrarily during their lifetime.

The functionality made available through first class virtual address spaces
can be used in various different ways. One possible way to utilize VAS is
to compartmentalize a process for security reasons. Another possible usage
is to improve the performance of data-centric applications by being able to
manage different sets of data in memory without the need to map or unmap
them.

Furthermore, first class virtual address spaces can be attached to
different processes at the same time if the underlying memory is only
readable. This mechanism allows sharing of whole address spaces between
multiple processes that can both execute in them using the contained
memory.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
Signed-off-by: Marco Benatto <marco.antonio@gmail.com>
---
 MAINTAINERS|   10 +
 arch/x86/entry/syscalls/syscall_32.tbl |9 +
 arch/x86/entry/syscalls/syscall_64.tbl |9 +
 fs/exec.c  |3 +
 include/linux/mm_types.h   |8 +
 include/linux/sched.h  |   17 +
 include/linux/syscalls.h   |   11 +
 include/linux/vas.h|  182 +++
 include/linux/vas_types.h  |   88 ++
 include/uapi/asm-generic/unistd.h  |   20 +-
 include/uapi/linux/Kbuild  |1 +
 include/uapi/linux/vas.h   |   16 +
 init/main.c|2 +
 kernel/exit.c  |2 +
 kernel/fork.c  |   28 +-
 kernel/sys_ni.c|   11 +
 mm/Kconfig |   20 +
 mm/Makefile|1 +
 mm/internal.h  |8 +
 mm/memory.c|3 +
 mm/mmap.c  |   22 +
 mm/vas.c   | 2188 
 22 files changed, 2657 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/vas.h
 create mode 100644 include/linux/vas_types.h
 create mode 100644 include/uapi/linux/vas.h
 create mode 100644 mm/vas.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 527d13759ecc..060b1c64e67a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5040,6 +5040,16 @@ F:   Documentation/firmware_class/
 F: drivers/base/firmware*.c
 F: include/linux/firmware.h
 
+FIRST CLASS VIRTUAL ADDRESS SPACES
+M: Till Smejkal <till.smej...@gmail.com>
+L: linux-ker...@vger.kernel.org
+L: linux...@kvack.org
+S: Maintained
+F: include/linux/vas_types.h
+F: include/linux/vas.h
+F: include/uapi/linux/vas.h
+F: mm/vas.c
+
 FLASH ADAPTER DRIVER (IBM Flash Adapter 900GB Full Height PCI Flash Card)
 M: Joshua Morris <josh.h.mor...@us.ibm.com>
 M: Philip Kelleher <pjk1...@linux.vnet.ibm.com>
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2b3618542544..8c553eef8c44 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -389,3 +389,12 @@
 380i386pkey_mprotect   sys_pkey_mprotect
 381i386pkey_alloc  sys_pkey_alloc
 382i386pkey_free   sys_pkey_free
+383i386vas_create  sys_vas_create
+384i386vas_delete  sys_vas_delete
+385i386vas_findsys_vas_find
+386i386vas_attach  sys_vas_attach
+387i386vas_detach  sys_vas_detach
+388i386vas_switch  sys_vas_switch
+389i386active_vas  sys_active_vas
+390i386vas_getattr sys_vas_getattr
+391i386vas_setattr sys_vas_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index e93ef0b38db8..72f1f0495710 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl

[RFC PATCH 12/13] mm/vas: Add lazy-attach support for first class virtual address spaces

2017-03-13 Thread Till Smejkal
Until now, whenever a task attaches a first class virtual address space,
all the memory regions currently present in the task are replicated into
the first class virtual address space so that the task can continue
executing as if nothing has changed. However, this technique causes the
attach and detach operations to be very costly, since the whole memory map
of the task has to be duplicated.

Lazy-attaching on the other hand uses a technique similar to the one used
to copy page tables during fork. Instead of completely duplicating the
memory map of the task together with its page tables, only a skeleton
memory map is created, which is later filled with content when a page fault
is triggered because the process actually accesses the memory regions. The
big advantage is that unnecessary memory regions are not duplicated at all,
but just those that the process actually uses while executing inside the
first class virtual address space. The only memory region that is always
duplicated during the attach operation is the code memory section, because
this memory region is always necessary for execution and duplicating it
eagerly saves us one page fault later during the process' execution.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 include/linux/mm_types.h |   1 +
 include/linux/vas.h  |  26 
 mm/Kconfig   |  18 ++
 mm/memory.c  |   5 ++
 mm/vas.c | 164 ++-
 5 files changed, 197 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 82bf78ea83ee..65e04f14225d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -362,6 +362,7 @@ struct vm_area_struct {
 #ifdef CONFIG_VAS
struct mm_struct *vas_reference;
ktime_t vas_last_update;
+   bool vas_attached;
 #endif
 };
 
diff --git a/include/linux/vas.h b/include/linux/vas.h
index 376b9fa1ee27..8682bfc86568 100644
--- a/include/linux/vas.h
+++ b/include/linux/vas.h
@@ -2,6 +2,7 @@
 #define _LINUX_VAS_H
 
 
+#include 
 #include 
 #include 
 
@@ -293,4 +294,29 @@ static inline int vas_exit(struct task_struct *tsk) { return 0; }
 
 #endif /* CONFIG_VAS */
 
+
+/***
+ * Management of the VAS lazy attaching
+ ***/
+
+#ifdef CONFIG_VAS_LAZY_ATTACH
+
+/**
+ * Lazily update the page tables of a vm_area which was not completely setup
+ * during the VAS attaching.
+ *
+ * @param[in] vma: The vm_area for which the page tables should be
+ * setup before continuing the page fault handling.
+ *
+ * @returns:   0 of the lazy-attach was successful or not
+ * necessary, or 1 if something went wrong.
+ */
+extern int vas_lazy_attach_vma(struct vm_area_struct *vma);
+
+#else /* CONFIG_VAS_LAZY_ATTACH */
+
+static inline int vas_lazy_attach_vma(struct vm_area_struct *vma) { return 0; }
+
+#endif /* CONFIG_VAS_LAZY_ATTACH */
+
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 9a80877f3536..934c56bcdbf4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -720,6 +720,24 @@ config VAS
 
  If not sure, then say N.
 
+config VAS_LAZY_ATTACH
+   bool "Use lazy-attach for First Class Virtual Address Spaces"
+   depends on VAS
+   default y
+   help
+ When this option is enabled, memory regions of First Class Virtual 
+ Address Spaces will be mapped in the task's address space lazily after
+ the switch happened. That means, the actual mapping will happen when a
+ page fault occurs for the particular memory region. While this
+ technique is less costly during the switching operation, it can become
+ very costly during the page fault handling.
+
+ Hence if the program uses a lot of different memory regions, this
+ lazy-attaching technique can be more costly than doing the mapping
+ eagerly during the switch.
+
+ If not sure, then say Y.
+
 config VAS_DEBUG
bool "Debugging output for First Class Virtual Address Spaces"
depends on VAS
diff --git a/mm/memory.c b/mm/memory.c
index e4747b3fd5b9..cdefc99a50ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -64,6 +64,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -4000,6 +4001,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
 
+   /* Check if this VMA belongs to a VAS and needs to be lazy attached. */
+   if (unlikely(vas_lazy_attach_vma(vma)))
+   return VM_FAULT_SIGSEGV;
+
/*
 * Enable the memcg OOM handling for faults triggered in user
 * space.  Kernel faults are handled more gracefully.
diff --git a/mm/vas.c b/mm/vas.c
index 345b023c21aa..953ba8d6e603 100644
--- a/mm/vas.c
+++ b/mm/vas.c
@@ -138,12 +138,13 @@ static void __dump_memory_map(const char *ti

Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 08:14 AM, Till Smejkal wrote:
> > At the current state of the development, first class virtual address spaces
> > have one limitation, that we haven't been able to solve so far. The feature
> > allows, that different threads of the same process can execute in different
> > AS at the same time. This is possible, because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
> > Accordingly, parts of the processes AS (e.g. the code section, data
> > section, heap section and stack sections) exist in multiple AS if the
> > process has a VAS attached to it. Changes to these shared memory regions
> > are synchronized between the address spaces whenever a thread switches
> > between two of them. Unfortunately, in some scenarios the kernel is not
> > able to properly synchronize all these shared memory regions because of
> > conflicting changes. One such example happens if there are two threads, one
> > executing in an attached first class virtual address space, the other in
> > the tasks original address space. If both threads make changes to the heap
> > section that cause expansion of the underlying vm_area_struct, the kernel
> > cannot correctly synchronize these changes, because that would cause parts
> > of the virtual address space to be overwritten with unrelated data. In the
> > current implementation such conflicts are only detected but not resolved
> > and result in an error code being returned by the kernel during the VAS
> > switch operation. Unfortunately, that means for the particular thread that
> > tried to make the switch, that it cannot do this anymore in the future and
> > accordingly has to be killed.
> 
> This sounds like a fairly fundamental problem to me.

Yes, I agree. This is a significant limitation of first class virtual
address spaces. However, conflicts like this can be mitigated by being
careful in the application that uses multiple first class virtual address
spaces. If all threads make sure that they never resize shared memory
regions while executing inside a VAS, such conflicts do not occur. Another
possibility that I investigated but have not yet finished is to synchronize
such resizes of shared memory regions more frequently than just at every
switch between VASes. If one, for example, "forwards" memory region resizes
to all ASes that share this particular memory region during the resize
operation, one can completely eliminate this problem. Unfortunately, this
adds a significant cost and introduces a race condition that is difficult
to handle.

> Is this an indication that full virtual address spaces are useless?  It
> would seem like if you only use virtual address segments then you avoid all
> of the problems with executing code, active stacks, and brk.

What do you mean with *virtual address segments*? The nice part of first
class virtual address spaces is that one can share/reuse collections of
address space segments easily.

Till



[RFC PATCH 08/13] kernel/fork: Define explicitly which mm_struct to duplicate during fork

2017-03-13 Thread Till Smejkal
The dup_mm function used during 'do_fork' to duplicate the current task's
mm_struct for the newly forked task always implicitly uses current->mm for
this purpose. However, copy_mm has already decided which mm_struct to
copy/duplicate. So pass this mm_struct to dup_mm instead of deciding again
which mm_struct to use.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 kernel/fork.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9209f6d5d7c0..d3087d870855 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1158,9 +1158,10 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
  * Allocate a new mm structure and copy contents from the
  * mm structure of the passed in task structure.
  */
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk,
+   struct mm_struct *oldmm)
 {
-   struct mm_struct *mm, *oldmm = current->mm;
+   struct mm_struct *mm;
int err;
 
mm = allocate_mm();
@@ -1226,7 +1227,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
}
 
retval = -ENOMEM;
-   mm = dup_mm(tsk);
+   mm = dup_mm(tsk, oldmm);
if (!mm)
goto fail_nomem;
 
-- 
2.12.0




Re: [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions

2017-03-13 Thread Till Smejkal
Hi Matthew,

On Mon, 13 Mar 2017, Matthew Wilcox wrote:
> On Mon, Mar 13, 2017 at 03:14:13PM -0700, Till Smejkal wrote:
> > +/**
> > + * Create a new VAS segment.
> > + *
> > + * @param[in] name:	The name of the new VAS segment.
> > + * @param[in] start:	The address where the VAS segment begins.
> > + * @param[in] end:	The address where the VAS segment ends.
> > + * @param[in] mode:	The access rights for the VAS segment.
> > + *
> > + * @returns:	The VAS segment ID on success, -ERRNO otherwise.
> > + **/
> 
> Please follow the kernel-doc conventions, as described in
> Documentation/doc-guide/kernel-doc.rst.  Also, function documentation
> goes with the implementation, not the declaration.

Thank you for this pointer. I wasn't aware of this convention. I will
change the patches accordingly.
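
For illustration, the same comment in kernel-doc form, placed at the
implementation (the function name vas_seg_create is assumed here, since
only the comment text is quoted above):

    /**
     * vas_seg_create() - Create a new VAS segment.
     * @name: The name of the new VAS segment.
     * @start: The address where the VAS segment begins.
     * @end: The address where the VAS segment ends.
     * @mode: The access rights for the VAS segment.
     *
     * Return: The VAS segment ID on success, -ERRNO otherwise.
     */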

> > +/**
> > + * Get ID of the VAS segment belonging to a given name.
> > + *
> > + * @param[in] name:	The name of the VAS segment for which the ID
> > + *			should be returned.
> > + *
> > + * @returns:	The VAS segment ID on success, -ERRNO otherwise.
> > + **/
> > +extern int vas_seg_find(const char *name);
> 
> So ... segments have names, and IDs ... and access permissions ...
> Why isn't this a special purpose filesystem?

We also thought about this. However, we decided against implementing them
as a special purpose filesystem, mainly because we could not think of a
good way to represent a VAS/VAS segment in this filesystem (should they be
represented as files or as directories?) and we weren't sure what a
hierarchy in the filesystem would mean for the underlying address spaces.
Hence, we decided against it and rather used a combination of IDR and
sysfs. However, I don't have any strong feelings and would also reimplement
them as a special purpose filesystem if people would rather have them be
one.

Till



[RFC PATCH 06/13] mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem

2017-03-13 Thread Till Smejkal
Make the functions 'vma_link' and 'find_vma_links' accessible to other
source files in the mm/ source directory of the kernel so that other files
in that directory can also perform low level changes to mm_struct data
structures.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 mm/internal.h | 11 +++
 mm/mmap.c | 12 ++--
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 7aa2ea0a8623..e22cb031b45b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -76,6 +76,17 @@ static inline void set_page_refcounted(struct page *page)
 extern unsigned long highest_memmap_pfn;
 
 /*
+ * in mm/mmap.c
+ */
+extern void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
+struct vm_area_struct *prev, struct rb_node **rb_link,
+struct rb_node *rb_parent);
+extern int find_vma_links(struct mm_struct *mm, unsigned long addr,
+ unsigned long end, struct vm_area_struct **pprev,
+ struct rb_node ***rb_link,
+ struct rb_node **rb_parent);
+
+/*
  * in mm/vmscan.c:
  */
 extern int isolate_lru_page(struct page *page);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f60c8ebd6b6..d35c6b51cadf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -466,9 +466,9 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
	anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-static int find_vma_links(struct mm_struct *mm, unsigned long addr,
-   unsigned long end, struct vm_area_struct **pprev,
-   struct rb_node ***rb_link, struct rb_node **rb_parent)
+int find_vma_links(struct mm_struct *mm, unsigned long addr,
+  unsigned long end, struct vm_area_struct **pprev,
+  struct rb_node ***rb_link, struct rb_node **rb_parent)
 {
struct rb_node **__rb_link, *__rb_parent, *rb_prev;
 
@@ -580,9 +580,9 @@ __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
__vma_link_rb(mm, vma, rb_link, rb_parent);
 }
 
-static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
-   struct vm_area_struct *prev, struct rb_node **rb_link,
-   struct rb_node *rb_parent)
+void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, struct rb_node **rb_link,
+ struct rb_node *rb_parent)
 {
struct address_space *mapping = NULL;
 
-- 
2.12.0




Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
Hi Greg,

First of all thanks for your reply.

On Tue, 14 Mar 2017, Greg Kroah-Hartman wrote:
> On Mon, Mar 13, 2017 at 03:14:12PM -0700, Till Smejkal wrote:
> 
> There's no way with that many cc: lists and people that this is really
> making it through very many people's filters and actually on a mailing
> list.  Please trim them down.

I am sorry that the patch's cc-list is too big. This is the list of people
that the get_maintainers.pl script produced. I already recognized that it
was a huge number of people, but I didn't want to remove anyone from the
list because I wasn't sure who would be interested in this patch set. Do
you have any suggestion whom to remove from the list? I don't want to annoy
anyone with useless emails.

> Minor sysfs questions/issues:
> 
> > +struct vas {
> > +	struct kobject kobj;	/* < the internal kobject that we use  *
> > +				 *   for reference counting and sysfs  *
> > +				 *   handling.                         */
> > +
> > +   int id; /* < ID   */
> > +   char name[VAS_MAX_NAME_LENGTH]; /* < name */
> 
> The kobject has a name, why not use that?

The reason why I don't use the kobject's name is that I don't restrict the 
names that
are used for VAS/VAS segments. Accordingly, it would be allowed to use a name 
like
"foo/bar/xyz" as VAS name. However, I am not sure what would happen in the 
sysfs if I
would use such a name for the kobject. Especially, since one could think of 
another
VAS with the name "foo/bar" whose name would conflict with the first one 
although it
not necessarily has any connection with it.

> > +
> > +   struct mutex mtx;   /* < lock for parallel access.*/
> > +
> > +   struct mm_struct *mm;   /* < a partial memory map containing  *
> > +*   all mappings of this VAS.*/
> > +
> > +   struct list_head link;  /* < the link in the global VAS list. */
> > +   struct rcu_head rcu;/* < the RCU helper used for  *
> > +*   asynchronous VAS deletion.   */
> > +
> > +   u16 refcount;   /* < how often is the VAS attached.   */
> 
> The kobject has a refcount, use that?  Don't have 2 refcounts in the
> same structure, that way lies madness.  And bugs, lots of bugs...
> 
> And if this really is a refcount (hint, I don't think it is), you should
> use the refcount_t type.

I actually use both the internal kobject refcount to keep track of how often a
VAS/VAS segment is referenced and this 'refcount' variable to keep track how 
often
the VAS is actually attached to a task. They not necessarily must be related to 
each
other. I can rename this variable to attach_count. Or if preferred I can
alternatively only use the kobject reference counter and remove this variable
completely though I would loose information about how often the VAS is attached 
to a
task because the kobject reference counter is also used to keep track of other
variables referencing the VAS.

> > +/**
> > + * The sysfs structure we need to handle attributes of a VAS.
> > + **/
> > +struct vas_sysfs_attr {
> > +   struct attribute attr;
> > +   ssize_t (*show)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > +   char *buf);
> > +   ssize_t (*store)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > +const char *buf, size_t count);
> > +};
> > +
> > +#define VAS_SYSFS_ATTR(NAME, MODE, SHOW, STORE)
> > \
> > +static struct vas_sysfs_attr vas_sysfs_attr_##NAME =   
> > \
> > +   __ATTR(NAME, MODE, SHOW, STORE)
> 
> __ATTR_RO and __ATTR_RW should work better for you.  If you really need
> this.

Thank you. I will have a look at these functions.

> Oh, and where is the Documentation/ABI/ updates to try to describe the
> sysfs structure and files?  Did I miss that in the series?

Oh sorry, I forgot to add this file. I will add the ABI descriptions for future
submissions.

> > +static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr 
> > *vsattr,
> > +  char *buf)
> > +{
> > +   return scnprintf(buf, PAGE_SIZE, "%s", vas->name);
> 
> It's a page size, just use sprintf() and be done with it.  No need to
> ever check, you "know" it will be correct.

OK. I was following the sysfs example in the documentation that used scnprintf, 
but
if sprintf is preferred, I can change this.

> Also, what about a trailing '\n' f

[RFC PATCH 09/13] mm/memory: Add function to one-to-one duplicate page ranges

2017-03-13 Thread Till Smejkal
Add new function to one-to-one duplicate a page table range of one memory
map to another memory map. The new function 'dup_page_range' copies the
page table entries for the specified region from the page table of the
source memory map to the page table of the destination memory map and
thereby allows actual sharing of the referenced memory pages instead of
relying on copy-on-write for anonymous memory pages or page faults for
read-only memory pages as it is done by the existing function
'copy_page_range'. Hence, 'dup_page_range' will produce shared pages
between two address spaces whereas 'copy_page_range' will result in copies
of pages if necessary.

Preexisting mappings in the page table of the destination memory map are
properly zapped by the 'dup_page_range' function if they differ from the
ones in the source memory map before they are replaced with the new ones.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 include/linux/huge_mm.h |   6 +
 include/linux/hugetlb.h |   5 +
 include/linux/mm.h  |   2 +
 mm/huge_memory.c|  65 +++
 mm/hugetlb.c| 205 +++--
 mm/memory.c | 461 +---
 6 files changed, 620 insertions(+), 124 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 94a0e680b7d7..52c0498426ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 struct vm_area_struct *vma);
+extern int dup_huge_pmd(struct mm_struct *dst_mm,
+   struct vm_area_struct *dst_vma,
+   struct mm_struct *src_mm,
+   struct vm_area_struct *src_vma,
+   struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd,
+   unsigned long addr);
 extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
 extern int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
 extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 72260cc252f2..d8eb682e39a1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -63,6 +63,10 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
 #endif
 
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
vm_area_struct *);
+int dup_hugetlb_page_range(struct mm_struct *dst_mm,
+  struct vm_area_struct *dst_vma,
+  struct mm_struct *src_mm,
+  struct vm_area_struct *src_vma);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 struct page **, struct vm_area_struct **,
 unsigned long *, unsigned long *, long, unsigned int);
@@ -134,6 +138,7 @@ static inline unsigned long hugetlb_total_pages(void)
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)  ({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)  ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
+#define dup_hugetlb_page_range(dst, dst_vma, src, src_vma) ({ BUG(); 0; })
 static inline void hugetlb_report_meminfo(struct seq_file *m)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 92925d97da20..b39ec795f64c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1208,6 +1208,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long 
addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
+int dup_page_range(struct mm_struct *dst, struct vm_area_struct *dst_vma,
+  struct mm_struct *src, struct vm_area_struct *src_vma);
 void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b2604867e5..1edf8c6d1814 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -887,6 +887,71 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
return ret;
 }
 
+int dup_huge_pmd(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+struct mm_struct *src_mm, struct vm_area_struct *src_vma,
+struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd,
+unsigned long addr)
+{
+   spinlock_t *dst_ptl, *src_ptl;
+   struct page *page;
+   pmd_t pmd;
+   pgtable_t pgtable;
+   int ret;
+
+   pgtable = pte_alloc_one(dst_mm, addr);
+   if (!pgtable)
+   

[RFC PATCH 04/13] mm: Add mm_struct argument to 'get_unmapped_area' and 'vm_unmapped_area'

2017-03-13 Thread Till Smejkal
Add the mm_struct that for which an unmapped area should be found as
explicit argument to the 'get_unmapped_area' function. Previously, the
function simply search for an unmapped area in the memory map of the
 current task. However, with the introduction of first class virtual
address spaces, it is necessary that get_unmapped_area also can look for
unmapped area in memory maps other than the one of the current task.

Changing the signature of the generic 'get_unmapped_area' function also
requires that all the 'arch_get_unmapped_area' functions as well as the
'vm_unmapped_area' function with its dependents have to take the memory
map that they should work on as additional argument. Simply using the one
of the current task, as these functions did before, is not correct anymore
and leads to incorrect results.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 arch/alpha/kernel/osf_sys.c  | 19 ++--
 arch/arc/mm/mmap.c   |  8 ++---
 arch/arm/kernel/process.c|  2 +-
 arch/arm/mm/mmap.c   | 19 ++--
 arch/arm64/kernel/vdso.c |  2 +-
 arch/blackfin/include/asm/pgtable.h  |  3 +-
 arch/blackfin/kernel/sys_bfin.c  |  5 ++--
 arch/frv/mm/elf-fdpic.c  | 11 +++
 arch/hexagon/kernel/vdso.c   |  2 +-
 arch/ia64/kernel/perfmon.c   |  3 +-
 arch/ia64/kernel/sys_ia64.c  |  6 ++--
 arch/ia64/mm/hugetlbpage.c   |  7 +++--
 arch/metag/mm/hugetlbpage.c  | 11 +++
 arch/mips/kernel/vdso.c  |  2 +-
 arch/mips/mm/mmap.c  | 27 +
 arch/parisc/kernel/sys_parisc.c  | 19 ++--
 arch/parisc/mm/hugetlbpage.c |  7 +++--
 arch/powerpc/include/asm/book3s/64/hugetlb.h |  6 ++--
 arch/powerpc/include/asm/page_64.h   |  3 +-
 arch/powerpc/kernel/vdso.c   |  2 +-
 arch/powerpc/mm/hugetlbpage-radix.c  |  9 +++---
 arch/powerpc/mm/hugetlbpage.c|  9 +++---
 arch/powerpc/mm/mmap.c   | 17 +--
 arch/powerpc/mm/slice.c  | 25 
 arch/s390/kernel/vdso.c  |  3 +-
 arch/s390/mm/mmap.c  | 42 +-
 arch/sh/kernel/vsyscall/vsyscall.c   |  2 +-
 arch/sh/mm/mmap.c| 19 ++--
 arch/sparc/include/asm/pgtable_64.h  |  4 +--
 arch/sparc/kernel/sys_sparc_32.c |  6 ++--
 arch/sparc/kernel/sys_sparc_64.c | 31 +++-
 arch/sparc/mm/hugetlbpage.c  | 26 
 arch/tile/kernel/vdso.c  |  2 +-
 arch/tile/mm/hugetlbpage.c   | 26 
 arch/x86/entry/vdso/vma.c|  2 +-
 arch/x86/kernel/sys_x86_64.c | 19 ++--
 arch/x86/mm/hugetlbpage.c| 26 
 arch/xtensa/kernel/syscall.c |  7 +++--
 drivers/char/mem.c   | 15 ++
 drivers/dax/dax.c| 10 +++
 drivers/media/usb/uvc/uvc_v4l2.c |  6 ++--
 drivers/media/v4l2-core/v4l2-dev.c   |  8 ++---
 drivers/media/v4l2-core/videobuf2-v4l2.c |  5 ++--
 drivers/mtd/mtdchar.c|  3 +-
 drivers/usb/gadget/function/uvc_v4l2.c   |  3 +-
 fs/hugetlbfs/inode.c |  8 ++---
 fs/proc/inode.c  | 10 +++
 fs/ramfs/file-mmu.c  |  5 ++--
 fs/ramfs/file-nommu.c| 10 ---
 fs/romfs/mmap-nommu.c|  3 +-
 include/linux/fs.h   |  2 +-
 include/linux/huge_mm.h  |  6 ++--
 include/linux/hugetlb.h  |  5 ++--
 include/linux/mm.h   | 16 ++
 include/linux/mm_types.h |  7 +++--
 include/linux/sched.h| 10 +++
 include/linux/shmem_fs.h |  5 ++--
 include/media/v4l2-dev.h |  3 +-
 include/media/videobuf2-v4l2.h   |  5 ++--
 ipc/shm.c| 10 +++
 kernel/events/uprobes.c  |  2 +-
 mm/huge_memory.c | 18 +++-
 mm/mmap.c| 44 ++--
 mm/mremap.c  | 11 +++
 mm/nommu.c   | 10 ---
 mm/shmem.c   | 14 -
 sound/core/pcm_native.c  |  3 +-
 67 files changed, 370 insertions(+), 326 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 54d8616644e2..281109bcdc5d 

[RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions

2017-03-13 Thread Till Smejkal
VAS segments are an extension to first class virtual address spaces that
can be used to share specific memory regions between multiple first class
virtual address spaces. VAS segments have a specific size and position in a
virtual address space and can thereby be used to share in-memory pointer
based data structures between multiple address spaces as well as other
in-memory data without the need to represent them in mmap-able files or
use shmem.

Similar to first class virtual address spaces, VAS segments must be created
and destroyed explicitly by a user. The system will never automatically
destroy or create a virtual segment. Via attaching a VAS segment to a first
class virtual address space, the memory that is contained in the VAS
segment can be accessed and changed.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
Signed-off-by: Marco Benatto <marco.antonio@gmail.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |7 +
 arch/x86/entry/syscalls/syscall_64.tbl |7 +
 include/linux/syscalls.h   |   10 +
 include/linux/vas.h|  114 +++
 include/linux/vas_types.h  |   91 ++-
 include/uapi/asm-generic/unistd.h  |   16 +-
 include/uapi/linux/vas.h   |   12 +
 kernel/sys_ni.c|7 +
 mm/vas.c   | 1234 ++--
 9 files changed, 1451 insertions(+), 47 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 8c553eef8c44..a4f91d14a856 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,10 @@
 389i386active_vas  sys_active_vas
 390i386vas_getattr sys_vas_getattr
 391i386vas_setattr sys_vas_setattr
+392i386vas_seg_create  sys_vas_seg_create
+393i386vas_seg_delete  sys_vas_seg_delete
+394i386vas_seg_findsys_vas_seg_find
+395i386vas_seg_attach  sys_vas_seg_attach
+396i386vas_seg_detach  sys_vas_seg_detach
+397i386vas_seg_getattr sys_vas_seg_getattr
+398i386vas_seg_setattr sys_vas_seg_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 72f1f0495710..a0f9503c3d28 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,13 @@
 338common  active_vas  sys_active_vas
 339common  vas_getattr sys_vas_getattr
 340common  vas_setattr sys_vas_setattr
+341common  vas_seg_create  sys_vas_seg_create
+342common  vas_seg_delete  sys_vas_seg_delete
+343common  vas_seg_findsys_vas_seg_find
+344common  vas_seg_attach  sys_vas_seg_attach
+345common  vas_seg_detach  sys_vas_seg_detach
+346common  vas_seg_getattr sys_vas_seg_getattr
+347common  vas_seg_setattr sys_vas_seg_setattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fdea27d37c96..7380dcdc4bc1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -66,6 +66,7 @@ struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
 struct vas_attr;
+struct vas_seg_attr;
 union bpf_attr;
 
 #include 
@@ -914,4 +915,13 @@ asmlinkage long sys_active_vas(void);
 asmlinkage long sys_vas_getattr(int vid, struct vas_attr __user *attr);
 asmlinkage long sys_vas_setattr(int vid, struct vas_attr __user *attr);
 
+asmlinkage long sys_vas_seg_create(const char __user *name, unsigned long 
start,
+  unsigned long end, umode_t mode);
+asmlinkage long sys_vas_seg_delete(int sid);
+asmlinkage long sys_vas_seg_find(const char __user *name);
+asmlinkage long sys_vas_seg_attach(int vid, int sid, int type);
+asmlinkage long sys_vas_seg_detach(int vid, int sid);
+asmlinkage long sys_vas_seg_getattr(int sid, struct vas_seg_attr __user *attr);
+asmlinkage long sys_vas_seg_setattr(int sid, struct vas_seg_attr __user *attr);
+
 #endif
diff --git a/include/linux/vas.h b/include/linux/vas.h
index 6a72e42f96d2..376b9fa1ee27 100644
--- a/include/linux/vas.h
+++ b/include/linux/vas.h
@@ -138,6 +138,120 @@ extern int vas_setattr(int vid, struct vas_attr *attr);
 
 
 /***
+ * Management of VAS segments
+ ***/
+
+/**
+ * Lock and unlock helper for VAS segments.
+ **/
+#define vas_seg_lock(seg) mutex_lock(&(seg)->mtx)
+#define vas_seg_unlock(seg) mutex_unlock(&(seg)->mtx)
+
+/**
+ * Create a new VAS segment.
+ *
+ * @param[in] name:The name of the new VAS segment.
+ * @param[in] start:   The address where the VAS segment begins.
+ * @param[in] end: The address where the VAS segment ends.
+ * @param[in] mode:The access 

[RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'

2017-03-13 Thread Till Smejkal
The only way until now to create a new memory map was via the exported
function 'mm_alloc'. Unfortunately, this function not only allocates a new
memory map, but also completely initializes it. However, with the
introduction of first class virtual address spaces, some initialization
steps done in 'mm_alloc' are not applicable to the memory maps needed for
this feature and hence would lead to errors in the kernel code.

Instead of introducing a new function that can allocate and initialize
memory maps for first class virtual address spaces and potentially
duplicate some code, I decided to split the mm_alloc function as well as
the 'mm_init' function that it uses.

Now there are four functions exported instead of only one. The new
'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
want to have the old behavior of mm_alloc one can use the newly introduced
function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
also fully initializes it.

The old 'mm_init' function which fully initialized a mm_struct was split up
into two separate functions. The first one - 'mm_setup' - does all the
initialization of the mm_struct that is not related to the mm_struct
belonging to a particular task. This part of the initialization is done in
the 'mm_set_task' function. This way it is possible to create memory maps
that don't have any task-specific information as needed by the first class
virtual address space feature. Both functions, 'mm_setup' and 'mm_set_task'
are also exported, so that they can be used in all files in the source
tree.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 arch/arm/mach-rpc/ecard.c |  2 +-
 fs/exec.c |  2 +-
 include/linux/sched.h |  7 +-
 kernel/fork.c | 64 +--
 4 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index dc67a7fb3831..15845e8abd7e 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -245,7 +245,7 @@ static void ecard_init_pgtables(struct mm_struct *mm)
 
 static int ecard_init_mm(void)
 {
-   struct mm_struct * mm = mm_alloc();
+   struct mm_struct *mm = mm_alloc_and_setup();
struct mm_struct *active_mm = current->active_mm;
 
if (!mm)
diff --git a/fs/exec.c b/fs/exec.c
index e57946610733..68d7908a1e5a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -380,7 +380,7 @@ static int bprm_mm_init(struct linux_binprm *bprm)
int err;
struct mm_struct *mm = NULL;
 
-   bprm->mm = mm = mm_alloc();
+   bprm->mm = mm = mm_alloc_and_setup();
err = -ENOMEM;
if (!mm)
goto err;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42b9b93a50ac..7955adc00397 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2922,7 +2922,12 @@ static inline unsigned long sigsp(unsigned long sp, 
struct ksignal *ksig)
 /*
  * Routines for handling mm_structs
  */
-extern struct mm_struct * mm_alloc(void);
+extern struct mm_struct *mm_setup(struct mm_struct *mm);
+extern struct mm_struct *mm_set_task(struct mm_struct *mm,
+struct task_struct *p,
+struct user_namespace *user_ns);
+extern struct mm_struct *mm_alloc(void);
+extern struct mm_struct *mm_alloc_and_setup(void);
 
 /* mmdrop drops the mm and the page tables */
 extern void __mmdrop(struct mm_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8ab827c..9209f6d5d7c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -747,8 +747,10 @@ static void mm_init_owner(struct mm_struct *mm, struct 
task_struct *p)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
-   struct user_namespace *user_ns)
+/**
+ * Initialize all the task-unrelated fields of a mm_struct.
+ **/
+struct mm_struct *mm_setup(struct mm_struct *mm)
 {
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
@@ -767,24 +769,37 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
spin_lock_init(>page_table_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
-   mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
 #endif
 
+   mm->flags = default_dump_filter;
+   mm->def_flags = 0;
+
+   if (mm_alloc_pgd(mm))
+   goto fail_nopgd;
+
+   return mm;
+
+fail_nopgd:
+   free_mm(mm);
+   return NULL;
+}
+
+/**
+ * Initialize all the task-related fields of a mm_struct.
+ **/
+struct mm_struct *mm_set_task(struct mm_struct *mm, struct task_struct *p,
+ struct user_namespace *user_ns)
+{
if (current->mm) {
mm->flags = current->mm->flags & M

[RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
First class virtual address spaces (also called VAS) are a new functionality of
the Linux kernel allowing address spaces to exist independently of processes.
The general idea behind this feature is described in a paper at ASPLOS16 with
the title 'SpaceJMP: Programming with Multiple Virtual Address Spaces' [1].

This patchset extends the kernel memory management subsystem with a new
type of address spaces (called VAS) which can be created and destroyed
independently of processes by a user in the system. During its lifetime
such a VAS can be attached to processes by the user which allows a process
to have multiple address spaces and thereby multiple, potentially
different, views on the system's main memory. During its execution the
threads belonging to the process are able to switch freely between the
different attached VAS and the process' original AS enabling them to
utilize the different available views on the memory. These multiple virtual
address spaces per process and the possibility to switch between them
freely can be used in multiple interesting ways as also outlined in the
mentioned paper. Some of the many possible applications are for example to
compartmentalize a process for security reasons, to improve the performance
of data-centric applications and to introduce new application models [1].

In addition to the concept of first class virtual address spaces, this
patchset introduces yet another feature called VAS segments. VAS segments
are memory regions which have a fixed size and position in the virtual
address space and can be shared between multiple first class virtual
address spaces. Such shareable memory regions are especially useful for
in-memory pointer-based data structures or other pure in-memory data.

First class virtual address spaces have a significant advantage compared to
forking a process and using inter process communication mechanism, namely
that creating and switching between VAS is significant faster than creating
and switching between processes. As it can be seen in the following table,
measured on an Intel Xeon E5620 CPU with 2.40GHz, creating a VAS is about 7
times faster than forking and switching between VAS is up to 4 times faster
than switching between processes.

| VAS |  processes  |
-
switch  |   468ns |  1944ns |
create  | 20003ns |150491ns |

Hence, first class virtual address spaces provide a fast mechanism for
applications to utilize multiple virtual address spaces in parallel with a
higher performance than splitting up the application into multiple
independent processes.

Both VAS and VAS segments have another significant advantage when combined
with non-volatile memory. Because of their independent life cycle from
processes and other kernel data structures, they can be used to save
special memory regions or even whole AS into non-volatile memory making it
possible to reuse them across multiple system reboots.

At the current state of the development, first class virtual address spaces
have one limitation, that we haven't been able to solve so far. The feature
allows, that different threads of the same process can execute in different
AS at the same time. This is possible, because the VAS-switch operation
only changes the active mm_struct for the task_struct of the calling
thread. However, when a thread switches into a first class virtual address
space, some parts of its original AS are duplicated into the new one to
allow the thread to continue its execution at its current state.
Accordingly, parts of the processes AS (e.g. the code section, data
section, heap section and stack sections) exist in multiple AS if the
process has a VAS attached to it. Changes to these shared memory regions
are synchronized between the address spaces whenever a thread switches
between two of them. Unfortunately, in some scenarios the kernel is not
able to properly synchronize all these shared memory regions because of
conflicting changes. One such example happens if there are two threads, one
executing in an attached first class virtual address space, the other in
the tasks original address space. If both threads make changes to the heap
section that cause expansion of the underlying vm_area_struct, the kernel
cannot correctly synchronize these changes, because that would cause parts
of the virtual address space to be overwritten with unrelated data. In the
current implementation such conflicts are only detected but not resolved
and result in an error code being returned by the kernel during the VAS
switch operation. Unfortunately, that means for the particular thread that
tried to make the switch, that it cannot do this anymore in the future and
accordingly has to be killed.

This code was developed during an internship at Hewlett Packard Enterprise.

[1] http://impact.crhc.illinois.edu/shared/Papers/ASPLOS16-SpaceJMP.pdf

Till Smejkal (13):
  mm: Add mm_struct argument to 'mmap_region'
  mm: Add

[RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region'

2017-03-13 Thread Till Smejkal
Add to the 'mmap_region' function the mm_struct that it should operate on
as additional argument. Before, the function simply used the memory map of
the current task. However, with the introduction of first class virtual
address spaces, mmap_region needs also be able to operate on other memory
maps than only the current task ones. By adding it as argument we can now
explicitly define which memory map to use.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 arch/mips/kernel/vdso.c |  2 +-
 arch/tile/mm/elf.c  |  2 +-
 include/linux/mm.h  |  5 +++--
 mm/mmap.c   | 10 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index f9dbfb14af33..9631b42908f3 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -108,7 +108,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, 
int uses_interp)
return -EINTR;
 
/* Map delay slot emulation page */
-   base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
+   base = mmap_region(mm, NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
   0);
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 6225cc998db1..a22768059b7a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -141,7 +141,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 */
if (!retval) {
unsigned long addr = MEM_USER_INTRPT;
-   addr = mmap_region(NULL, addr, INTRPT_SIZE,
+   addr = mmap_region(mm, NULL, addr, INTRPT_SIZE,
   VM_READ|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0);
if (addr > (unsigned long) -PAGE_SIZE)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b0f64c..fa483d2ff3eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2016,8 +2016,9 @@ extern int install_special_mapping(struct mm_struct *mm,
 
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned 
long, unsigned long, unsigned long);
 
-extern unsigned long mmap_region(struct file *file, unsigned long addr,
-   unsigned long len, vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+unsigned long addr, unsigned long len,
+vm_flags_t vm_flags, unsigned long pgoff);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
diff --git a/mm/mmap.c b/mm/mmap.c
index dc4291dcc99b..5ac276ac9807 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1447,7 +1447,7 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
vm_flags |= VM_NORESERVE;
}
 
-   addr = mmap_region(file, addr, len, vm_flags, pgoff);
+   addr = mmap_region(mm, file, addr, len, vm_flags, pgoff);
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
 (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1582,10 +1582,10 @@ static inline int accountable_mapping(struct file 
*file, vm_flags_t vm_flags)
return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
-unsigned long mmap_region(struct file *file, unsigned long addr,
-   unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+   unsigned long addr, unsigned long len, vm_flags_t vm_flags,
+   unsigned long pgoff)
 {
-   struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
int error;
struct rb_node **rb_link, *rb_parent;
@@ -1704,7 +1704,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-   vma == get_gate_vma(current->mm)))
+   vma == get_gate_vma(mm)))
mm->locked_vm += (len >> PAGE_SHIFT);
else
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-- 
2.12.0


___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc


[RFC PATCH 03/13] mm: Rename 'unmap_region' and add mm_struct argument

2017-03-13 Thread Till Smejkal
Rename the 'unmap_region' function to 'munmap_region' so that it uses the
same naming pattern as the do_mmap <-> mmap_region couple. In addition
also make the new 'munmap_region' function publicly available to all other
kernel sources.

In addition, also add to the function the mm_struct it should operate on
as additional argument. Before, the function simply used the memory map of
the current task. However, with the introduction of first class virtual
address spaces, munmap_region need also be able to operate on other memory
maps than just the current task's one. Accordingly, add a new argument to
the function so that one can define explicitly which memory map should be
used.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 include/linux/mm.h |  4 
 mm/mmap.c  | 14 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb11be77545f..71a90604d21f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2023,6 +2023,10 @@ extern unsigned long do_mmap(struct mm_struct *mm, 
struct file *file,
unsigned long addr, unsigned long len, unsigned long prot,
unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
unsigned long *populate);
+
+extern void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, unsigned long start,
+ unsigned long end);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
diff --git a/mm/mmap.c b/mm/mmap.c
index 70028bf7b58d..ea79bc4da5b7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -70,10 +70,6 @@ int mmap_rnd_compat_bits __read_mostly = 
CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
 static bool ignore_rlimit_data;
 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
-static void unmap_region(struct mm_struct *mm,
-   struct vm_area_struct *vma, struct vm_area_struct *prev,
-   unsigned long start, unsigned long end);
-
 /* description of effects of mapping type and prot in current implementation.
  * this is due to the limited x86 page protection hardware.  The expected
  * behavior is in parens:
@@ -1731,7 +1727,7 @@ unsigned long mmap_region(struct mm_struct *mm, struct 
file *file,
fput(file);
 
/* Undo any partial mapping done by a device driver. */
-   unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+   munmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
@@ -2447,9 +2443,9 @@ static void remove_vma_list(struct mm_struct *mm, struct 
vm_area_struct *vma)
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm,
-   struct vm_area_struct *vma, struct vm_area_struct *prev,
-   unsigned long start, unsigned long end)
+void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+   struct vm_area_struct *prev, unsigned long start,
+   unsigned long end)
 {
struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
struct mmu_gather tlb;
@@ -2654,7 +2650,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
 * Remove the vma's, and unmap the actual pages
 */
detach_vmas_to_be_unmapped(mm, vma, prev, end);
-   unmap_region(mm, vma, prev, start, end);
+   munmap_region(mm, vma, prev, start, end);
 
arch_unmap(mm, vma, start, end);
 
-- 
2.12.0


___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc


[RFC PATCH 02/13] mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff'

2017-03-13 Thread Till Smejkal
Add to the 'do_mmap' and 'do_mmap_pgoff' functions the mm_struct they
should operate on as additional argument. Before, both functions simply
used the memory map of the current task. However, with the introduction of
first class virtual address spaces, these functions also need to be usable
for other memory maps than just the one of the current process. Hence,
explicitly define during the function call which memory map to use.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 arch/x86/mm/mpx.c  |  4 ++--
 fs/aio.c   |  4 ++--
 include/linux/mm.h | 11 ++-
 ipc/shm.c  |  3 ++-
 mm/mmap.c  | 16 
 mm/nommu.c |  7 ---
 mm/util.c  |  2 +-
 7 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index af59f808742f..99c664a97c35 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -50,8 +50,8 @@ static unsigned long mpx_mmap(unsigned long len)
return -EINVAL;
 
down_write(>mmap_sem);
-   addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-   MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, );
+   addr = do_mmap(mm, NULL, 0, len, PROT_READ | PROT_WRITE,
+  MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, );
up_write(>mmap_sem);
if (populate)
mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 873b4ca82ccb..df9bba5a2aff 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -510,8 +510,8 @@ static int aio_setup_ring(struct kioctx *ctx)
return -EINTR;
}
 
-   ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
-  PROT_READ | PROT_WRITE,
+   ctx->mmap_base = do_mmap_pgoff(current->mm, ctx->aio_ring_file, 0,
+  ctx->mmap_size, PROT_READ | PROT_WRITE,
   MAP_SHARED, 0, );
up_write(>mmap_sem);
if (IS_ERR((void *)ctx->mmap_base)) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa483d2ff3eb..fb11be77545f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2019,17 +2019,18 @@ extern unsigned long get_unmapped_area(struct file *, 
unsigned long, unsigned lo
 extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 unsigned long addr, unsigned long len,
 vm_flags_t vm_flags, unsigned long pgoff);
-extern unsigned long do_mmap(struct file *file, unsigned long addr,
-   unsigned long len, unsigned long prot, unsigned long flags,
-   vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
+extern unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+   unsigned long addr, unsigned long len, unsigned long prot,
+   unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
+   unsigned long *populate);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
-do_mmap_pgoff(struct file *file, unsigned long addr,
+do_mmap_pgoff(struct mm_struct *mm, struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
unsigned long pgoff, unsigned long *populate)
 {
-   return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
+   return do_mmap(mm, file, addr, len, prot, flags, 0, pgoff, populate);
 }
 
 #ifdef CONFIG_MMU
diff --git a/ipc/shm.c b/ipc/shm.c
index 81203e8ba013..64c21fb32ca9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1222,7 +1222,8 @@ long do_shmat(int shmid, char __user *shmaddr, int 
shmflg, ulong *raddr,
goto invalid;
}
 
-   addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, );
+   addr = do_mmap_pgoff(mm, file, addr, size, prot, flags, 0,
+);
*raddr = addr;
err = 0;
if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ac276ac9807..70028bf7b58d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1299,14 +1299,14 @@ static inline int mlock_future_check(struct mm_struct 
*mm,
 }
 
 /*
- * The caller must hold down_write(>mm->mmap_sem).
+ * The caller must hold down_write(>mmap_sem).
  */
-unsigned long do_mmap(struct file *file, unsigned long addr,
-   unsigned long len, unsigned long prot,
-   unsigned long flags, vm_flags_t vm_flags,
-   unsigned long pgoff, unsigned long *populate)
+unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long prot, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff,
+ unsigned long *populate)
 {
-   struct mm_struct *mm = current->mm;
int pkey = 0;
 
*populate = 0;
@@