Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Boaz Harrosh

On 17/09/2020 15:48, Matthew Wilcox wrote:

On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote:

<>


If we change bio_endio to invoke the ->bi_end_io callbacks in softirq
context instead of hardirq context, we can change the pagecache to take
BH-safe locks instead of IRQ-safe locks.  I believe the only reason the
lock needs to be IRQ-safe is for the benefit of paths like:



From my totally subjective experience on the filesystem side (as a user of
bio_endio), all the HW block drivers I have used, including NVMe, isci,
SATA... etc., end up calling bio_endio in softirq. The big exception to
that is the vdX drivers under KVM, which is very ironic to me.

I wish we could make all drivers be uniform in this regard.

But maybe I'm just speaking crap; it's only from my limited debugging
experience.


Thanks
Boaz


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Boaz Harrosh

On 16/09/2020 15:32, Hou Tao wrote:
<>

However the performance degradation is huge under aarch64 (4 sockets, 24
cores per socket): nearly 60% is lost.

v4.19.111
no writer, reader cn                                |        24 |        48 |        72 |        96
the rate of down_read/up_read per second            | 166129572 | 166064100 | 165963448 | 165203565
the rate of down_read/up_read per second (patched)  |  63863506 |  63842132 |  63757267 |  63514920



I believe Peter Z's suggestion of an additional
percpu_down_read_irqsafe() API is perhaps the way to go, letting only the
in-IRQ users pay the penalty.


Peter Z wrote:

My leading alternative was adding: percpu_down_read_irqsafe() /
percpu_up_read_irqsafe(), which use local_irq_save() instead of
preempt_disable().
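
A minimal sketch of what that read-side fast path might look like, reusing
the field and helper names from the existing percpu-rwsem code
(rcu_sync_is_idle(), __percpu_down_read()); this only illustrates Peter's
idea, with lockdep annotations omitted, and is not the actual patch:

	static inline void percpu_down_read_irqsafe(struct percpu_rw_semaphore *sem)
	{
		unsigned long flags;

		/* disable IRQs rather than just preemption, so a reader
		 * running in IRQ context on this CPU cannot interleave
		 * with the read_count update */
		local_irq_save(flags);
		if (likely(rcu_sync_is_idle(&sem->rss)))
			__this_cpu_inc(*sem->read_count);
		else
			__percpu_down_read(sem, false);	/* slow path */
		local_irq_restore(flags);
	}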


Thanks
Boaz


Re: [PATCH 0/5] Enable per-file/directory DAX operations

2019-10-23 Thread Boaz Harrosh
On 22/10/2019 14:21, Boaz Harrosh wrote:
> On 20/10/2019 18:59, ira.we...@intel.com wrote:
<>
>> At LSF/MM we discussed the difficulties of switching the mode of a file with
>> active mappings / page cache. Rather than solve those races the decision was 
>> to
>> just limit mode flips to 0-length files.
>>
> 
> What I understand from the above is that only "writers", before writing
> any bytes, may turn the flag on, which then persists. But as a very
> long-time user of DAX, usually it is the writers that are least
> interesting. With lots of DAX technologies and emulations the write is
> slower and needs slow "flushing".
> 
> The more interesting performance gains actually come from DAX READs,
> especially across the VM guest (i.e. all VMs share host memory or pmem).
> 
> This feature as I understand it, where I need to know before I write
> whether I will want DAX, and then the write is DAX as well as the reads
> after that, looks not very interesting to me as a user.
> 
> Just my $0.17
> Boaz
> 

I want to say one more thing about this patchset, please. I slept on it,
and I think it is a completely wrong approach with a completely wrong API.

DAX or not-DAX is a matter of transport, and has nothing to do with data
content and persistence. It's like connecting a disk behind, say, a SCSI
bus, and then taking the same DB and putting it behind a multipath or an
NFS server. It is always the same data.
(Same mistake we made with ndctl, which is a cry for generations.)

For me, DAX or no-DAX is an application thing and not a data thing.

The right API is perhaps an open(2) flag, where if you are not the first
opener the open returns EBUSY, and then the app can decide whether to open
without the flag, or to complain and fail (or take open locks).

If you are a second opener when the file is already opened O_DAX, you are
silently running DAX. If you do a second open(O_DAX) when the file is
already opened O_DAX, then of course you succeed.
(Also, the directory inheritance question becomes moot.)
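
A hypothetical usage sketch of that proposal (an O_DAX open(2) flag with
the EBUSY semantics described above is only my suggestion; no such flag
exists today):

	/* the first opener picks the mode; O_DAX here is hypothetical */
	int fd = open(path, O_RDWR | O_DAX);

	if (fd < 0 && errno == EBUSY) {
		/* someone already has the file open without O_DAX:
		 * decide whether to fall back to page-cache mode or fail */
		fd = open(path, O_RDWR);
	}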

Could you please explain the use case behind your model?

Thanks
Boaz


Re: [PATCH 1/5] fs/stat: Define DAX statx attribute

2019-10-22 Thread Boaz Harrosh
On 20/10/2019 18:59, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> In order for users to determine if a file is currently operating in DAX
> mode (effective DAX).  Define a statx attribute value and set that
> attribute if the effective DAX flag is set.
> 
> To go along with this we propose the following addition to the statx man
> page:
> 
> STATX_ATTR_DAX
> 
>   DAX (cpu direct access) is a file mode that attempts to minimize
>   software cache effects for both I/O and memory mappings of this
>   file.  It requires a capable device, a compatible filesystem
>   block size, and filesystem opt-in. It generally assumes all
>   accesses are via cpu load / store instructions which can
>   minimize overhead for small accesses, but adversely affect cpu
>   utilization for large transfers. File I/O is done directly
>   to/from user-space buffers. While the DAX property tends to
>   result in data being transferred synchronously it does not give
>   the guarantees of synchronous I/O that data and necessary
>   metadata are transferred. Memory mapped I/O may be performed
>   with direct mappings that bypass system memory buffering. Again
>   while memory-mapped I/O tends to result in data being
>   transferred synchronously it does not guarantee synchronous
>   metadata updates. A dax file may optionally support being mapped
>   with the MAP_SYNC flag which does allow cpu store operations to
>   be considered synchronous modulo cpu cache effects.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/stat.c | 3 +++
>  include/uapi/linux/stat.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index c38e4c2e1221..59ca360c1ffb 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -77,6 +77,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat 
> *stat,
>   if (IS_AUTOMOUNT(inode))
>   stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> + if (inode->i_flags & S_DAX)

Is there a reason not to use IS_DAX(inode) ?

> + stat->attributes |= STATX_ATTR_DAX;
> +
>   if (inode->i_op->getattr)
>   return inode->i_op->getattr(path, stat, request_mask,
>   query_flags);
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7b35e98d3c58..5b0962121ef7 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -169,6 +169,7 @@ struct statx {
>  #define STATX_ATTR_ENCRYPTED 0x0800 /* [I] File requires key to 
> decrypt in fs */
>  
>  #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */
> +#define STATX_ATTR_DAX   0x2000 /* [I] File is DAX */
>  
>  
>  #endif /* _UAPI_LINUX_STAT_H */
> 
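
For reference, a userspace check of the new attribute might look like the
following sketch (it assumes statx(2) as exposed by glibc 2.28+ and the
STATX_ATTR_DAX value proposed in this patch):

	#include <fcntl.h>	/* AT_FDCWD */
	#include <sys/stat.h>	/* statx(), struct statx, STATX_* */

	/* 1 if path currently operates in DAX mode, 0 if not, -1 on error */
	static int is_file_dax(const char *path)
	{
		struct statx stx;

		if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
			return -1;
		return (stx.stx_attributes & STATX_ATTR_DAX) != 0;
	}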



Re: [PATCH 0/5] Enable per-file/directory DAX operations

2019-10-22 Thread Boaz Harrosh
On 20/10/2019 18:59, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
> 
> The following patch series enables selecting the use of DAX on individual 
> files
> and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
> 
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
> 
> At LSF/MM we discussed the difficulties of switching the mode of a file with
> active mappings / page cache. Rather than solve those races the decision was 
> to
> just limit mode flips to 0-length files.
> 

What I understand from the above is that only "writers", before writing
any bytes, may turn the flag on, which then persists. But as a very
long-time user of DAX, usually it is the writers that are least
interesting. With lots of DAX technologies and emulations the write is
slower and needs slow "flushing".

The more interesting performance gains actually come from DAX READs,
especially across the VM guest (i.e. all VMs share host memory or pmem).

This feature as I understand it, where I need to know before I write
whether I will want DAX, and then the write is DAX as well as the reads
after that, looks not very interesting to me as a user.

Just my $0.17
Boaz

> Finally, the physical DAX flag inheritance is maintained from previous work 
> on 
> XFS but should be added for other file systems for consistence.
> 
> 
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
> 
> To: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: "Darrick J. Wong" 
> Cc: Dan Williams 
> Cc: Dave Chinner 
> Cc: Christoph Hellwig 
> Cc: "Theodore Y. Ts'o" 
> Cc: Jan Kara 
> Cc: linux-e...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> 
> Ira Weiny (5):
>   fs/stat: Define DAX statx attribute
>   fs/xfs: Isolate the physical DAX flag from effective
>   fs/xfs: Separate functionality of xfs_inode_supports_dax()
>   fs/xfs: Clean up DAX support check
>   fs/xfs: Allow toggle of physical DAX flag
> 
>  fs/stat.c |  3 +++
>  fs/xfs/xfs_ioctl.c| 32 ++--
>  fs/xfs/xfs_iops.c | 36 ++--
>  fs/xfs/xfs_iops.h |  2 ++
>  include/uapi/linux/stat.h |  1 +
>  5 files changed, 50 insertions(+), 24 deletions(-)
> 



Re: pagecache locking

2019-07-07 Thread Boaz Harrosh
On 06/07/2019 02:31, Dave Chinner wrote:

> 
> As long as the IO ranges to the same file *don't overlap*, it should
> be perfectly safe to take separate range locks (in read or write
> mode) on either side of the mmap_sem as non-overlapping range locks
> can be nested and will not self-deadlock.
> 
> The "recursive lock problem" still arises with DIO and page faults
> inside gup, but it only occurs when the user buffer range overlaps
> the DIO range to the same file. IOWs, the application is trying to
> do something that has an undefined result and is likely to result in
> data corruption. So, in that case I plan to have the gup page faults
> fail and the DIO return -EDEADLOCK to userspace
> 

This sounds very cool; I now understand. I hope you put all the tools
for this in generic places so they will be easier to reuse.

One thing I will be very curious to see is how you teach lockdep
about the "range locks can be nested" thing. I know it's possible,
other places do it, but it's something I never understood.
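
For what it's worth, the existing way of telling lockdep that two
acquisitions of the same lock class are intentional is the *_nested()
subclass annotations; a generic illustration (this is not Dave's
range-lock code):

	/* both inodes' i_rwsem belong to one lock class; the explicit
	 * subclass on the second acquisition tells lockdep the nesting
	 * is deliberate rather than a self-deadlock */
	down_read(&inode_a->i_rwsem);
	down_read_nested(&inode_b->i_rwsem, I_MUTEX_CHILD);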

> Cheers,
> Dave.

[ Ah, one more question if you have time:

  In one of the mails, and you also mentioned it before, you said that
  the rw_read_lock does not scale well on mammoth machines with tens
  of cores (maybe you said over 20).
  I wonder why that happens. Is it because of the atomic operations,
  or something in the lock algorithm? In my theoretical understanding,
  as long as there are no write-lock grabbers, why would the readers
  interfere with each other?
]

Thanks
Boaz


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-04 Thread Boaz Harrosh
On 04/07/2019 16:58, Matthew Wilcox wrote:
> On Thu, Jul 04, 2019 at 04:00:00PM +0300, Boaz Harrosh wrote:
<>
>> Matthew you must be kidding an ilog2 in binary is zero clocks
>> (Return the highest bit or something like that)
> 
> You might want to actually check the documentation instead of just
> making shit up.
> 

Yes, you are right; I stand corrected. Must be smoking ;-)

> https://www.agner.org/optimize/instruction_tables.pdf
> 
> I think in this instance what we want is BSR (aka ffz) since the input is
> going to be one of 0, 1, 3, 7, 15 or 31 (and we want 0, 1, 2, 3, 4, 5 as
> results).
> 
<>
> The compiler doesn't know the range of 'sibs'.  Unless we do the
> profile-feedback thing.
> 

Would you please consider the use of the get_order() macro (from
#include <...>), just for the sake of understanding? (Or at least add a
comment.)

Thanks
Boaz


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-04 Thread Boaz Harrosh
On 04/07/2019 06:27, Matthew Wilcox wrote:
> On Wed, Jul 03, 2019 at 02:28:41PM -0700, Dan Williams wrote:
<>
>>> +#ifdef CONFIG_XARRAY_MULTI
>>> +   unsigned int sibs = xas->xa_sibs;
>>> +
>>> +   while (sibs) {
>>> +   order++;
>>> +   sibs /= 2;
>>> +   }
>>
>> Use ilog2() here?
> 
> Thought about it.  sibs is never going to be more than 31, so I don't
> know that it's worth eliminating 5 add/shift pairs in favour of whatever
> the ilog2 instruction is on a given CPU.  In practice, on x86, sibs is
> going to be either 0 (PTEs) or 7 (PMDs).  We could also avoid even having
> this function by passing PMD_ORDER or PTE_ORDER into get_unlocked_entry().
> 
> It's probably never going to be noticable in this scenario because it's
> the very last thing checked before we put ourselves on a waitqueue and
> go to sleep.
> 

Matthew, you must be kidding; an ilog2 in binary is zero clocks
(return the highest bit, or something like that).

In any case, it took me 5 minutes to understand what you are doing
here, and I only fully got it when Dan gave his comment. So please, for
the sake of stupid guys like me, could you make it ilog2() so as
to make it easier to understand?
(And please don't do the compiler's job. If on some arch the loop
 is the fastest, let the compiler decide.)
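
For the record, since xa_sibs is always of the form 2^order - 1 (0, 1, 3,
7, 15 or 31), the open-coded loop is equivalent to fls()/ilog2(); a sketch:

	/* the loop from the patch: counts how many times sibs can be
	 * halved, i.e. the position of its highest set bit */
	static unsigned int sibs_to_order(unsigned int sibs)
	{
		unsigned int order = 0;

		while (sibs) {
			order++;
			sibs /= 2;
		}
		return order;	/* == fls(sibs) == ilog2(sibs + 1) */
	}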

Thanks
Boaz


Re: [PATCH] mm: Support madvise_willneed override by Filesystems

2019-07-03 Thread Boaz Harrosh
On 03/07/2019 20:21, Jan Kara wrote:
> On Wed 03-07-19 04:04:57, Boaz Harrosh wrote:
>> On 19/06/2019 11:21, Jan Kara wrote:
>> <>
<>
>> Hi Jan
>>
>> Funny I'm sitting on the same patch since LSF last. I need it too for other
>> reasons. I have not seen, have you pushed your patch yet?
>> (Is based on old v4.20)
> 
> Your patch is wrong due to lock ordering. You should not call vfs_fadvise()
> under mmap_sem. So we need to do a similar dance like madvise_remove(). I
> have to get to writing at least XFS fix so that the madvise change gets
> used and post the madvise patch with it... Sorry it takes me so long.
> 
>   Honza

Ah, sorry, I was not aware of this. Lockdep did not catch it on my setup,
because my setup does not have any locking conflicts with mmap_sem on the
WILL_NEED path.

But surely you are right, because the whole effort is to fix the locking
problems.

I will also try, in a day or two, to do as you suggest and look at
madvise_remove(), once I have a bit of time. Whoever gets to be less busy
first ...

Thank you for your help
Boaz


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-02 Thread Boaz Harrosh
On 03/07/2019 03:42, Dan Williams wrote:
> On Tue, Jul 2, 2019 at 5:23 PM Boaz Harrosh  wrote:
<>
> 
> Yes, but the trick is how to manage cases where someone waiting on one
> type needs to be woken up by an event on the other. 

Exactly I'm totally with you on this.

> So all I'm saying it lets live with more hash collisions until we can figure 
> out a race
> free way to better scale waitqueue usage.
> 

Yes, and let's actually do real measurements to see if this really hurts
needlessly.
Maybe it is not so much.

Thanks
Boaz

<>


Re: pagecache locking

2019-07-02 Thread Boaz Harrosh
On 03/07/2019 04:07, Patrick Farrell wrote:
> Recursively read locking is generally unsafe, that’s why lockdep
> complains about it.  The common RW lock primitives are queued in
> their implementation, meaning this recursive read lock sequence:
> P1 - read (gets lock)
> P2 - write
> P1 - read
> 
> Results not in a successful read lock, but P1 blocking behind P2,
> which is blocked behind P1.  

> Readers are not allowed to jump past waiting writers.

OK, thanks, that makes sense. I did not know about that last part. It's a
kind of lock fairness I did not know we had.

So I guess I'll keep my two locks then. The write_locker is the SLOW
path for me anyway, right?

[While we are on the subject: do mutexes have the same lock fairness as
 above? Does the write_lock side of rw_sem have the same fairness?
 Something I never figured out.]

Thanks
Boaz

> 
> - Patrick


[PATCH] mm: Support madvise_willneed override by Filesystems

2019-07-02 Thread Boaz Harrosh
On 19/06/2019 11:21, Jan Kara wrote:
<>
> Yes, I have patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
> well. I'll post it soon since the rest of the series isn't really dependent
> on it.
> 
>   Honza
> 

Hi Jan

Funny I'm sitting on the same patch since LSF last. I need it too for other
reasons. I have not seen, have you pushed your patch yet?
(Is based on old v4.20)

~
>From fddb38169e33d23060ddd444ba6f2319f76edc89 Mon Sep 17 00:00:00 2001
From: Boaz Harrosh 
Date: Thu, 16 May 2019 20:02:14 +0300
Subject: [PATCH] mm: Support madvise_willneed override by Filesystems

In the patchset:
[b833a3660394] ovl: add ovl_fadvise()
[3d8f7615319b] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
[45cd0faae371] vfs: add the fadvise() file operation

Amir Goldstein introduced a way for filesystems to override fadvise.
Well, madvise_willneed is exactly like fadvise_willneed, except that it
always returns 0.

In this patch we call the FS vector if it exists.

NOTE: I called vfs_fadvise(.., POSIX_FADV_WILLNEED);
  (which is my artistic preference)

I could also selectively call
if (file->f_op->fadvise)
return file->f_op->fadvise(..., POSIX_FADV_WILLNEED);
if we fear theoretical side effects. I don't mind either way.

CC: Amir Goldstein 
CC: Miklos Szeredi 
Signed-off-by: Boaz Harrosh 
---
 mm/madvise.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6cb1ca93e290..6b84ddcaaaf2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -303,7 +304,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
end = vma->vm_end;
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-   force_page_cache_readahead(file->f_mapping, file, start, end - start);
+   vfs_fadvise(file, start << PAGE_SHIFT, (end - start) << PAGE_SHIFT,
+   POSIX_FADV_WILLNEED);
return 0;
 }
 
-- 
2.20.1



Re: pagecache locking

2019-07-02 Thread Boaz Harrosh
On 20/06/2019 01:37, Dave Chinner wrote:
<>
> 
> I'd prefer it doesn't get lifted to the VFS because I'm planning on
> getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> likely to go away in the near term because a range lock can be
> taken on either side of the mmap_sem in the page fault path.
> 
<>
Sir Dave

Sorry if this was answered before; I am very curious, please. In the zufs
project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults
(reads & writes all take read-locks ...).
The only reason I have it is actually because of lockdep.

Specifically for those xfstests that mmap a buffer and then direct-IO
in/out of that buffer from/to another file in the same FS, or the same
file (for lockdep it is the same case).
I would be perfectly happy to recursively _read_lock both at the top of
the DIO path and underneath in the page_fault; I'm _read_locking after
all. But lockdep is hard to convince. So I stole the xfs idea of having an
rw_MMAPLOCK, and grab yet another _write_lock at truncate/punch/clone
time, when all mapping traversal needs to stop for the destructive change
to take place. (Allocations are done another way and are race safe with
traversal.)

How do you intend to address this problem with range-locks? I.e.
recursively taking the same "lock"? Because if not for the recursivity and
lockdep, I would not need the extra lock object per inode.

Thanks
Boaz


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-02 Thread Boaz Harrosh
On 02/07/2019 18:37, Dan Williams wrote:
<>
> 
> I'd be inclined to do the brute force fix of not trying to get fancy
> with separate PTE/PMD waitqueues and then follow on with a more clever
> performance enhancement later. Thoughts about that?
> 

Sir Dan

I do not understand how separate waitqueues are any performance
enhancement. The whole point of the waitqueues is that there are enough of
them, and that the hash function gives a good randomized spread, so that
you effectively get a single locker per waitqueue, unless the system is
very contended and waitqueues are shared. Which is good, because it means
you effectively need back pressure to the app.
(Because pmem IO is mostly CPU bound with no long-term sleeps, I do not
 think you will ever get to that situation.)

So the way I understand it, having twice as many waitqueues serving the
two types will give better overall performance than segregating the types,
each with half the number of queues.

(Regardless of the above problem, where the segregation is not race clean.)
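
To illustrate the shared-pool idea, a minimal sketch loosely modeled on
the fs/dax.c wait table (sizes and names here are illustrative only):

	#define WAIT_TABLE_BITS	8
	#define WAIT_TABLE_SIZE	(1 << WAIT_TABLE_BITS)

	static wait_queue_head_t wait_table[WAIT_TABLE_SIZE];

	static wait_queue_head_t *entry_waitqueue(struct address_space *mapping,
						  unsigned long index)
	{
		unsigned long key = (unsigned long)mapping ^ index;

		/* PTE and PMD waiters hash into the same table, so a waker
		 * on either type reaches waiters of the other, and every
		 * queue stays available to both types */
		return wait_table + hash_long(key, WAIT_TABLE_BITS);
	}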

Thanks
Boaz


Re: remove exofs, the T10 OSD code and block/scsi bidi support V4

2019-02-04 Thread Boaz Harrosh
On 01/02/19 09:55, Christoph Hellwig wrote:
> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and we
> removed it 1.5 years ago.  Exofs is just a simple example without
> real life users.
> 
> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, 

For the 4th time: which changes?

> and does not even work properly currently,
> so I think it's finally time to drop it.
> 
> Quote from Boaz:
> 

Hello? Anyone home?

Please resend without the quote below!!!
Because yes, I said that, but I have also said the opposite. And I have
clearly stated, in reaction to the last 3 resends, that I now object to
the statement below, and that you should please remove it.

Write your own text please. Don't drag me through the mud.

> "As I said then. It is used in Universities for studies and experiments.
> Every once in a while. I get an email with questions and reports.
> 
> But yes feel free to remove the all thing!!
> 
> I guess I can put it up on github. In a public tree.
> 
> Just that I will need to forward port it myself, til now you guys
> been doing this for me ;-)"
> 
> Now the last time this caused a bit of a stir, but still no actual users,
> not even for SG_IO passthrough commands.  So here we go again, this time
> including removing everything in the scsi and block layer supporting it,
> and thus shrinking struct request.
> 

NAK-by: Boaz Harrosh 
For whatever that means ...

Cheers
Boaz



Re: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-12-23 Thread Boaz Harrosh
On 20/12/18 09:26, Christoph Hellwig wrote:
> On Wed, Dec 19, 2018 at 09:01:53PM -0500, Douglas Gilbert wrote:
>>>   1) reduce the size of every kernel with block layer support, and
>>>  even more for every kernel with scsi support
>>
>> By proposing the removal of bidi support from the block layer, it isn't
>> just the SCSI subsystem that will be impacted. Those NVMe documents
>> that you referred me to earlier in the year, in the command tables
>> in 1.3c and earlier you have noticed the 2 bit direction field and
>> what 11b means? Even if there aren't any bidi NVMe commands *** yet,
>> the fact that NVMe's 64 byte command format has provision for 4
>> (not 2) independent data transfers (data + meta, for each direction).
>> Surely NVMe will sooner or later take advantage of those ... a
>> command like READ GATHERED comes to mind.
> 
> NVMe on the other hand does have support for separate read and write
> buffers as in the current SCSI bidi support, as it encodes the data
> transfers in that SQE.  So IFF NVMe does bidi commands it would have
> to use a single buffer for data in/out, 

There is no such thing as a "buffer"; there is at first a bio, and after
virtual-to-IOMMU mapping, a scatter-gather list. All of these are
currently governed by a struct request.
Request, bio, and sgl each have a single direction, and all the APIs
expect a single direction.

All BIDI did was to say: let's not change any API or structure, but just
use two of them at the same time.

Only the very high-level user and the very low-level HW driver, like
iscsi, are any the wiser; the middleware was never touched.

In the view of a bidi target like, say, an OSD, the whole stream looks
like a single "buffer" on the wire, where some of it is read and some of
it is written.

> which can be easily done

?? Did you try? It will take much more than an additional pointer, sir.

> in the block layer without the current bidi support that chains
> two struct request instances for data in and data out.
> 

That was the whole trick of not changing a single API or structure:
just have two of the same thing, which we already know how to handle.
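
(A rough sketch of what that looked like before the removal, based on the
v4.20-era block/scsi code paths; illustrative only:)

	/* two ordinary unidirectional requests ... */
	struct request *rq = blk_get_request(q, REQ_OP_SCSI_OUT, 0);
	struct request *in_rq = blk_get_request(q, REQ_OP_SCSI_IN, 0);

	/* ... chained by a single pointer make one bidi command */
	rq->next_rq = in_rq;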

>>>   2) reduce the size of the critical struct request structure by
>>>  128 bits, thus reducing the memory used by every blk-mq driver
>>>  significantly, never mind the cache effects
>>
>> Hmm, one pointer (that is null in the non-bidi case) should be enough,
>> that's 64 or 32 bits.
> 
> Due to the way we use request chaining we need two fields at the
> moment.  ->special and ->next_rq.  

No! ->special has nothing to do with bidi. ->special is a field to be
used by LLDs only, and is not to be touched by the block layer,
transports, or high-level users.

A request has the single ->next_rq for bidi, and it could be eliminated by
sharing space with the elevator info. Do you want a patch?

(So in effect it can take 0 bytes, and yes, a little bit of code.)

> If we'd refactor the whole thing
> for the basically non-existent user we could indeed probably get it
> down to a single pointer. 
> 
>> While on the subject of bidi, the order of transfers: is the data-out
>> (to the target) always before the data-in or is it the target device
>> that decides (depending on the semantics of the command) who is first?
> 
> The way I read SAM data needs to be transferred to the device for
> processing first, then the processing occurs and then it is transferred
> out, so the order seems fixed.
> 

I'm not sure what the "SAM" above is. But for most of the BIDI commands I
know, OSD and otherwise, the order is command specific, and many times it
is done in parallel:
read some bits, then write some bits, rinse and repeat ...

(You see, in SCSI the whole OUT buffer is part of the actual CDB, so in
 effect any READ is a BIDI. The novelty here is the variable sizes and the
 SW-stack memory targets for the different operations.)

>>
>> Doug Gilbert
>>
>>  *** there could already be vendor specific bidi NVMe commands out
>> there (ditto for SCSI)
> 
> For NVMe they'd need to transfer data in and out in the same buffer
> to sort work, and even then only if we don't happen to be bounce
> buffering using swiotlb, or using a network transport.  Similarly for
> SCSI only iSCSI at the moment supports bidi CDBs, so we could have
> applications using vendor specific bidi commands on iSCSI, which
> is exactly what we're trying to find out, but it is a bit of a very
> niche use case.
> 

Again, bidi works NOW. I have not yet seen the big gain in throwing it
out.

Jai Maa
Boaz



Re: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-12-23 Thread Boaz Harrosh
On 19/12/18 16:43, Christoph Hellwig wrote:
> On Mon, Nov 26, 2018 at 07:11:10PM +0200, Boaz Harrosh wrote:
>> On 11/11/18 15:32, Christoph Hellwig wrote:
>>> The only real user of the T10 OSD protocol, the pNFS object layout
>>> driver never went to the point of having shipping products, and we
>>> removed it 1.5 years ago.  Exofs is just a simple example without
>>> real life users.
>>>
>>
>> You have failed to say what is your motivation for this patchset? What
>> is it you are trying to fix/improve.
> 
> Drop basically unused support, which allows us to
> 
>  1) reduce the size of every kernel with block layer support, and
> even more for every kernel with scsi support

Do you have numbers? It is mainly code segment, so I don't think you will
see any measurable real-life difference.

>  2) reduce the size of the critical struct request structure by
> 128 bits, thus reducing the memory used by every blk-mq driver
> significantly, never mind the cache effects

128 bits? I see the "struct request *next_rq;";
is there another one?

It could share space with elv and flush;
do you want a patch? (Sketch below.)
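
(A sketch of what I mean, against the v4.20-era struct request layout,
where elv and flush already share an anonymous union; an untested
illustration, not a submitted patch:)

	union {
		struct {
			struct io_cq		*icq;
			void			*priv[2];
		} elv;

		struct {
			unsigned int		seq;
			struct list_head	list;
			rq_end_io_fn		*saved_end_io;
		} flush;

		/* a bidi request is never on an IO scheduler nor part of
		 * a flush sequence, so it could live here for free */
		struct request			*next_rq;
	};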

>  3) stop having the maintainance overhead for this code in the
> block layer, which has been rather painful at times
> 

I hear you, man. Life is pain. But is it really such an overhead?
I mean, it is already implemented. What else is there to do?
Please, please show me. (Sorry for being slow.)

Jai Maa
Boaz



Re: [PATCH 33/41] scsi: osd: osd_initiator: mark expected switch fall-throughs

2018-12-18 Thread Boaz Harrosh
On 28/11/18 06:32, Gustavo A. R. Silva wrote:
> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Signed-off-by: Gustavo A. R. Silva 

ACK-by: Boaz Harrosh 

> ---
>  drivers/scsi/osd/osd_initiator.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/osd/osd_initiator.c 
> b/drivers/scsi/osd/osd_initiator.c
> index 60cf7c5eb880..cb26f26d5ec1 100644
> --- a/drivers/scsi/osd/osd_initiator.c
> +++ b/drivers/scsi/osd/osd_initiator.c
> @@ -1849,6 +1849,7 @@ int osd_req_decode_sense_full(struct osd_request *or,
>   32, 1, dump, sizeof(dump), true);
>   OSD_SENSE_PRINT2("response_integrity [%s]\n", dump);
>   }
> + /* fall through */
>   case osd_sense_attribute_identification:
>   {
>   struct osd_sense_attributes_data_descriptor
> @@ -1879,7 +1880,7 @@ int osd_req_decode_sense_full(struct osd_request *or,
>   attr_page, attr_id);
>   }
>   }
> - /*These are not legal for OSD*/
> + /* fall through - These are not legal for OSD */
>   case scsi_sense_field_replaceable_unit:
>   OSD_SENSE_PRINT2("scsi_sense_field_replaceable_unit\n");
>   break;
> 



Re: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-11-26 Thread Boaz Harrosh
On 11/11/18 15:32, Christoph Hellwig wrote:
> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and we
> removed it 1.5 years ago.  Exofs is just a simple example without
> real life users.
> 

You have failed to say what your motivation for this patchset is. What
is it you are trying to fix/improve?

For the sake of "not used much", I fail to see why the risk of this
removal is worth taking.

> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, so I think it's finally time to drop it.
> 
> Quote from Boaz:
> 
> "As I said then. It is used in Universities for studies and experiments.
> Every once in a while. I get an email with questions and reports.
> 
> But yes feel free to remove the all thing!!
> 
> I guess I can put it up on github. In a public tree.
> 
> Just that I will need to forward port it myself, til now you guys
> been doing this for me ;-)"
> 

Yes, I wrote that for V1, but I wrote the *opposite* thing in a later
mail, which nullifies the statement above. So please remove this quote in
future submissions.

Here is what I wrote later in response to V2 of this set:



I think I'm changing my mind about this.

Because of two reasons:

One: I see thousands of bit-rotted things in the kernel, and particularly
SCSI drivers, that are much older and more fat-and-ugliness consuming than
the clean osd stack. For example, all the ISA-bus and ZONE_DMA stuff.

Two: I have offered many times, every time this came up, that if
anyone has a major (or even minor) change to the block and/or scsi layers
that he/she has done, and that now breaks the compilation/run time of
OSD or exofs,
  I'm personally willing to spend my weekends and fix it myself. Send me
  a URL of the tree with the work done, and I will send the patches needed
  to revitalize OSD/exofs as part of that change set.

I have never received any such request to date.

So I would like to protest two of Christoph's statements above.

"The code has been mostly unmaintained for years and is getting in the
 way of block / SCSI changes"

1. What does "maintained" mean? I have, for all these years, been
   immediately responsive to any inquiries and patches to the code in
   question, and am listed as MAINTAINER of this code.
2. I have regularly, forever, every kernel release around the RC3-RC4
   time frame, compiled and run my almost automatic setup and made sure
   things still run as expected (no regressions).

So yes, the code has not seen any new features for years. But it is
regularly tested and proven to work on the latest kernel, so it fails the
definition of "bit rot".

Christoph, you've been saying for so long "getting in the way of
block/SCSI changes". Every time, and again this time, please tell me (you
never answered before): what are these changes you want to make? Can I
please help?

Send me any tree where exofs/osd compilation is broken and I will
personally fix it in "ONE WEEK" time.

(If compilation is fine but you know runtime will break, it's nice to have
 a heads-up, but if not, my automatic system will detect it anyway.)

Let's say that if, in the FUTURE, a change-set is submitted that breaks
OSD/exofs compilation, and I fail to respond with a fix within "ONE WEEK",
then this goes in as part of that change-set. And not with the argument of
"not used, not maintained", but as "breaks compilation of the following
changes". I promise I will gladly ACK it then.

So for now, a personal NACK from me, on the grounds that you never told me
why / what this is breaking.

Thanks
Boaz



> Now the last time this caused a bit of a stir, but still no actual users,
> not even for SG_IO passthrough commands.  So here we go again, this time
> including removing everything in the scsi and block layer supporting it,
> and thus shrinking struct request.
> 

Again: T10-OSD or not, bidi is currently actively used. By Linus's rules
you are not allowed to remove it.

Two use paths:
1. Management CDBs of private vendors, yes, via iscsi, virtio and
   usb-scsi.
2. Target-mode support of WRITE-RETURN-XOR and COMPARE_AND_WRITE.

---
You guys should do what you feel is best, even not answering my questions
and of course not agreeing with my advice, i.e. about breaking people's
setups.

But please remove the wrong quote from me. Please quote my objection
on the matter instead. (Pretty please, because you may surely ignore that
request as well.)

[I am not fighting about this at all. Please do what you need to do.
 I just want to set the record straight, that's all.]

Cheers :-)
Boaz



Re: remove exofs and the T10 OSD code V2

2018-10-31 Thread Boaz Harrosh
On 31/10/18 23:10, Douglas Gilbert wrote:
> On 2018-10-31 4:57 p.m., Boaz Harrosh wrote:
>> On 30/10/18 09:45, Christoph Hellwig wrote:
>>> On Mon, Oct 29, 2018 at 02:42:12PM -0600, Jens Axboe wrote:
>>>> LGTM, for both:
>>>
>>> I also have this one on top as requested by Martin.  The core block
>>> bidi support is unfortunately also used by bsg-lib, although it is
>>> not anywhere near as invasive.  But that is another argument for
>>> looking into moving bsg-lib away from using block queues..
>>>
>>
>> BUT this patch is very very wrong.
>>
>> Totally apart from T10-OSD and its use in the field, support for SCSI
>> BIDI commands is not exclusive to T10-OSD at all. Even the simple
>> scsi-array command set has BIDI operations defined, for example
>> write-return-xor and so on.
>>
>> Also, some private administrative CDBs of some vendor devices use
>> SCSI-BIDI. So this patch just broke some drivers. (User-mode apps use
>> bsg pass-through.)
>>
>> Also, you might (try hard and) remove all usage of scsi-bidi as an
>> initiator from the Linux kernel. But what about target mode? As a
>> target we have supported on-the-wire bidi protocols like
>> write-return-xor and others for a long time.
>> Are you willing to silently break all these setups in the field on the
>> next update?
>> Are you so sure these are never used?
>>
>> PLEASE, I beg of you guys. Do not remove SCSI-BIDI. It is a cry of
>> generations.
>>
>> And I think, by the rules of Linus, as far as target mode goes, you are
>> not allowed to break users in this way.
> 
> Hi,
> I'm in the process of rebuilding the sg driver with the following goals:
> 
>- undo 20 years of road wear, some of which is caused by literally
>  hundreds of "laparoscopic" patches that gradually ruin a driver,
>  at least from the maintainer's viewpoint. Comments that once made
>  sense become cryptic or just nonsense; naming schemes that
>  obliterate variable names to the extent that a random name
>  generator would be easier to follow; and just plain broken code.
>  For example, why sort out the existing locking at a particular
>  level when you can add a new lock in a completely non-orthogonal
>  way? [Yes, I looking at you, google.] Anyway, my first cut at this
>  is out there (on the linux-scsi list, see: "[PATCH v3 0/8] sg:
>  major cleanup, remove max_queue limit"). Not much new there,
>  unless you look very closely
> 
>- the next step is to add to the sg driver async SCSI command
>  capability based on the sg_io_v4 structure previously only used
>  by the bsg driver and now removed from bsg. The main advantage
>  of the sg_io_v4 structure over previous pass-through interface
>  attempts is the support of SCSI bidi commands
> 
>- as part of this effort introduce two new ioctls: SG_IOSUBMIT and
>  SG_IORECEIVE to replace the write()/read() technique currently
>  in use (since Linux 1.0 in 1992). The write()/read() technique
>  seems to be responsible for various security folks losing clumps
>  of their hair. One advantage of ioctls, as Alan Cox pointed out,
>  is the ability to write to and read back from the kernel in a way
>  that is naturally synchronized. [Actually, those security folks
>  shouldn't look too closely at sg_read() in that respect.]
> 
> In discussions with several folks who are on the T10 committee, I
> wondered why there was no READ GATHERED command (there has been a
> WRITE SCATTERED for 2 years). The answer was lack of interest ***,
> plus the difficultly of implementation. You see, READ GATHERED needs
> to send a scatter gather list to the device and get the corresponding
> data back (as a linear array). And that requires either:
>a) bidi commands (scatter gather list in the data-out, corresponding
>   "read" data in the data-in), or
>b) lng SCSI commands, up to around 256 bytes long in which the
>   sgat list is the latter part of that command
> 
> And the T10 folks say neither of those options is well supported or
> is expensive. 

It is supported in Linux; the scsi/osd driver is proof of that. And
expensive it is not: I have demonstrated the ability to saturate a 10G
link over a RAID of devices from a single writer. In OSD everything is
bidi.

> I'm guessing they are referring to Linux and Windows.
> I haven't argued much beyond that point, but it looks like a bit of
> a chicken and egg situation.
> 
> 
> Don't know too much about the T10 OSD stuff. But I can see that it
> uses both long SCSI com

Re: [RFC][PATCH v2 05/10] exofs: use common file type conversion

2018-10-24 Thread Boaz Harrosh
On 23/10/18 23:19, Phillip Potter wrote:
> Deduplicate the exofs file type conversion implementation.
> 
> Original patch by Amir Goldstein.
> 
> v2:
> - This version does not remove EXOFS_FT_x enum from fs/exofs/common.h,
>   as these values are now used in compile-time checks added by
>   Phillip Potter to make sure they remain the same as FT_x values
> 
> v1:
> - Initial implementation
> 
> Signed-off-by: Phillip Potter 

Yes thank you, totally

ACK-by: Boaz Harrosh 

> ---
>  fs/exofs/dir.c | 49 ++---
>  1 file changed, 18 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
> index f0138674c1ed..2e3161ca9014 100644
> --- a/fs/exofs/dir.c
> +++ b/fs/exofs/dir.c
> @@ -204,33 +204,10 @@ exofs_validate_entry(char *base, unsigned offset, 
> unsigned mask)
>   return (char *)p - base;
>  }
>  
> -static unsigned char exofs_filetype_table[EXOFS_FT_MAX] = {
> - [EXOFS_FT_UNKNOWN]  = DT_UNKNOWN,
> - [EXOFS_FT_REG_FILE] = DT_REG,
> - [EXOFS_FT_DIR]  = DT_DIR,
> - [EXOFS_FT_CHRDEV]   = DT_CHR,
> - [EXOFS_FT_BLKDEV]   = DT_BLK,
> - [EXOFS_FT_FIFO] = DT_FIFO,
> - [EXOFS_FT_SOCK] = DT_SOCK,
> - [EXOFS_FT_SYMLINK]  = DT_LNK,
> -};
> -
> -#define S_SHIFT 12
> -static unsigned char exofs_type_by_mode[S_IFMT >> S_SHIFT] = {
> - [S_IFREG >> S_SHIFT]= EXOFS_FT_REG_FILE,
> - [S_IFDIR >> S_SHIFT]= EXOFS_FT_DIR,
> - [S_IFCHR >> S_SHIFT]= EXOFS_FT_CHRDEV,
> - [S_IFBLK >> S_SHIFT]= EXOFS_FT_BLKDEV,
> - [S_IFIFO >> S_SHIFT]= EXOFS_FT_FIFO,
> - [S_IFSOCK >> S_SHIFT]   = EXOFS_FT_SOCK,
> - [S_IFLNK >> S_SHIFT]= EXOFS_FT_SYMLINK,
> -};
> -
>  static inline
>  void exofs_set_de_type(struct exofs_dir_entry *de, struct inode *inode)
>  {
> - umode_t mode = inode->i_mode;
> - de->file_type = exofs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
> + de->file_type = fs_umode_to_ftype(inode->i_mode);
>  }
>  
>  static int
> @@ -279,17 +256,27 @@ exofs_readdir(struct file *file, struct dir_context 
> *ctx)
>   exofs_put_page(page);
>   return -EIO;
>   }
> - if (de->inode_no) {
> - unsigned char t;
>  
> - if (de->file_type < EXOFS_FT_MAX)
> - t = exofs_filetype_table[de->file_type];
> - else
> - t = DT_UNKNOWN;
> + /*
> +  * compile-time asserts that generic FT_x types
> +  * still match EXOFS_FT_x types - no need to list
> +  * for other functions as well as build will
> +  * fail either way
> +  */
> + BUILD_BUG_ON(EXOFS_FT_UNKNOWN != FT_UNKNOWN);
> + BUILD_BUG_ON(EXOFS_FT_REG_FILE != FT_REG_FILE);
> + BUILD_BUG_ON(EXOFS_FT_DIR != FT_DIR);
> + BUILD_BUG_ON(EXOFS_FT_CHRDEV != FT_CHRDEV);
> + BUILD_BUG_ON(EXOFS_FT_BLKDEV != FT_BLKDEV);
> + BUILD_BUG_ON(EXOFS_FT_FIFO != FT_FIFO);
> + BUILD_BUG_ON(EXOFS_FT_SOCK != FT_SOCK);
> + BUILD_BUG_ON(EXOFS_FT_SYMLINK != FT_SYMLINK);
> + BUILD_BUG_ON(EXOFS_FT_MAX != FT_MAX);
>  
> + if (de->inode_no) {
>   if (!dir_emit(ctx, de->name, de->name_len,
>   le64_to_cpu(de->inode_no),
> - t)) {
> + fs_dtype(de->file_type))) {
>   exofs_put_page(page);
>   return 0;
>   }
> 



Re: [PATCH] fs/exofs: Remove ignored __weak attribute

2018-10-02 Thread Boaz Harrosh
On 30/09/18 23:51, Nathan Chancellor wrote:
> Clang warns that the __weak attribute is going to be ignored on
> g_attr_inode_data because it's not in the correct location (needs to be
> after the type).
> 
> In file included from fs/exofs/dir.c:35:
> In file included from fs/exofs/exofs.h:41:
> fs/exofs/common.h:186:21: warning: 'weak' attribute only applies to
> variables, functions, and classes [-Wignored-attributes]
> static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
> ^
> 
> Turns out that GCC ignores the attribute too albeit silently because
> moving the attribute after either osd_attr or g_attr_inode_data like
> all other uses of __weak on variables in the kernel causes a build
> error on both GCC and Clang because static variables cannot be weak
> since weak definitions rely on not having internal linkage:
> 
> In file included from fs/exofs/namei.c:34:
> In file included from fs/exofs/exofs.h:41:
> fs/exofs/common.h:186:30: error: weak declaration cannot have internal
> linkage
> static const struct osd_attr __weak g_attr_inode_data = ATTR_DEF(
>  ^
> 
> Just remove the attribute because it hasn't been correct since the
> initial addition of this file in commit b14f8ab28449 ("exofs: Kbuild,
> Headers and osd utils").
> 
> Reported-by: Nick Desaulniers 
> Reviewed-by: Nick Desaulniers 

ACK-by: Boaz Harrosh 

Yes! thanks
Boaz

> Signed-off-by: Nathan Chancellor 
> ---
>  fs/exofs/common.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/exofs/common.h b/fs/exofs/common.h
> index 7d88ef566213..45da96a1528d 100644
> --- a/fs/exofs/common.h
> +++ b/fs/exofs/common.h
> @@ -183,7 +183,7 @@ struct exofs_fcb {
>  #define EXOFS_INO_ATTR_SIZE  sizeof(struct exofs_fcb)
>  
>  /* This is the Attribute the fcb is stored in */
> -static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
> +static const struct osd_attr g_attr_inode_data = ATTR_DEF(
>   EXOFS_APAGE_FS_DATA,
>   EXOFS_ATTR_INODE_DATA,
>   EXOFS_INO_ATTR_SIZE);
> 



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-22 Thread Boaz Harrosh
On 18/05/18 17:14, Christopher Lameter wrote:
> On Tue, 15 May 2018, Boaz Harrosh wrote:
> 
>>> I don't think page tables work the way you think they work.
>>>
>>> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
>>>
>>> That doesn't just insert it into the local CPU's page table.  Any CPU
>>> which directly accesses or even prefetches that address will also get
>>> the translation into its cache.
>>>
>>
>> Yes I know, but that is exactly the point of this flag. I know that
>> this address is only ever accessed from a single core, because it is an
>> mmap (vma) of an O_TMPFILE-exclusive file created in a core-pinned
>> thread, and I allow only that thread any kind of access to this vma.
>> Both the file handle and the mmapped pointer are kept on the thread
>> stack and have no access from outside.
>>
>> So the whole point of this flag is the kernel driver telling mm that
>> this address is enforced to only be accessed from one core-pinned
>> thread.
> 
> But there are no provisions for probhiting accesses from other cores?
> 
> This means that a casual accidental write from a thread executing on
> another core can lead to arbitrary memory corruption because the cache
> flushing has been bypassed.
> 

No, this is not accurate. A "casual accidental write" will not do any
harm. Only a well-concerted malicious server can exploit this. A different
thread on a different core would need to hit the exact time to read from
the exact pointer in the narrow window while the IO is going on, fault in
a TLB entry at the time of the valid mapping, and then, later, after the
IO has ended and before any of the threads were scheduled out, maliciously
write. All the while, the app has freed its buffers and the buffer was
used for something else.
Please bear in mind that this is only as root, in an /sbin/ executable
signed by the kernel's key. I think that anyone who has gained such access
to the system (i.e. compiled and installed an /sbin server) can just walk
in through the front door. He does not need to exploit this narrow random
hole. Hell, he can easily just modprobe a kernel module.

And I do not understand: everyone is motivated to say "no, it cannot be
solved". So let's start from the beginning.

How can we implement "private memory"?

You know how, in the fork days, we had APIs for "shared memory".

I.E: all read/write memory defaults to private, except specially set up
 "shared memory".
This is vs. threads, where all memory regions are shared.

[Q] How can we implement a "private memory" region?
I.E: all read/write memory defaults to shared, except specially set up
 "private memory".

Can this be done? How? Please advise.
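
(For reference, the fork-era model referred to above, in today's terms; a
plain mmap(2) sketch, nothing new here:)

	#include <sys/mman.h>	/* mmap */
	#include <unistd.h>	/* fork, _exit */

	/* fork model: memory is private by default; sharing is opt-in */
	int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	int *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (fork() == 0) {
		shared[0] = 1;	/* the parent sees this */
		priv[0] = 1;	/* copy-on-write: the parent does not */
		_exit(0);
	}
	/* with threads, every mapping is shared; the question above asks
	 * for the inverse: a per-thread "private" region inside a
	 * threaded process */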

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 17:17, Peter Zijlstra wrote:
<>
>>
>> So I would love some mm guy to explain where are those bits collected?
> 
> Depends on the architecture, some architectures only ever set bits,
> some, like x86, clear bits again. You want to look at switch_mm().
> 
> Basically x86 clears the bit again when we switch away from the mm and
> have/will invalidate TLBs for it in doing so.
> 

Ha, OK, I am starting to get the picture.

>> Which brings me to another question. How can I find from
>> within a thread Say at the file_operations->mmap() call that the thread
>> is indeed core-pinned. What mm_cpumask should I inspect?
> 
> is_percpu_thread().

Right, thank you a lot Peter, this helps.
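
(A sketch of how Peter's pointer might be used at ->mmap() time; the
function name is made up, and VM_LOCAL_CPU is the flag proposed in this
patch, not an upstream one:)

	static int zt_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* only a thread bound to exactly one CPU may set up the
		 * per-core vma */
		if (!is_percpu_thread())
			return -EINVAL;

		vma->vm_flags |= VM_LOCAL_CPU;	/* proposed flag */
		return 0;
	}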

Boaz
> .
> 



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 17:18, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 05:10:57PM +0300, Boaz Harrosh wrote:
>> I'm not a lawyer either, but I think I'm doing OK, because I am doing
>> exactly what FUSE is doing, only some 15 years later, with modern CPUs
>> in mind. I do not think I am doing anything new here, am I?
> 
> You should talk to a lawyer.  I'm not giving you legal advice.
> I'm telling you that I think what you're doing is unethical.
> .
> 

Not more unethical than what is already there. And I do not see how
this is unethical at all. I trust your opinion and would really like
to understand.

For example, your not-in-C zero-copy server: how is it unethical?
I have the same problem, actually; some important parts are not in C.

How is it unethical to want to make this run fast?

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 16:50, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 04:29:22PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 15:03, Matthew Wilcox wrote:
>>> You're getting dangerously close to admitting that the entire point
>>> of this exercise is so that you can link non-GPL NetApp code into the
>>> kernel in clear violation of the GPL.
>>
>> It is not that at all. What I'm trying to do is enable a zero-copy,
>> synchronous, low-latency, low-overhead, highly parallel, new modern
>> interface with application servers.
> 
> ... and fully buzzword compliant.
> 
>> You yourself had such a project that could easily be served out-of-the-box
>> with zufs, of a device that wanted to sit in user-mode.
> 
> For a very different reason.  I think the source code to that project
> is publically available; the problem is that it's not written in C.
> 

Exactly the point, sir. There are many reasons to sit in user-land; for
example, for me it is libraries that cannot be loaded into the kernel.

>> Sometimes it is very convenient and needed for Servers to sit in
>> user-mode. And this interface allows that. And it is not always
>> a licensing thing. Though yes licensing is also an issue sometimes.
>> It is the reality we are living in.
>>
>> But please indulge me I am curious how the point of signing /sbin/
>> servers, made you think about GPL licensing issues?
>>
>> That said, is your point that as long as user-mode servers are slow
>> they are OK to be supported, but if they are as fast as the kernel
>> (as demonstrated, a zufs-based FS was faster than xfs-dax on the same
>> pmem), then it is a GPL violation?
> 
> No.  Read what Linus wrote:
> 
>NOTE! This copyright does *not* cover user programs that use kernel
>  services by normal system calls - this is merely considered normal use
>  of the kernel, and does *not* fall under the heading of "derived work".
> 
> What you're doing is far beyond that exception.  You're developing in
> concert a userspace and kernel component, and claiming that the GPL does
> not apply to the userspace component.  I'm not a lawyer, but you're on
> very thin ice.
> 

But I am not the first one here, am I? FUSE and other interfaces already
did exactly this, long before I did. Actually, any kernel interface has
some user-mode component specifically written for it. And again, I am
legally doing exactly what FUSE is doing, only much faster and, more
importantly for me, highly parallel on all cores, because from my testing
the biggest problem of FUSE for me is that it does not scale.

I'm not a lawyer either, but I think I'm doing OK, because I am doing
exactly what FUSE is doing, only some 15 years later, with modern CPUs in
mind. I do not think I am doing anything new here, am I?

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 15:03, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 02:41:41PM +0300, Boaz Harrosh wrote:
>> That would be very hard. Because that program would:
>> - need to be root
>> - need to start and pretend it is zus Server with the all mount
>>   thread thing, register new filesystem, grab some pmem devices.
>> - Mount the said filesystem on said pmem. Create core-pinned ZT threads
>>   for all CPUs, start accepting IO.
>> - And only then it can start leaking the pointer and do bad things.
> 
> All of these things you've done for me by writing zus Server.  All I
> have to do now is compromise zus Server.
> 
>>   The bad things it can do to the application, not to the Kernel.
>>   And as a full filesystem it can do those bad things to the application
>>   through the front door directly not needing the mismatch tlb at all.
> 
> That's not true.  When I have a TLB entry that points to a page of kernel
> ram, I can do almost anything, depending on what the kernel decides to
> do with that ram next.  Maybe it's page cache again, in which case I can
> affect whatever application happens to get it allocated.  Maybe it's a
> kmalloc page next, in which case I can affect any part of the kernel.
> Maybe it's a page table, then I can affect any process.
> 
>> That said. It brings up a very important point that I wanted to talk about.
>> In this design the zuf(Kernel) and the zus(um Server) are part of the 
>> distribution.
>> I would like to have the zus module be signed by the distro's Kernel's key 
>> and
>> checked on loadtime. I know there is an effort by Redhat guys to try and 
>> sign all
>> /sbin/* servers and have Kernel check these. So this is not the first time 
>> people
>> have thought about that.
> 
> You're getting dangerously close to admitting that the entire point
> of this exercise is so that you can link non-GPL NetApp code into the
> kernel in clear violation of the GPL.
> 

It is not that at all. What I'm trying to do is enable a zero-copy,
synchronous, low-latency, low-overhead, highly parallel, new modern
interface with application servers.

You yourself had such a project, of a device that wanted to sit in
user-mode, that could easily be served out-of-the-box with zufs.

Sometimes it is very convenient and needed for servers to sit in
user-mode, and this interface allows that. And it is not always
a licensing thing, though yes, licensing is also an issue sometimes.
It is the reality we are living in.

But please indulge me, I am curious: how did the point about signing
/sbin/ servers make you think of GPL licensing issues?

That said, is your point that as long as user-mode servers are slow
they are OK to be supported, but if they are as fast as the kernel
(as demonstrated, a zufs-based FS was faster than xfs-dax on the same
pmem), then it is a GPL violation?

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 14:54, Boaz Harrosh wrote:
> On 15/05/18 03:44, Matthew Wilcox wrote:
>> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote:
>>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh  wrote:
>>>> In this project we utilize a per-core server thread so everything
>>>> is kept local. If we use the regular zap_ptes() API All CPU's
>>>> are scheduled for the unmap, though in our case we know that we
>>>> have only used a single core. The regular zap_ptes adds a very big
>>>> latency on every operation and mostly kills the concurrency of the
>>>> over all system. Because it imposes a serialization between all cores
>>>
>>> I'd have thought that in this situation, only the local CPU's bit is
>>> set in the vma's mm_cpumask() and the remote invalidations are not
>>> performed.  Is that a misunderstanding, or is all that stuff not working
>>> correctly?
>>
>> I think you misunderstand Boaz's architecture.  He has one thread per CPU,
>> so every bit will be set in the mm's (not vma's) mm_cpumask.
>>
> 
> Hi Andrew, Matthew
> 
> Yes I have been trying to investigate and trace this for days.
> Please see the code below:
> 
>> #define flush_tlb_range(vma, start, end) \
>>  flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
> 
> The mm_struct @mm below comes from here vma->vm_mm
> 
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index e055d1a..1d398a0 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling 
>> __read_mostly = 33;
>>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>>  unsigned long end, unsigned long vmflag)
>>  {
>>  int cpu;
>>  
>>  struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
>>  .mm = mm,
>>  };
>>  
>>  cpu = get_cpu();
>>  
>>  /* This is also a barrier that synchronizes with switch_mm(). */
>>  info.new_tlb_gen = inc_mm_tlb_gen(mm);
>>  
>>  /* Should we flush just the requested range? */
>>  if ((end != TLB_FLUSH_ALL) &&
>>  !(vmflag & VM_HUGETLB) &&
>>  ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
>>  info.start = start;
>>  info.end = end;
>>  } else {
>>  info.start = 0UL;
>>  info.end = TLB_FLUSH_ALL;
>>  }
>>  
>>  if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>>  VM_WARN_ON(irqs_disabled());
>>  local_irq_disable();
>>  flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
>>  local_irq_enable();
>>  }
>>  
>> -if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> +if (!(vmflag & VM_LOCAL_CPU) &&
>> +cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>>  flush_tlb_others(mm_cpumask(mm), &info);
>>  
> 
> I have been tracing the mm_cpumask(vma->vm_mm) at my driver at
> different points. At vma creation (file_operations->mmap()), 
> and before the call to insert_pfn (which calls here).
> 
> At the beginning I was wishful thinking that the mm_cpumask(vma->vm_mm)
> should have a single bit set just as the affinity of the thread on
> creation of that thread. But then I saw that at 80% of the time some
> other random bits are also set.
> 
> Yes Random. Always the thread affinity (single) bit was set but
> then zero, one or two more bits were set as well. Never seen more than
> two though, which baffles me a lot.
> 
> If it was like Matthew said, i.e. the cpumask of the whole process,
> then I would expect all the bits to be set. Because I have a thread
> on each core. And also I would even expect that all vma->vm_mm
> or maybe mm_cpumask(vma->vm_mm) to point to the same global object.
> But it was not so. it was pointing to some thread unique object but
> still those phantom bits were set all over. (And I am almost sure
> same vma had those bits change over time)
> 
> So I would love some mm guy to explain where are those bits collected?
> But I do not think this is a Kernel bug because as Matthew showed.
> that vma above can and is allowed to leak memory addresses to other
> threads / cores in the same process. So it appears that the Kernel
> has some correct logic behind its madness.
> 
Hi Mark

So you see, 20% of the time the mm_cpumask(vma->vm_mm) has a single
bit set, and a natural call to flush_tlb_range will only invalidate
the local CPU. Are you familiar with this logic?

> Which brings me to another question. How can I find, from
> within a thread, say at the file_operations->mmap() call, that the thread
> is indeed core-pinned? What mm_cpumask should I inspect?
> 

Mark, Peter, do you know how I can check the above?
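
For illustration, a minimal sketch of such a check, assuming a ~4.17-era
kernel where the caller's scheduler affinity is visible as
task_struct->nr_cpus_allowed. This inspects the thread's pinning directly,
which is separate from mm_cpumask() and does not accumulate history:

#include <linux/sched.h>

/*
 * Sketch: in an ->mmap() handler, allow the VM_LOCAL_CPU setup only
 * when the calling thread is pinned to exactly one CPU.
 */
static bool caller_is_core_pinned(void)
{
	/* one allowed CPU means the scheduler can never migrate us */
	return current->nr_cpus_allowed == 1;
}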

Thanks
Boaz

>>  put_cpu();
>>  }
> 
> Thanks
> Boaz
> 



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 15:07, Mark Rutland wrote:
> On Tue, May 15, 2018 at 01:43:23PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 03:41, Matthew Wilcox wrote:
>>> On Mon, May 14, 2018 at 10:37:38PM +0300, Boaz Harrosh wrote:
>>>> On 14/05/18 22:15, Matthew Wilcox wrote:
>>>>> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>>>>>> On a call to mmap an mmap provider (like an FS) can put
>>>>>> this flag on vma->vm_flags.
>>>>>>
>>>>>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>>>>>> from a single-core only, and therefore invalidation (flush_tlb) of
>>>>>> PTE(s) need not be a wide CPU scheduling.
>>>>>
>>>>> I still don't get this.  You're opening the kernel up to being exploited
>>>>> by any application which can persuade it to set this flag on a VMA.
>>>>>
>>>>
>>>> No No this is not an application accessible flag this can only be set
>>>> by the mmap implementor at ->mmap() time (Say same as VM_VM_MIXEDMAP).
>>>>
>>>> Please see the zuf patches for usage (Again apologise for pushing before
>>>> a user)
>>>>
>>>> The mmap provider has all the facilities to know that this can not be
>>>> abused, not even by a trusted Server.
>>>
>>> I don't think page tables work the way you think they work.
>>>
>>> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
>>>
>>> That doesn't just insert it into the local CPU's page table.  Any CPU
>>> which directly accesses or even prefetches that address will also get
>>> the translation into its cache.
>>>
>>
>> Yes I know, but that is exactly the point of this flag. I know that this
>> address is only ever accessed from a single core. Because it is an mmap (vma)
>> of an O_TMPFILE-exclusive file created in a core-pinned thread and I allow
>> only that thread any kind of access to this vma. Both the filehandle and the
>> mmaped pointer are kept on the thread stack and have no access from outside.
> 
> Even if (in the specific context of your application) software on other
> cores might not explicitly access this area, that does not prevent
> allocations into TLBs, and TLB maintenance *cannot* be elided.
> 
> Even assuming that software *never* explicitly accesses an address which
> it has not mapped is insufficient.
> 
> For example, imagine you have two threads, each pinned to a CPU, and
> some local_cpu_{mmap,munmap} which uses your new flag:
> 
>   CPU0CPU1
>   x = local_cpu_mmap(...);
>   do_things_with(x);
>   // speculatively allocates TLB
>   // entries for X.
> 
>   // only invalidates local TLBs
>   local_cpu_munmap(x);
> 
>   // TLB entries for X still live
>   
>   y = local_cpu_mmap(...);
> 
>   // if y == x, we can hit the

But this y == x is not possible. The x here is held throughout the
lifetime of the CPU0-pinned thread, and cannot be allocated later to
another thread.

In fact, if the file holding the VMA closes, we know the server
crashed, and we cleanly close everything
(including properly zapping all maps).

>   // stale TLB entry, and access
>   // the wrong page
>   do_things_with(y);
> 

So even if the CPU prefetched that TLB entry, no one in the system will use
this address until a proper close, where everything is properly flushed.

> Consider that after we free x, the kernel could reuse the page for any
> purpose (e.g. kernel page tables), so this is a major risk.
> 

So yes, we never free x. We hold it for the entire duration of the ZT-thread
(a ZThread is that core-pinned per-core thread we are using),
and each time we map some application buffers into that vma and do a
local_tlb invalidate when done.

When x is de-allocated, due to a close or a crash, it is all properly zapped.

> This flag simply is not safe, unless the *entire* mm is only ever
> accessed from a single CPU. In that case, we don't need the flag anyway,
> as the mm already has a cpumask.
> 

Did you see that other part of the thread, and my answer to Andrew, about
why the vma->vm_mm cpumask is so weird? It is neither all bits set nor always
a single bit set. It is very common (20% of the time) for mm_cpumask(vma->vm_mm)
to have a single bit set, exactly in an application where I have a thread-per-core.
Please look at that? (I'll ping you from that email)

> Thanks,
> Mark.
> 

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 15:09, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 02:41:41PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 14:11, Matthew Wilcox wrote:
> 
>>> You're still thinking about this from the wrong perspective.  If you
>>> were writing a program to attack this facility, how would you do it?
>>> It's not exactly hard to leak one pointer's worth of information.
>>>
>>
>> That would be very hard. Because that program would:
>> - need to be root
>> - need to start and pretend it is zus Server with the all mount
>>   thread thing, register new filesystem, grab some pmem devices.
>> - Mount the said filesystem on said pmem. Create core-pinned ZT threads
>>   for all CPUs, start accepting IO.
>> - And only then it can start leaking the pointer and do bad things.
>>   The bad things it can do to the application, not to the Kernel.
> 
> No I think you can do bad things to the kernel at that point. Consider
> it populating the TLBs on the 'wrong' CPU by 'inadvertenly' touching
> 'random' memory.
> 
> Then cause an invalidation and get the page re-used for kernel bits.
> 
> Then access that page through the 'stale' TLB entry we still have on the
> 'wrong' CPU and corrupt kernel data.
> 

Yes, a BAD filesystem server can do bad things, I agree. But a filesystem can
do very bad things in any case, through the front door, no? And we trust
it with our data. So there is some trust we already put in a filesystem, I think.

I will try to look at this deeper and see if I can actually enforce this policy.
Do you have any ideas? Can I force page faults on the other cores?

Thank you for looking
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 14:47, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 01:43:23PM +0300, Boaz Harrosh wrote:
>> Yes I know, but that is exactly the point of this flag. I know that this
>> address is only ever accessed from a single core. Because it is an mmap (vma)
>> of an O_TMPFILE-exclusive file created in a core-pinned thread and I allow
>> only that thread any kind of access to this vma. Both the filehandle and the
>> mmaped pointer are kept on the thread stack and have no access from outside.
>>
>> So the all point of this flag is the kernel driver telling mm that this
>> address is enforced to only be accessed from one core-pinned thread.
> 
> What happens when the userspace part -- there is one, right, how else do
> you get an mm to stick a vma in -- simply does a full address range
> probe scan?
> 
> Something like this really needs a far more detailed Changelog that
> explains how its to be used and how it is impossible to abuse. Esp. that
> latter is _very_ important.
> 

Thank you, yes. I will try to capture all of this thread in the commit message
and, as Christoph demanded, supply user code to demonstrate usage.

Thank you for looking
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 03:44, Matthew Wilcox wrote:
> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote:
>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh  wrote:
>>> In this project we utilize a per-core server thread so everything
>>> is kept local. If we use the regular zap_ptes() API All CPU's
>>> are scheduled for the unmap, though in our case we know that we
>>> have only used a single core. The regular zap_ptes adds a very big
>>> latency on every operation and mostly kills the concurrency of the
>>> over all system. Because it imposes a serialization between all cores
>>
>> I'd have thought that in this situation, only the local CPU's bit is
>> set in the vma's mm_cpumask() and the remote invalidations are not
>> performed.  Is that a misunderstanding, or is all that stuff not working
>> correctly?
> 
> I think you misunderstand Boaz's architecture.  He has one thread per CPU,
> so every bit will be set in the mm's (not vma's) mm_cpumask.
> 

Hi Andrew, Matthew

Yes I have been trying to investigate and trace this for days.
Please see the code below:

> #define flush_tlb_range(vma, start, end)  \
>   flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)

The mm_struct @mm below comes from vma->vm_mm here.

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e055d1a..1d398a0 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling 
> __read_mostly = 33;
>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>   unsigned long end, unsigned long vmflag)
>  {
>   int cpu;
>  
>   struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
>   .mm = mm,
>   };
>  
>   cpu = get_cpu();
>  
>   /* This is also a barrier that synchronizes with switch_mm(). */
>   info.new_tlb_gen = inc_mm_tlb_gen(mm);
>  
>   /* Should we flush just the requested range? */
>   if ((end != TLB_FLUSH_ALL) &&
>   !(vmflag & VM_HUGETLB) &&
>   ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
>   info.start = start;
>   info.end = end;
>   } else {
>   info.start = 0UL;
>   info.end = TLB_FLUSH_ALL;
>   }
>  
>   if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>   VM_WARN_ON(irqs_disabled());
>   local_irq_disable();
>   flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
>   local_irq_enable();
>   }
>  
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> + if (!(vmflag & VM_LOCAL_CPU) &&
> + cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>   flush_tlb_others(mm_cpumask(mm), &info);
>  

I have been tracing the mm_cpumask(vma->vm_mm) in my driver at
different points: at vma creation (file_operations->mmap()),
and before the call to insert_pfn (which calls here).

At the beginning I was wishful thinking that the mm_cpumask(vma->vm_mm)
should have a single bit set just as the affinity of the thread on
creation of that thread. But then I saw that at 80% of the time some
other random bits are also set.

Yes, random. Always the thread affinity (single) bit was set, but
then zero, one or two more bits were set as well. Never seen more than
two though, which baffles me a lot.

If it was like Matthew said, i.e. the cpumask of the whole process,
then I would expect all the bits to be set, because I have a thread
on each core. And I would also expect all vma->vm_mm,
or maybe mm_cpumask(vma->vm_mm), to point to the same global object.
But it was not so; it was pointing to some thread-unique object, but
still those phantom bits were set all over. (And I am almost sure the
same vma had those bits change over time.)

So I would love some mm guy to explain where those bits are collected.
But I do not think this is a Kernel bug because, as Matthew showed,
the vma above can and is allowed to leak memory addresses to other
threads / cores in the same process. So it appears that the Kernel
has some correct logic behind its madness.

Which brings me to another question. How can I find, from
within a thread, say at the file_operations->mmap() call, that the thread
is indeed core-pinned? What mm_cpumask should I inspect?

>   put_cpu();
>  }

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 14:11, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 01:43:23PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 03:41, Matthew Wilcox wrote:
>>> On Mon, May 14, 2018 at 10:37:38PM +0300, Boaz Harrosh wrote:
>>>> On 14/05/18 22:15, Matthew Wilcox wrote:
>>>>> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>>>>>> On a call to mmap an mmap provider (like an FS) can put
>>>>>> this flag on vma->vm_flags.
>>>>>>
>>>>>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>>>>>> from a single-core only, and therefore invalidation (flush_tlb) of
>>>>>> PTE(s) need not be a wide CPU scheduling.
>>>>>
>>>>> I still don't get this.  You're opening the kernel up to being exploited
>>>>> by any application which can persuade it to set this flag on a VMA.
>>>>>
>>>>
>>>> No No this is not an application accessible flag this can only be set
>>>> by the mmap implementor at ->mmap() time (Say same as VM_VM_MIXEDMAP).
>>>>
>>>> Please see the zuf patches for usage (Again apologise for pushing before
>>>> a user)
>>>>
>>>> The mmap provider has all the facilities to know that this can not be
>>>> abused, not even by a trusted Server.
>>>
>>> I don't think page tables work the way you think they work.
>>>
>>> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
>>>
>>> That doesn't just insert it into the local CPU's page table.  Any CPU
>>> which directly accesses or even prefetches that address will also get
>>> the translation into its cache.
>>
>> Yes I know, but that is exactly the point of this flag. I know that this
>> address is only ever accessed from a single core. Because it is an mmap (vma)
>> of an O_TMPFILE-exclusive file created in a core-pinned thread and I allow
>> only that thread any kind of access to this vma. Both the filehandle and the
>> mmaped pointer are kept on the thread stack and have no access from outside.
>>
>> So the all point of this flag is the kernel driver telling mm that this
>> address is enforced to only be accessed from one core-pinned thread.
> 
> You're still thinking about this from the wrong perspective.  If you
> were writing a program to attack this facility, how would you do it?
> It's not exactly hard to leak one pointer's worth of information.
> 

That would be very hard. Because that program would:
- need to be root
- need to start and pretend it is the zus Server, with the whole mount
  thread thing; register a new filesystem, grab some pmem devices.
- Mount the said filesystem on said pmem. Create core-pinned ZT threads
  for all CPUs, start accepting IO.
- And only then can it start leaking the pointer and do bad things.
  The bad things it can do are to the application, not to the Kernel.
  And as a full filesystem it can do those bad things to the application
  through the front door directly, not needing the mismatched TLB at all.

That said, it brings up a very important point that I wanted to talk about.
In this design the zuf (Kernel) and the zus (um Server) are part of the
distribution. I would like to have the zus module be signed by the distro's
Kernel key and checked at load time. I know there is an effort by the Red Hat
guys to try and sign all /sbin/* servers and have the Kernel check these. So
this is not the first time people have thought about that.

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 10:08, Christoph Hellwig wrote:
> On Mon, May 14, 2018 at 09:26:13PM +0300, Boaz Harrosh wrote:
>> I am please pushing for this patch ahead of the push of ZUFS, because
>> this is the only patch we need from otherwise an STD Kernel.
>>
>> We are partnering with Distro(s) to push ZUFS out-of-tree to beta clients
>> to try and stabilize such a big project before final submission and
>> an ABI / on-disk freeze.
>>
> 
> Please stop this crap.  If you want any new kernel functionality send
> it together with a user.  Even more so for something as questionanble
> and hairy as this.
> 
> With a stance like this you disqualify yourself.
> 

OK, thank you, I see your point. I will try to push a user ASAP.

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 03:41, Matthew Wilcox wrote:
> On Mon, May 14, 2018 at 10:37:38PM +0300, Boaz Harrosh wrote:
>> On 14/05/18 22:15, Matthew Wilcox wrote:
>>> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>>>> On a call to mmap an mmap provider (like an FS) can put
>>>> this flag on vma->vm_flags.
>>>>
>>>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>>>> from a single-core only, and therefore invalidation (flush_tlb) of
>>>> PTE(s) need not be a wide CPU scheduling.
>>>
>>> I still don't get this.  You're opening the kernel up to being exploited
>>> by any application which can persuade it to set this flag on a VMA.
>>>
>>
>> No No this is not an application accessible flag this can only be set
>> by the mmap implementor at ->mmap() time (Say same as VM_VM_MIXEDMAP).
>>
>> Please see the zuf patches for usage (Again apologise for pushing before
>> a user)
>>
>> The mmap provider has all the facilities to know that this can not be
>> abused, not even by a trusted Server.
> 
> I don't think page tables work the way you think they work.
> 
> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
> 
> That doesn't just insert it into the local CPU's page table.  Any CPU
> which directly accesses or even prefetches that address will also get
> the translation into its cache.
> 

Yes I know, but that is exactly the point of this flag. I know that this
address is only ever accessed from a single core, because it is an mmap (vma)
of an O_TMPFILE-exclusive file created in a core-pinned thread, and I allow
only that thread any kind of access to this vma. Both the filehandle and the
mmapped pointer are kept on the thread stack and have no access from outside.

So the whole point of this flag is the kernel driver telling mm that this
address is enforced to only be accessed from one core-pinned thread.
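
A hypothetical userspace sketch of the access pattern described above
(the mount path and sizes are made up, and error checks are omitted); the
fd and the mapping live only on the pinned thread's stack, so the address
never escapes that core:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <unistd.h>

static void *zt_thread(void *cpu_num)
{
	cpu_set_t set;
	int fd;
	void *p;

	/* pin this thread to one core before creating the mapping */
	CPU_ZERO(&set);
	CPU_SET((long)cpu_num, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* unnamed, exclusive file on the server's mount (path is made up) */
	fd = open("/mnt/zufs", O_TMPFILE | O_RDWR | O_EXCL, 0600);
	ftruncate(fd, 4096);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* ... per-core server loop using p; fd and p never escape ... */

	munmap(p, 4096);
	close(fd);
	return NULL;
}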

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-14 Thread Boaz Harrosh
On 14/05/18 22:15, Matthew Wilcox wrote:
> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>> On a call to mmap an mmap provider (like an FS) can put
>> this flag on vma->vm_flags.
>>
>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>> from a single-core only, and therefore invalidation (flush_tlb) of
>> PTE(s) need not be a wide CPU scheduling.
> 
> I still don't get this.  You're opening the kernel up to being exploited
> by any application which can persuade it to set this flag on a VMA.
> 

No, no, this is not an application-accessible flag; it can only be set
by the mmap implementor at ->mmap() time (say, same as VM_MIXEDMAP).

Please see the zuf patches for usage (again, apologies for pushing before
a user).

The mmap provider has all the facilities to know that this cannot be
abused, not even by a trusted Server.

>> NOTE: This vma (VM_LOCAL_CPU) is never used during a page_fault. It is
>> always used in a synchronous way from a thread pinned to a single core.
> 
> It's not a question of how your app is going to use this flag.  It's a
> question about how another app can abuse this flag (or how your app is
> going to be exploited to abuse this flag) to break into the kernel.
> 

If you look at the zuf user you will see that the faults all return
SIGBUS. These can never fault. The server has access to this mapping
from a single thread pinned to a core.

Again, it is not an app-visible flag in any way.

Thanks for looking
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-14 Thread Boaz Harrosh
On 14/05/18 20:28, Boaz Harrosh wrote:
> 
> On a call to mmap an mmap provider (like an FS) can put
> this flag on vma->vm_flags.
> 
> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
> from a single-core only, and therefore invalidation (flush_tlb) of
> PTE(s) need not be a wide CPU scheduling.
> 
> The motivation of this flag is the ZUFS project where we want
> to optimally map user-application buffers into a user-mode-server
> execute the operation and efficiently unmap.
> 

Please, I am pushing for this patch ahead of the push of ZUFS, because
this is the only patch we need on top of an otherwise standard Kernel.

We are partnering with Distro(s) to push ZUFS out-of-tree to beta clients
to try and stabilize such a big project before final submission and
an ABI / on-disk freeze.

By itself this patch has zero risk and cannot break anything.

Thanks
Boaz

> In this project we utilize a per-core server thread so everything
> is kept local. If we use the regular zap_ptes() API All CPU's
> are scheduled for the unmap, though in our case we know that we
> have only used a single core. The regular zap_ptes adds a very big
> latency on every operation and mostly kills the concurrency of the
> over all system. Because it imposes a serialization between all cores
> 
> Some preliminary measurements on a 40-core machine:
> 
>                  unpatched            patched
> Threads   Op/s      Lat [us]   Op/s      Lat [us]
> 1         185391    4.9        200799    4.6
> 2         197993    9.6        314321    5.9
> 4         310597    12.1       565574    6.6
> 8         546702    13.8       1113138   6.6
> 12        641728    17.2       1598451   6.8
> 18        744750    22.2       1648689   7.8
> 24        790805    28.3       1702285   8
> 36        849763    38.9       1783346   13.4
> 48        792000    44.6       1741873   17.4
> 
> We can clearly see that on an unpatched Kernel we do not scale
> and the threads are interfering with each other. This is because
> flush-tlb is scheduled on all (other) CPUs.
> 
> NOTE: This vma (VM_LOCAL_CPU) is never used during a page_fault. It is
> always used in a synchronous way from a thread pinned to a single core.
> 
> Signed-off-by: Boaz Harrosh 
> ---
>  arch/x86/mm/tlb.c  |  3 ++-
>  fs/proc/task_mmu.c |  3 +++
>  include/linux/mm.h |  3 +++
>  mm/memory.c| 13 +++--
>  4 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e055d1a..1d398a0 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -640,7 +640,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned 
> long start,
>   local_irq_enable();
>   }
>  
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> + if (!(vmflag & VM_LOCAL_CPU) &&
> + cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>   flush_tlb_others(mm_cpumask(mm), &info);
>  
>   put_cpu();
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c486ad4..305d6e4 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -680,6 +680,9 @@ static void show_smap_vma_flags(struct seq_file *m, 
> struct vm_area_struct *vma)
>   [ilog2(VM_PKEY_BIT2)]   = "",
>   [ilog2(VM_PKEY_BIT3)]   = "",
>  #endif
> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> + [ilog2(VM_LOCAL_CPU)]   = "lc",
> +#endif
>   };
>   size_t i;
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ac1f06..3d14107 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -226,6 +226,9 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_HIGH_ARCH_2   BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3   BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4   BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_LOCAL_CPU BIT(37) /* FIXME: Needs to move from here */
> +#else /* ! CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
> +#define VM_LOCAL_CPU 0   /* FIXME: Needs to move from here */
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #if defined(CONFIG_X86)
> diff --git a/mm/memory.c b/mm/memory.c
> index 01f5464..6236f5e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1788,6 +1788,7 @@ static int insert_pfn(struct vm_area_struct *vma, 
> unsigned long addr,
>   int retval;
>   pte_t *pte, entry;
>   spinlock_t *ptl;
> + bool need_flush = false;
>  
>   retval = -ENOMEM;
>   pte = get_locked_pte(mm, addr, &ptl);
> @@ -1795,7 +1796,12 @@ static int insert_pfn(struct vm_area_struct *vma, 
> unsigned long addr,

[PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-14 Thread Boaz Harrosh

On a call to mmap, an mmap provider (like an FS) can put
this flag on vma->vm_flags.

The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
from a single-core only, and therefore invalidation (flush_tlb) of
PTE(s) need not be a wide CPU scheduling.

The motivation for this flag is the ZUFS project, where we want
to optimally map user-application buffers into a user-mode server,
execute the operation, and efficiently unmap.

In this project we utilize a per-core server thread so everything
is kept local. If we use the regular zap_ptes() API, all CPUs
are scheduled for the unmap, though in our case we know that we
have only used a single core. The regular zap_ptes adds a very big
latency on every operation and mostly kills the concurrency of the
overall system, because it imposes a serialization between all cores.

Some preliminary measurements on a 40-core machine:

                 unpatched            patched
Threads   Op/s      Lat [us]   Op/s      Lat [us]
1         185391    4.9        200799    4.6
2         197993    9.6        314321    5.9
4         310597    12.1       565574    6.6
8         546702    13.8       1113138   6.6
12        641728    17.2       1598451   6.8
18        744750    22.2       1648689   7.8
24        790805    28.3       1702285   8
36        849763    38.9       1783346   13.4
48        792000    44.6       1741873   17.4

We can clearly see that on an unpatched Kernel we do not scale
and the threads are interfering with each other. This is because
flush-tlb is scheduled on all (other) CPUs.

NOTE: This vma (VM_LOCAL_CPU) is never used during a page_fault. It is
always used in a synchronous way from a thread pinned to a single core.

Signed-off-by: Boaz Harrosh 
---
 arch/x86/mm/tlb.c  |  3 ++-
 fs/proc/task_mmu.c |  3 +++
 include/linux/mm.h |  3 +++
 mm/memory.c| 13 +++--
 4 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e055d1a..1d398a0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -640,7 +640,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long 
start,
local_irq_enable();
}
 
-   if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+   if (!(vmflag & VM_LOCAL_CPU) &&
+   cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
 
put_cpu();
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c486ad4..305d6e4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -680,6 +680,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct 
vm_area_struct *vma)
[ilog2(VM_PKEY_BIT2)]   = "",
[ilog2(VM_PKEY_BIT3)]   = "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+   [ilog2(VM_LOCAL_CPU)]   = "lc",
+#endif
};
size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06..3d14107 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -226,6 +226,9 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_LOCAL_CPU   BIT(37) /* FIXME: Needs to move from here */
+#else /* ! CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+#define VM_LOCAL_CPU   0   /* FIXME: Needs to move from here */
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
diff --git a/mm/memory.c b/mm/memory.c
index 01f5464..6236f5e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1788,6 +1788,7 @@ static int insert_pfn(struct vm_area_struct *vma, 
unsigned long addr,
int retval;
pte_t *pte, entry;
spinlock_t *ptl;
+   bool need_flush = false;
 
retval = -ENOMEM;
pte = get_locked_pte(mm, addr, &ptl);
@@ -1795,7 +1796,12 @@ static int insert_pfn(struct vm_area_struct *vma, 
unsigned long addr,
goto out;
retval = -EBUSY;
if (!pte_none(*pte)) {
-   if (mkwrite) {
+   if ((vma->vm_flags & VM_LOCAL_CPU)) {
+   /* VM_LOCAL_CPU is set, A single CPU is allowed to not
+* go through zap_vma_ptes before changing a pte
+*/
+   need_flush = true;
+   } else if (mkwrite) {
/*
 * For read faults on private mappings the PFN passed
 * in may not match the PFN we have mapped if the
@@ -1807,8 +1813,9 @@ static int insert_pfn(struct vm_area_struct *vma, 
unsigned long addr,
goto out_unlock;
entry = *pte;
goto out_mkwrite;
-   } else
+   } else {
goto out_unlock;

Re: [PATCH] scsi: libosd: Remove VLA usage

2018-05-13 Thread Boaz Harrosh
On 03/05/18 01:55, Kees Cook wrote:
> On the quest to remove all VLAs from the kernel[1] this rearranges the
> code to avoid a VLA warning under -Wvla (gcc doesn't recognize "const"
> variables as not triggering VLA creation). Additionally cleans up variable
> naming to avoid 80 character column limit.
> 
> [1] 
> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
> 

ACK-BY: Boaz Harrosh 

> Signed-off-by: Kees Cook 
> ---
>  drivers/scsi/osd/osd_initiator.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/scsi/osd/osd_initiator.c 
> b/drivers/scsi/osd/osd_initiator.c
> index e18877177f1b..917a86a2ae8c 100644
> --- a/drivers/scsi/osd/osd_initiator.c
> +++ b/drivers/scsi/osd/osd_initiator.c
> @@ -1842,14 +1842,14 @@ int osd_req_decode_sense_full(struct osd_request *or,
>   case osd_sense_response_integrity_check:
>   {
>   struct osd_sense_response_integrity_check_descriptor
> - *osricd = cur_descriptor;
> - const unsigned len =
> -   sizeof(osricd->integrity_check_value);
> - char key_dump[len*4 + 2]; /* 2nibbles+space+ASCII */
> -
> - hex_dump_to_buffer(osricd->integrity_check_value, len,
> -32, 1, key_dump, sizeof(key_dump), true);
> - OSD_SENSE_PRINT2("response_integrity [%s]\n", key_dump);
> + *d = cur_descriptor;
> + /* 2nibbles+space+ASCII */
> + char dump[sizeof(d->integrity_check_value) * 4 + 2];
> +
> + hex_dump_to_buffer(d->integrity_check_value,
> + sizeof(d->integrity_check_value),
> + 32, 1, dump, sizeof(dump), true);
> + OSD_SENSE_PRINT2("response_integrity [%s]\n", dump);
>   }
>   case osd_sense_attribute_identification:
>   {
> 



Re: [PATCH 02/11 linux-next] exofs: use magic.h

2017-05-23 Thread Boaz Harrosh
On 05/21/2017 06:40 PM, Fabian Frederick wrote:
> Filesystems generally use SUPER_MAGIC values from magic.h
> instead of a local definition.
> 

It is fine by me to move to magic.h, but ...

> Signed-off-by: Fabian Frederick 
> ---
>  fs/exofs/common.h  | 6 +-
>  include/uapi/linux/magic.h | 1 +
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/exofs/common.h b/fs/exofs/common.h
> index 7d88ef5..fb988f9 100644
> --- a/fs/exofs/common.h
> +++ b/fs/exofs/common.h
> @@ -37,6 +37,7 @@
>  #define __EXOFS_COM_H__
>  
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -79,11 +80,6 @@ enum {
>  #define EXOFS_BLKSHIFT   12
>  #define EXOFS_BLKSIZE(1UL << EXOFS_BLKSHIFT)
>  
> -/*****************************************************************************
> - * superblock-related things
> - ****************************************************************************/

Please do not remove this comment; it acts as a separator and is related to
everything up to the comment that says:
 * inode-related things

And not only to this definition

Thanks
Boaz

> -#define EXOFS_SUPER_MAGIC0x5DF5
> -
>  /*
>   * The file system control block - stored in object EXOFS_SUPER_ID's data.
>   * This is where the in-memory superblock is stored on disk.
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index e3bb43b..1265355 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -26,6 +26,7 @@
>  #define EXT2_SUPER_MAGIC 0xEF53
>  #define EXT3_SUPER_MAGIC 0xEF53
>  #define EXT4_SUPER_MAGIC 0xEF53
> +#define EXOFS_SUPER_MAGIC0x5DF5
>  #define F2FS_SUPER_MAGIC 0xF2F52010
>  #define FUTEXFS_SUPER_MAGIC  0xBAD1DEA
>  #define HOSTFS_SUPER_MAGIC   0x00c0ffee
> 



Re: [PATCH, RFC] MAINTAINERS: update OSD entries

2017-05-03 Thread Boaz Harrosh
On 05/02/2017 02:33 PM, Jeff Layton wrote:
> On Tue, 2017-05-02 at 09:57 +0200, Christoph Hellwig wrote:
>> The open-osd domain doesn't exist anymore, and mails to the list lead
>> to really annoying bounced that repeat every day.
>>
>> Also the primarydata address for Benny bounces, and while I have a new
>> one for him he doesn't seem to be maintaining the OSD code any more.
>>
>> Which beggs the question:  should we really leave the Supported status
>> in MAINTAINERS given that the code is barely maintained?
>>
>> Signed-off-by: Christoph Hellwig 
>> ---
>>  MAINTAINERS | 4 
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 1bb06c5f7716..28dd83a1d9e2 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -9418,10 +9418,6 @@ F:drivers/net/wireless/intersil/orinoco/
>>  
>>  OSD LIBRARY and FILESYSTEM
>>  M:  Boaz Harrosh 
>> -M:  Benny Halevy 
>> -L:  osd-...@open-osd.org
>> -W:  http://open-osd.org
>> -T:  git git://git.open-osd.org/open-osd.git
>>  S:  Maintained
>>  F:  drivers/scsi/osd/
>>  F:  include/scsi/osd_*
> 
> Hah, you beat me to it! I was going to spin up a patch for this today.
> 
> Acked-by: Jeff Layton 

Acked-by: Boaz Harrosh 

> 



Re: [PATCH] ore: fix spelling mistake: "Multples" -> "multiples"

2017-04-23 Thread Boaz Harrosh
On 04/22/2017 03:48 PM, Colin King wrote:
> From: Colin Ian King 
> 
> trivial fix to spelling mistake in ORE_ERR message and make word all
> lower case.
> 
> Signed-off-by: Colin Ian King 

Thanks
ACK-by: Boaz Harrosh 

> ---
>  fs/exofs/ore.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
> index 8bb72807e70d..811522ae45e1 100644
> --- a/fs/exofs/ore.c
> +++ b/fs/exofs/ore.c
> @@ -68,7 +68,7 @@ int ore_verify_layout(unsigned total_comps, struct 
> ore_layout *layout)
>   }
>   if (0 != (layout->stripe_unit & ~PAGE_MASK)) {
>   ORE_ERR("Stripe Unit(0x%llx)"
> -   " must be Multples of PAGE_SIZE(0x%lx)\n",
> +   " must be multiples of PAGE_SIZE(0x%lx)\n",
> _LLU(layout->stripe_unit), PAGE_SIZE);
>   return -EINVAL;
>   }
> 



Re: [PATCH 3/4] nfs: remove the objlayout driver

2017-04-23 Thread Boaz Harrosh
On 04/21/2017 05:00 PM, Trond Myklebust wrote:
> Maintenance is not development. It’s about doing all the followup
> _after_ the feature is declared to be developed. That’s been missing
> for quite some time in the case of the OSD pNFS code, which is why
> I’m not even bothering to consider staging. If you are saying you are
> still maintaining exofs, and want to continue doing so, then great,
> but note that there is a file called BUGS in that directory that has
> been untouched since 2008, and that’s why I thing staging is a good
> idea.
> 

No, the BUGS file is just stale. As you said, it was never updated. All
the bugs (1) in there have not existed for a long, long time.
I will send a patch to remove the file.

Yes, I maintain this fs. It has the complete feature list of what was
first intended. I keep running it every major kernel release to make
sure it actually runs, and there are no regressions.

If this FS stands in the way of any new development, please, anyone,
contact me and I will help in the conversion of exofs and the
osd-uld.

Thanks
Boaz



Re: [PATCH 3/4] nfs: remove the objlayout driver

2017-04-21 Thread Boaz Harrosh
On 04/20/2017 11:18 PM, Trond Myklebust wrote:
<>
> 
> OK. I'm applying this patch for the 4.12 merge window. 

That is understandable; this code was not tested for a long while.

> If, as Boaz
> suggests, there is still an interest in exofs, then I suggest we put
> that to the test by moving it into the STAGING area, to see if someone
> will step up to maintain it.
> 

I do not understand what you mean by "someone will maintain it".
What is it that is missing in exofs that you see needs development?

As I said, I regularly run and test this FS and update it as things advance.
There were no new features for a while, but no regressions either.

Please explain what "will maintain it" means.

> Cheers
>   Trond
> 

Thanks
Boaz



Re: linux-next: manual merge of the scsi-mkp tree with the char-misc tree

2017-04-20 Thread Boaz Harrosh
On 04/07/2017 10:50 PM, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 13:29 -0600, Logan Gunthorpe wrote:
>> On 07/04/17 09:49 AM, Bart Van Assche wrote:
>>> Sorry that I had not yet noticed Logan's patch series. Should my two
>>> patches that conflict with Logan's patch series be dropped and reworked
>>> after Logan's patches are upstream?
>>
>> Yeah, Greg took my patchset around a few maintainers relatively quickly.
>> This is the second conflict, so sorry about that. Looks like the easiest
>> thing would be to just base your change off of mine. It doesn't look too
>> difficult. If you can do it before my patch hits upstream, I'd
>> appreciate some testing and/or review as no one from the scsi side
>> responded and that particular patch was a bit more involved than I would
>> have liked.
> 
> Boaz, had you noticed Logan's osd patch? If not, can you have a look?
> 

I did look; I even sent an ACK on one of the early versions.
The merge breakage is more of a build issue, because I never
had get_device fail for me in my testing, so it is more
academic.

Yes they both look fine BTW
Boaz

> Thanks,
> Bart.
> 



Re: [PATCH 1/4] block: remove the osdblk driver

2017-04-19 Thread Boaz Harrosh
On 04/12/2017 07:01 PM, Christoph Hellwig wrote:
> This was just a proof of concept user for the SCSI OSD library, and
> never had any real users.
> 
> Signed-off-by: Christoph Hellwig 

Yes please remove this driver

ACK-by Boaz Harrosh 

> ---
>  drivers/block/Kconfig  |  16 --
>  drivers/block/Makefile |   1 -
>  drivers/block/osdblk.c | 693 
> -
>  3 files changed, 710 deletions(-)
>  delete mode 100644 drivers/block/osdblk.c
> 
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index a1c2e816128f..58e24c354933 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -312,22 +312,6 @@ config BLK_DEV_SKD
>  
>   Use device /dev/skd$N amd /dev/skd$Np$M.
>  
> -config BLK_DEV_OSD
> - tristate "OSD object-as-blkdev support"
> - depends on SCSI_OSD_ULD
> - ---help---
> -   Saying Y or M here will allow the exporting of a single SCSI
> -   OSD (object-based storage) object as a Linux block device.
> -
> -   For example, if you create a 2G object on an OSD device,
> -   you can then use this module to present that 2G object as
> -   a Linux block device.
> -
> -   To compile this driver as a module, choose M here: the
> -   module will be called osdblk.
> -
> -   If unsure, say N.
> -
>  config BLK_DEV_SX8
>   tristate "Promise SATA SX8 support"
>   depends on PCI
> diff --git a/drivers/block/Makefile b/drivers/block/Makefile
> index b12c772bbeb3..42e8cee1cbc7 100644
> --- a/drivers/block/Makefile
> +++ b/drivers/block/Makefile
> @@ -22,7 +22,6 @@ obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o
>  obj-$(CONFIG_MG_DISK)+= mg_disk.o
>  obj-$(CONFIG_SUNVDC) += sunvdc.o
>  obj-$(CONFIG_BLK_DEV_SKD)+= skd.o
> -obj-$(CONFIG_BLK_DEV_OSD)+= osdblk.o
>  
>  obj-$(CONFIG_BLK_DEV_UMEM)   += umem.o
>  obj-$(CONFIG_BLK_DEV_NBD)+= nbd.o
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> deleted file mode 100644
> index 8127b8201a01..
> --- a/drivers/block/osdblk.c
> +++ /dev/null
> @@ -1,693 +0,0 @@
> -
> -/*
> -   osdblk.c -- Export a single SCSI OSD object as a Linux block device
> -
> -
> -   Copyright 2009 Red Hat, Inc.
> -
> -   This program is free software; you can redistribute it and/or modify
> -   it under the terms of the GNU General Public License as published by
> -   the Free Software Foundation.
> -
> -   This program is distributed in the hope that it will be useful,
> -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> -   GNU General Public License for more details.
> -
> -   You should have received a copy of the GNU General Public License
> -   along with this program; see the file COPYING.  If not, write to
> -   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
> -
> -
> -   Instructions for use
> -   
> -
> -   1) Map a Linux block device to an existing OSD object.
> -
> -  In this example, we will use partition id 1234, object id 5678,
> -  OSD device /dev/osd1.
> -
> -  $ echo "1234 5678 /dev/osd1" > /sys/class/osdblk/add
> -
> -
> -   2) List all active blkdev<->object mappings.
> -
> -  In this example, we have performed step #1 twice, creating two blkdevs,
> -  mapped to two separate OSD objects.
> -
> -  $ cat /sys/class/osdblk/list
> -  0 174 1234 5678 /dev/osd1
> -  1 179 1994 897123 /dev/osd0
> -
> -  The columns, in order, are:
> -  - blkdev unique id
> -  - blkdev assigned major
> -  - OSD object partition id
> -  - OSD object id
> -  - OSD device
> -
> -
> -   3) Remove an active blkdev<->object mapping.
> -
> -  In this example, we remove the mapping with blkdev unique id 1.
> -
> -  $ echo 1 > /sys/class/osdblk/remove
> -
> -
> -   NOTE:  The actual creation and deletion of OSD objects is outside the 
> scope
> -   of this driver.
> -
> - */
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -#define DRV_NAME "osdblk"
> -#define PFX DRV_NAME ": "
> -
> -/* #define _OSDBLK_DEBUG */
> -#ifdef _OSDBLK_DEBUG
> -#define OSDBLK_DEBUG(fmt, a...) \
> - printk(KERN_NOTICE "osdblk @%s:%d: " fmt, __func__, __LINE__, ##a)
> -#else
> -#define OSDBLK_DEBUG(fmt, a...) \
> - do { if (0) printk(fmt, ##a); } while (0)
> -#endif
> -
> -MODULE_AUTHOR("

Re: RFC: drop the T10 OSD code and its users

2017-04-18 Thread Boaz Harrosh
On 04/12/2017 07:01 PM, Christoph Hellwig wrote:

Hi Sir Christoph

> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and the
> other two users (osdblk and exofs) were simple example of it's usage.
> 

I understand why osdblk might be a pain, and was broken from day one, and
should by all means go away ASAP.

But exofs should not be bothering anyone, and as far as I know it does
not use any special APIs except the osd_uld code, of course.

> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, so I think it's finally time to drop it.
> 

Please tell me, what are those changes you are talking about? I might be
able to help in the conversion. I guess you mean osd_uld and the upper SCSI API.
Just point me at a tree where osd_uld is broken, and perhaps with a little
guidance from you I can do a satisfactory conversion.

It is true that no new code went in for a long while, but I still, from time
to time, run a setup and test that the whole stack, like iscsi-bidi and so on,
still works.

That said, yes, only a stand-alone exofs was tested for a long time; a full
pnfs setup is missing any supporting server. So yes, I admit that pnfs-obj is
getting very rotten, and is most probably broken on the pnfs side of things.
[Which I admit makes my little plea kind of moot ;-) ]

Every once in a while I get emails from students basing all kinds of
interesting experiments on top of exofs and object-based storage. So for
the sake of academics, and for the sake of true bidi-stack testing, might
we want to evaluate what the upcoming cost is, and what minimum set we are
willing to keep?

Please advise?

thanks
Boaz

> These patches are against Jens' block for-next tree as that already
> has various modifications of the SCSI code.
> 



Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Boaz Harrosh
On 03/30/2017 09:35 PM, Jeff Layton wrote:
<>
> Yeah, I imagine we'd need a on-disk change for this unless there's
> something already present that we could use in place of a crash counter.
> 

Perhaps we can use s_mtime and/or s_wtime in some way; I'm not sure
what the parallel for that is in xfs.
s_mtime - time of last mount
s_wtime - time of last mount, umount, freeze, unfreeze, remount, ...
Of course you'll need a per-FS vector to access that.

But this will need some math foo to get the bits compacted correctly;
just a thought.
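
A rough sketch of that "math foo", assuming an arbitrary 16/48-bit split:
a per-mount/crash counter sits in the high bits of the 64-bit i_version,
so any crash or mount bumps every previously observed value:

#include <linux/types.h>

#define IVER_CRASH_SHIFT 48

/* pack a mount/crash counter above a 48-bit change counter */
static inline u64 iversion_pack(u16 crash_count, u64 change_count)
{
	return ((u64)crash_count << IVER_CRASH_SHIFT) |
	       (change_count & ((1ULL << IVER_CRASH_SHIFT) - 1));
}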

Thanks
Boaz



Re: RFC: reject unknown open flags

2017-03-30 Thread Boaz Harrosh
On 03/30/2017 09:45 PM, Linus Torvalds wrote:
> On Thu, Mar 30, 2017 at 11:26 AM, Christoph Hellwig  wrote:
>>
>> That would be nice, but still won't work as we blindly copy f_flags
>> into F_GETFL, not even masking our internal FMODE_ bits.
> 
> Ok, *that* is just silly of us, and we could try to just fix, and even 
> backport.
> 
> There's no possible valid use I could see where that should break
> (famous last words - user code does some damn odd things at times).
> 
> Of course, that won't fix old kernels that are out there, but then
> neither would your original patch...
> 
> Side note: I think you *can* detect the O_ATOMIC support by using
> F_SETFL, because F_SETFL only allows you to change flags that we
> recognize. So somebody who really wants to *guarantee* that O_ATOMIC
> is there and honored even with old kernels could presumable do
> something like
> 
>fd = open(..); // *no* O_ATOMIC
>fcnt(fd, F_SETFL, O_ATOMIC);
>if (fcnt(fd, F_GETFL, NULL) & O_ATOMIC)
> // Yay! We actually got it
>else
> // I guess we need to fall back on old behavior
> 
> although I agree that that is ridiculously inconvenient and not a
> great thing, and it's worth trying to aim for some better model.
> 

Perhaps in that case it is time for an F_GETFL2 or an F_GET_REAL_FL
that gives you the nice simple user code Linus wanted for new applications,
and solves it forward and backwards for applications and Kernels?
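
For reference, a compilable version of the probe Linus sketches above.
O_ATOMIC is not in mainline, so the flag value below is purely hypothetical;
the trick relies on F_SETFL silently dropping flags the kernel does not
recognize:

#include <fcntl.h>

#ifndef O_ATOMIC
#define O_ATOMIC 010000000000	/* hypothetical value, illustration only */
#endif

int open_maybe_atomic(const char *path, int flags, int *atomic_ok)
{
	int fd = open(path, flags);	/* no O_ATOMIC yet */

	if (fd >= 0) {
		fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ATOMIC);
		/* read the flags back: did the kernel keep O_ATOMIC? */
		*atomic_ok = !!(fcntl(fd, F_GETFL) & O_ATOMIC);
	}
	return fd;
}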

Just my $0.017
Boaz

> Linus
> 



Re: [PATCH] scsi: osd_uld: remove an unneeded NULL check

2017-03-23 Thread Boaz Harrosh
On 03/23/2017 12:41 PM, Dan Carpenter wrote:
> We don't call the remove() function unless probe() succeeds so "oud"
> can't be NULL here.  Plus, if it were NULL, we dereference it on the
> next line so it would crash anyway.
> 
> Signed-off-by: Dan Carpenter 
> 

Thanks sure!
ACK-by Boaz Harrosh 

> diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c
> index 4101c3178411..8b9941a5687a 100644
> --- a/drivers/scsi/osd/osd_uld.c
> +++ b/drivers/scsi/osd/osd_uld.c
> @@ -507,10 +507,9 @@ static int osd_remove(struct device *dev)
>   struct scsi_device *scsi_device = to_scsi_device(dev);
>   struct osd_uld_device *oud = dev_get_drvdata(dev);
>  
> - if (!oud || (oud->od.scsi_device != scsi_device)) {
> - OSD_ERR("Half cooked osd-device %p,%p || %p!=%p",
> - dev, oud, oud ? oud->od.scsi_device : NULL,
> - scsi_device);
> + if (oud->od.scsi_device != scsi_device) {
> + OSD_ERR("Half cooked osd-device %p, || %p!=%p",
> + dev, oud->od.scsi_device, scsi_device);
>   }
>  
>   cdev_device_del(&oud->cdev, &oud->class_dev);
> 



Re: [Lsf-pc] [LSF/MM TOPIC] do we really need PG_error at all?

2017-02-28 Thread Boaz Harrosh
On 02/28/2017 03:11 AM, Jeff Layton wrote:
<>
> 
> I'll probably have questions about the read side as well, but for now it
> looks like it's mostly used in an ad-hoc way to communicate errors
> across subsystems (block to fs layer, for instance).

If memory does not fail me, it used to be checked a long time ago in the
read-ahead case. In the buffered read case, the first page is read synchronously
and any error is returned to the caller, but then a read-ahead chunk is
read async, all the while the original thread has returned to the application.
So any errors are only recorded in the page bit, since otherwise uptodate
is off and the IO will be retransmitted. Then the move to read_iter changed
all that, I think.
But again, this is like 5-6 years ago, and maybe I didn't even understand it
very well.
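
A rough sketch of the old completion path being described, assuming the
pre-4.3 style bi_end_io signature; an async readahead failure can leave
its mark only on the page bits, because the submitting caller has already
returned to the application:

static void readpage_end_io(struct bio *bio, int err)
{
	struct page *page = bio->bi_io_vec[0].bv_page;

	if (err) {
		/* no caller to return the error to: record it on the page */
		ClearPageUptodate(page);
		SetPageError(page);
	} else {
		SetPageUptodate(page);
	}
	unlock_page(page);	/* !uptodate forces a re-read next time */
}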

> --
> Jeff Layton 
> 

I would like documentation of all this as well, please. Where are the
tests for this?

Thanks
Boaz



Re: [PATCH v1 45/54] exofs: convert to bio_for_each_segment_all_sp()

2017-01-03 Thread Boaz Harrosh
On 12/27/2016 06:04 PM, Ming Lei wrote:
> Signed-off-by: Ming Lei 

Cool
ACK-by: Boaz Harrosh 

> ---
>  fs/exofs/ore.c  | 3 ++-
>  fs/exofs/ore_raid.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
> index 8bb72807e70d..38a7d8bfdd4c 100644
> --- a/fs/exofs/ore.c
> +++ b/fs/exofs/ore.c
> @@ -406,8 +406,9 @@ static void _clear_bio(struct bio *bio)
>  {
>   struct bio_vec *bv;
>   unsigned i;
> + struct bvec_iter_all bia;
>  
> - bio_for_each_segment_all(bv, bio, i) {
> + bio_for_each_segment_all_sp(bv, bio, i, bia) {
>   unsigned this_count = bv->bv_len;
>  
>   if (likely(PAGE_SIZE == this_count))
> diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
> index 27cbdb697649..37c0a9aa2ec2 100644
> --- a/fs/exofs/ore_raid.c
> +++ b/fs/exofs/ore_raid.c
> @@ -429,6 +429,7 @@ static void _mark_read4write_pages_uptodate(struct 
> ore_io_state *ios, int ret)
>  {
>   struct bio_vec *bv;
>   unsigned i, d;
> + struct bvec_iter_all bia;
>  
>   /* loop on all devices all pages */
>   for (d = 0; d < ios->numdevs; d++) {
> @@ -437,7 +438,7 @@ static void _mark_read4write_pages_uptodate(struct 
> ore_io_state *ios, int ret)
>   if (!bio)
>   continue;
>  
> - bio_for_each_segment_all(bv, bio, i) {
> + bio_for_each_segment_all_sp(bv, bio, i, bia) {
>   struct page *page = bv->bv_page;
>  
>   SetPageUptodate(page);
> 



Re: [PATCH v2 1/3] introduce memcpy_nocache()

2016-11-01 Thread Boaz Harrosh
On 10/28/2016 04:54 AM, Boylston, Brian wrote:
> Boaz Harrosh wrote on 2016-10-26:
>> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>>> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
>>> if possible.  Without arch-specific support, this defaults to just
>>> memcpy().  For now, include arch-specific support for x86.
>>>
>>> Cc: Ross Zwisler 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: "H. Peter Anvin" 
>>> Cc: 
>>> Cc: Al Viro 
>>> Cc: Dan Williams 
>>> Signed-off-by: Brian Boylston 
>>> Reviewed-by: Toshi Kani 
>>> Reported-by: Oliver Moreno 
>>> ---
>>>  arch/x86/include/asm/string_32.h |  3 +++
>>>  arch/x86/include/asm/string_64.h |  3 +++
>>>  arch/x86/lib/misc.c  | 12 
>>>  include/linux/string.h   | 15 +++
>>>  4 files changed, 33 insertions(+)
>>> diff --git a/arch/x86/include/asm/string_32.h 
>>> b/arch/x86/include/asm/string_32.h
>>> index 3d3e835..64f80c0 100644
>>> --- a/arch/x86/include/asm/string_32.h
>>> +++ b/arch/x86/include/asm/string_32.h
>>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void 
>>> *from, size_t len)
>>>
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMMOVE
>>>  void *memmove(void *dest, const void *src, size_t n);
>>> diff --git a/arch/x86/include/asm/string_64.h 
>>> b/arch/x86/include/asm/string_64.h
>>> index 90dbbd9..a8fdd55 100644
>>> --- a/arch/x86/include/asm/string_64.h
>>> +++ b/arch/x86/include/asm/string_64.h
>>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t 
>>> len);
>>>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMSET
>>>  void *memset(void *s, int c, size_t n);
>>>  void *__memset(void *s, int c, size_t n);
>>> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
>>> index 76b373a..c993ab3 100644
>>> --- a/arch/x86/lib/misc.c
>>> +++ b/arch/x86/lib/misc.c
>>> @@ -1,3 +1,6 @@
>>> +#include 
>>> +#include 
>>> +
>>>  /*
>>>   * Count the digits of @val including a possible sign.
>>>   *
>>> @@ -19,3 +22,12 @@ int num_digits(int val)
>>> }
>>> return d;
>>>  }
>>> +
>>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> +   __copy_from_user_inatomic_nocache(dest, src, count);
>>> +   return dest;
>>> +}
>>> +EXPORT_SYMBOL(memcpy_nocache);
>>> +#endif
>>> diff --git a/include/linux/string.h b/include/linux/string.h
>>> index 26b6f6a..7f40c41 100644
>>> --- a/include/linux/string.h
>>> +++ b/include/linux/string.h
>>> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>>>  #ifndef __HAVE_ARCH_MEMCPY
>>>  extern void * memcpy(void *,const void *,__kernel_size_t);
>>>  #endif
>>> +
>>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +/**
>>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>>> + * processor cache if possible
>>> + * @dest: Where to copy to
>>> + * @src: Where to copy from
>>> + * @count: The size of the area.
>>> + */
>>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t 
>>> count)
>>> +{
>>> +   return memcpy(dest, src, count);
>>> +}
>>
>> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch 
>> switches.
>>
>> Feels bad to add yet just another arch switch over __copy_user_nocache
>>
>> Just feels like too many things that do the same thing. Sigh
> 
> I agree that this looks like a nicer path.
> 
> I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
> but lib/iov_iter.c doesn't currently #include linux/pmem.h.  Would it be
> acceptable to add it?  Also, I wasn't sure if memcpy_to_pmem() would always
> mean exactly "memcpy nocache".
> 

I think this is the way to go. In my opinion there is no reason why not to

Re: [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()

2016-10-26 Thread Boaz Harrosh
On 10/26/2016 06:50 PM, Brian Boylston wrote:
> copy_from_iter_nocache() now uses nocache copies for all types of iovecs
> on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.
> 
> Cc: Ross Zwisler 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: 
> Cc: Al Viro 
> Cc: Dan Williams 
> Signed-off-by: Brian Boylston 
> Reviewed-by: Toshi Kani 
> Reported-by: Oliver Moreno 
> ---
>  arch/x86/include/asm/pmem.h | 19 +--
>  1 file changed, 1 insertion(+), 18 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index 643eba4..2fbf4ae 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t 
> size)
>   clwb(p);
>  }
>  
> -/*
> - * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
> - * iterators, so for other types (bvec & kvec) we must do a cache write-back.
> - */
> -static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
> -{
> - return iter_is_iovec(i) == false;
> -}
> -
>  /**
>   * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
>   * @addr:PMEM destination address
> @@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
>  static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
>   struct iov_iter *i)
>  {
> - size_t len;
> -
> - /* TODO: skip the write-back by always using non-temporal stores */
> - len = copy_from_iter_nocache(addr, bytes, i);
> -
> - if (__iter_needs_pmem_wb(i))
> - arch_wb_cache_pmem(addr, bytes);
> -
> - return len;
> + return copy_from_iter_nocache(addr, bytes, i);

I wish you would remove all this completely. I don't see the point of it anymore.

Thanks a lot for doing this. I have had this patch in my trees for ages and was
too lazy to fight it through. Yes, it is a big boost for many workloads.

Lots of gratitude
Boaz

>  }
>  
>  /**
> 



Re: [PATCH v2 1/3] introduce memcpy_nocache()

2016-10-26 Thread Boaz Harrosh
On 10/26/2016 06:50 PM, Brian Boylston wrote:
> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
> if possible.  Without arch-specific support, this defaults to just
> memcpy().  For now, include arch-specific support for x86.
> 
> Cc: Ross Zwisler 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: 
> Cc: Al Viro 
> Cc: Dan Williams 
> Signed-off-by: Brian Boylston 
> Reviewed-by: Toshi Kani 
> Reported-by: Oliver Moreno 
> ---
>  arch/x86/include/asm/string_32.h |  3 +++
>  arch/x86/include/asm/string_64.h |  3 +++
>  arch/x86/lib/misc.c  | 12 
>  include/linux/string.h   | 15 +++
>  4 files changed, 33 insertions(+)
> 
> diff --git a/arch/x86/include/asm/string_32.h 
> b/arch/x86/include/asm/string_32.h
> index 3d3e835..64f80c0 100644
> --- a/arch/x86/include/asm/string_32.h
> +++ b/arch/x86/include/asm/string_32.h
> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void 
> *from, size_t len)
>  
>  #endif
>  
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
>  #define __HAVE_ARCH_MEMMOVE
>  void *memmove(void *dest, const void *src, size_t n);
>  
> diff --git a/arch/x86/include/asm/string_64.h 
> b/arch/x86/include/asm/string_64.h
> index 90dbbd9..a8fdd55 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t 
> len);
>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>  #endif
>  
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
>  #define __HAVE_ARCH_MEMSET
>  void *memset(void *s, int c, size_t n);
>  void *__memset(void *s, int c, size_t n);
> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
> index 76b373a..c993ab3 100644
> --- a/arch/x86/lib/misc.c
> +++ b/arch/x86/lib/misc.c
> @@ -1,3 +1,6 @@
> +#include 
> +#include 
> +
>  /*
>   * Count the digits of @val including a possible sign.
>   *
> @@ -19,3 +22,12 @@ int num_digits(int val)
>   }
>   return d;
>  }
> +
> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
> +void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + __copy_from_user_inatomic_nocache(dest, src, count);
> + return dest;
> +}
> +EXPORT_SYMBOL(memcpy_nocache);
> +#endif
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 26b6f6a..7f40c41 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>  #ifndef __HAVE_ARCH_MEMCPY
>  extern void * memcpy(void *,const void *,__kernel_size_t);
>  #endif
> +
> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
> +/**
> + * memcpy_nocache - Copy one area of memory to another, avoiding the
> + * processor cache if possible
> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.
> + */
> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + return memcpy(dest, src, count);
> +}

What about memcpy_to_pmem() in linux/pmem.h? It already has all the arch
switches.

It feels bad to add yet another arch switch over __copy_user_nocache.

Just feels like too many things doing the same thing. Sigh

Boaz

> +#endif
> +
>  #ifndef __HAVE_ARCH_MEMMOVE
>  extern void * memmove(void *,const void *,__kernel_size_t);
>  #endif
> 



Re: [PATCH 4/7] fs: make remaining filesystems use .rename2

2016-08-23 Thread Boaz Harrosh
On 08/23/2016 07:24 PM, Boaz Harrosh wrote:
> On 08/23/2016 05:05 PM, Miklos Szeredi wrote:
>> This is trivial to do:
>>
>>  - add flags argument to foo_rename()
>>  - check if flags is zero
>>  - assign foo_rename() to .rename2 instead of .rename
>>
>> This doesn't mean it's impossible to support RENAME_NOREPLACE for these
>> filesystems, but it is not trivial, like for local filesystems.
>> RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
>> for a file to be created on one host while it is overwritten by rename on
>> another host).
>>
>> Filesystems converted:
>>
>> 9p, afs, ceph, coda, ecryptfs, exofs, kernfs, lustre, ncpfs, nfs, ocfs2,
>> orangefs.
>>
>> After this, we can get rid of the duplicate interfaces for rename.
>>
>> Signed-off-by: Miklos Szeredi 
>> Cc: Eric Van Hensbergen 
>> Cc: David Howells 
>> Cc: Ilya Dryomov 
>> Cc: Jan Harkes 
>> Cc: Tyler Hicks 
>> Cc: Boaz Harrosh 
> 
> Hi, exofs is not a distributed filesystem in the nfs-client
> sense. All metadata operations happen on the single exofs mount.
> The distribution of an exofs cluster is done by an NFSD-like daemon
> that supports pNFS, and a standard pNFS client.
> So the code you see below is just the same as an ext4 FS with
> a RAID of iSCSI devices below it. (Even if this FS is later
> exported by an NFSD server.)
> 
> That said, it is fine as is; don't sweat over this unused FS, so:
> Acked-by: Boaz Harrosh 
> 
> <>
> 
>> diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
>> index 622a686bb08b..897280163f3c 100644
>> --- a/fs/exofs/namei.c
>> +++ b/fs/exofs/namei.c
>> @@ -227,7 +227,8 @@ static int exofs_rmdir(struct inode *dir, struct dentry 
>> *dentry)
>>  }
>>  
>>  static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
>> -struct inode *new_dir, struct dentry *new_dentry)
>> +struct inode *new_dir, struct dentry *new_dentry,
>> +unsigned int flags)
>>  {
>>  struct inode *old_inode = d_inode(old_dentry);
>>  struct inode *new_inode = d_inode(new_dentry);
>> @@ -237,6 +238,9 @@ static int exofs_rename(struct inode *old_dir, struct 
>> dentry *old_dentry,
>>  struct exofs_dir_entry *old_de;
>>  int err = -ENOENT;
>>  
>> +if (flags)
>> +return -EINVAL;
>> +

+   if (flags & ~RENAME_NOREPLACE)
+   return -EINVAL;
+

And move this to the other patch if you feel like it. (For a local FS like
exofs the VFS already verifies, under the rename locks, that the target does
not exist before calling ->rename2 with RENAME_NOREPLACE, so accepting the
flag should be all that is needed.)

>>  old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
>>  if (!old_de)
>>  goto out;
>> @@ -310,7 +314,7 @@ const struct inode_operations exofs_dir_inode_operations 
>> = {
>>  .mkdir  = exofs_mkdir,
>>  .rmdir  = exofs_rmdir,
>>  .mknod  = exofs_mknod,
>> -.rename = exofs_rename,
>> +.rename2= exofs_rename,
>>  .setattr= exofs_setattr,
>>  };
>>  
> <>
> 



Re: [PATCH 4/7] fs: make remaining filesystems use .rename2

2016-08-23 Thread Boaz Harrosh
On 08/23/2016 05:05 PM, Miklos Szeredi wrote:
> This is trivial to do:
> 
>  - add flags argument to foo_rename()
>  - check if flags is zero
>  - assign foo_rename() to .rename2 instead of .rename
> 
> This doesn't mean it's impossible to support RENAME_NOREPLACE for these
> filesystems, but it is not trivial, like for local filesystems.
> RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
> for a file to be created on one host while it is overwritten by rename on
> another host).
> 
> Filesystems converted:
> 
> 9p, afs, ceph, coda, ecryptfs, exofs, kernfs, lustre, ncpfs, nfs, ocfs2,
> orangefs.
> 
> After this, we can get rid of the duplicate interfaces for rename.
> 
> Signed-off-by: Miklos Szeredi 
> Cc: Eric Van Hensbergen 
> Cc: David Howells 
> Cc: Ilya Dryomov 
> Cc: Jan Harkes 
> Cc: Tyler Hicks 
> Cc: Boaz Harrosh 

Hi, exofs is not a distributed filesystem in the nfs-client
sense. All metadata operations happen on the single exofs mount.
The distribution of an exofs cluster is done by an NFSD-like daemon
that supports pNFS, and a standard pNFS client.
So the code you see below is just the same as an ext4 FS with
a RAID of iSCSI devices below it. (Even if this FS is later
exported by an NFSD server.)

That said, it is fine as is; don't sweat over this unused FS, so:
Acked-by: Boaz Harrosh 

<>

> diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
> index 622a686bb08b..897280163f3c 100644
> --- a/fs/exofs/namei.c
> +++ b/fs/exofs/namei.c
> @@ -227,7 +227,8 @@ static int exofs_rmdir(struct inode *dir, struct dentry 
> *dentry)
>  }
>  
>  static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
> - struct inode *new_dir, struct dentry *new_dentry)
> + struct inode *new_dir, struct dentry *new_dentry,
> + unsigned int flags)
>  {
>   struct inode *old_inode = d_inode(old_dentry);
>   struct inode *new_inode = d_inode(new_dentry);
> @@ -237,6 +238,9 @@ static int exofs_rename(struct inode *old_dir, struct 
> dentry *old_dentry,
>   struct exofs_dir_entry *old_de;
>   int err = -ENOENT;
>  
> + if (flags)
> + return -EINVAL;
> +
>   old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
>   if (!old_de)
>   goto out;
> @@ -310,7 +314,7 @@ const struct inode_operations exofs_dir_inode_operations 
> = {
>   .mkdir  = exofs_mkdir,
>   .rmdir  = exofs_rmdir,
>   .mknod  = exofs_mknod,
> - .rename = exofs_rename,
> + .rename2= exofs_rename,
>   .setattr= exofs_setattr,
>  };
>  
<>



Re: DAX can not work on virtual nvdimm device

2016-08-21 Thread Boaz Harrosh
On 08/19/2016 09:30 PM, Ross Zwisler wrote:
> On Fri, Aug 19, 2016 at 07:59:29AM -0700, Dan Williams wrote:
>> On Fri, Aug 19, 2016 at 4:19 AM, Xiao Guangrong
>>  wrote:
>>>
>>> Hi Dan,
>>>
>>> Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
>>> more detailed info please refer to:
>>>https://bugzilla.redhat.com/show_bug.cgi?id=1365721
>>>
>>> The reason for this bug is that the memory region created by mmap()
>>> on the dax-based file was gone so that the region can not be found
>>> in /proc/self/smaps during the runtime.
>>>
>>> This is a simple way to trigger this issue:
>>>mount -o dax /dev/pmem0 /mnt/pmem/
>>>vim /mnt/pmem/xxx
>>> then 'vim' is crashed due to segment fault.
>>>
>>> This bug can be reproduced on your tree, the top commit is
>>> 10d7902fa0e82b (dax: unmap/truncate on device shutdown), the kernel
>>> configure file is attached.
>>>
>>> Your thought or comment is highly appreciated.
>>
>> I'm going to be offline until Tuesday, but I will investigate when I'm
>> back.  In the meantime if Ross or Vishal had an opportunity to take a
>> look I wouldn't say "no" :).
> 
> I haven't been able to reproduce this vim segfault.  I'm using QEMU v2.6.0,
> and the kernel commit you mentioned, and your kernel config.
> 
> Here's my QEMU command line:
> 
> sudo ~/qemu/bin/qemu-system-x86_64 /var/lib/libvirt/images/alara.qcow2 \
> -machine pc,nvdimm -m 8G,maxmem=100G,slots=100  -object \
> memory-backend-file,id=mem1,share,mem-path=/dev/pmem0,size=8G -device \
> nvdimm,memdev=mem1,id=nv1 -smp 6 -machine pc,accel=kvm 
> 
> With this I'm able to mkfs the guest's /dev/pmem0, mount it with -o dax, and
> write a file with vim.
> 
> Can you reproduce your results with a pmem device created via a memmap kernel
> command line parameter in the guest?  You'll need to update your kernel
> config to enable CONFIG_X86_PMEM_LEGACY and CONFIG_X86_PMEM_LEGACY_DEVICE.

The last time I had crashes like this was when the memmap= or the KVM
command-line memory size and settings did not match the KVM-defined memory
settings. Nothing fails, but the pmemX device is configured over non-existent
memory. All writes just succeed but all reads return ZEROs.

Please check that the size of your pmem matches the memory size of the VM as
defined in virsh edit.
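
[For example, with hypothetical numbers: a guest booted with

	qemu-system-x86_64 ... -m 4G ...

but with memmap=8G!4G on the guest kernel command line will happily create
an 8G /dev/pmem0 starting at 4G, all of it over non-existent memory, so
writes are silently dropped and reads return zeros.]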

Boaz



Re: [PATCH] x86/mm: only allow memmap=XX!YY over existing RAM

2016-06-23 Thread Boaz Harrosh
On 06/20/2016 10:33 AM, Yigal Korman wrote:
> Before this patch, passing a range that is beyond the physical memory
> range will succeed, the user will see a /dev/pmem0 and will be able to
> access it. Reads will always return 0 and writes will be silently
> ignored.
> 
> I've gotten more than one bug report about mkfs.{xfs,ext4} or nvml
> failing that were eventually tracked down to be wrong values passed to
> memmap.
> 
> This patch prevents the above issue by instead of adding a new memory
> range, only update a RAM memory range with the PRAM type. This way,
> passing the wrong memmap will either not give you a pmem at all or give
> you a smaller one that actually has RAM behind it.
> 
> And if someone still needs to fake a pmem that doesn't have RAM behind
> it, they can simply do memmap=XX@YY,XX!YY.
> 

We have been running with this patch for a while in the lab and it does
solve the problem above with no ill effects, so:

Tested-by: Boaz Harrosh 
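
[For reference, the trick mentioned above, with hypothetical numbers:

	memmap=4G@20G,4G!20G

The @ clause first force-adds the 4G range at 20G as E820_RAM, and the
! clause then converts exactly that range to E820_PRAM, so it still goes
through the e820_update_range() path below.]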

> Signed-off-by: Yigal Korman 
> ---
>  arch/x86/kernel/e820.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 621b501..4bd4207 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -878,7 +878,7 @@ static int __init parse_memmap_one(char *p)
>   e820_add_region(start_at, mem_size, E820_RESERVED);
>   } else if (*p == '!') {
>   start_at = memparse(p+1, &p);
> - e820_add_region(start_at, mem_size, E820_PRAM);
> + e820_update_range(start_at, mem_size, E820_RAM, E820_PRAM);
>   } else
>   e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
>  
> 



Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-02 Thread Boaz Harrosh
On 05/02/2016 09:48 PM, Dan Williams wrote:
<>
>> And then it keeps the aligned buffered writes broken; they are still
>> broken after this set.
> 
> ...identical to the current situation with a traditional disk.
> 

Not true!! Please see what I wrote: "aligned buffered writes".
If there are no reads involved then there are no errors returned
to the application.

>> I have by now read the v2 patches, and I think you guys have not yet tried
>> the proper fix for dax_do_io. I think you need to go deeper into the loops
>> and selectively call bdev_* when an error hits a specific page copy. No need
>> to go through the direct_IO path at all.
> 
> We still reach a point where the minimum granularity of
> bdev_direct_access() is larger than a sector, so you end up still
> needing to have the application understand how to send a properly
> aligned I/O.  The semantics of how to send a properly aligned
> direct-I/O are already well understood, so we simply reuse that path.
> 

You are making a mountain out of a molehill. The simple copy of a file
from start (offset ZERO) to end-of-file, which is the most common usage
on earth, is perfectly aligned, needs no O_DIRECT, and is what is used
everywhere.

>> Do you need me to send you a patch to demonstrate what I mean?
> 
> I remain skeptical of what you are proposing, but yes, a patch has a
> better chance to move the discussion forward.
> 

Sigh! OK
Boaz



Re: [PATCH v2 0/3] Add alignment check for DAX mount

2016-05-02 Thread Boaz Harrosh
On 05/02/2016 09:42 PM, Toshi Kani wrote:
> When a partition is not aligned by 4KB, mount -o dax succeeds,
> but any read/write access to the filesystem fails, except for
> metadata update.
> 
> Add alignment check to ext4, ext2, and xfs.
> 
> v2:
>  - Use a helper function via ->direct_access for the check.
>(Christoph Hellwig)
>  - Call bdev_direct_access() with sector 0 for the check.
>(Boaz Harrosh)
> 
> ---
> Toshi Kani (3):
>  1/3 ext4: Add alignment check for DAX mount
>  2/3 ext2: Add alignment check for DAX mount
>  3/3 xfs: Add alignment check for DAX mount
> 

All patches look very good to me, and keep the
internals internal. Thanks Toshi

Reviewed-by: Boaz Harrosh 

> ---
>  fs/ext2/super.c| 21 +++--
>  fs/ext4/super.c| 20 ++--
>  fs/xfs/xfs_super.c | 23 +++
>  3 files changed, 56 insertions(+), 8 deletions(-)
> 



Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-02 Thread Boaz Harrosh
On 05/02/2016 09:10 PM, Dan Williams wrote:
<>
> 
> The semantic I am talking about preserving is:
> 
> buffered / unaligned write of a bad sector => -EIO on reading into the
> page cache
> 

What about an aligned buffered write, like a write from 0 to EOF?
Is this still broken? (And that is what restore apps do.)

> ...and that the only guaranteed way to clear an error (assuming the
> block device supports it) is an O_DIRECT write.
> 

Sure, fixing dax_do_io will guarantee that.

<>
> I still think we're talking past each other on this point.  

Yes we are!

> This patch
> set is not overloading error semantics, it's fixing the error handling
> problem that was introduced in this commit:
> 
>d475c6346a38 dax,ext2: replace XIP read and write with DAX I/O
> 
> ...where we started overloading O_DIRECT and dax_do_io() semantics.
> 

But the above does not fix them, does it? It just completely nullifies DAX
for O_DIRECT, which is a great pity; why did we do all this work in the
first place?

And then it keeps the aligned buffered writes broken; they are still
broken after this set.

I have by now read the v2 patches, and I think you guys have not yet tried
the proper fix for dax_do_io. I think you need to go deeper into the loops
and selectively call bdev_* when an error hits a specific page copy. No need
to go through the direct_IO path at all.
Do you need me to send you a patch to demonstrate what I mean?

But yes, I too feel that "we're talking past each other". I wanted
to come to LSF and talk to you, but was not invited. Should I call you?

Thanks
Boaz



Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-02 Thread Boaz Harrosh
On 05/02/2016 07:49 PM, Dan Williams wrote:
> On Mon, May 2, 2016 at 9:22 AM, Boaz Harrosh  wrote:
>> On 05/02/2016 07:01 PM, Dan Williams wrote:
>>> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh  wrote:
>>>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>>>> handle media errors, and thus cannot provide a recovery path that can
>>>>> send a write through the driver to clear errors.
>>>>>
>>>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>>>> direct_IO path instead of DAX.
>>>>>
>>>>
>>>> Really? What is your thinking here?
>>>>
>>>> What about all the current users of O_DIRECT? You have just made them
>>>> 4 times slower and "less concurrent"* than "buffered io" users, since
>>>> the direct_IO path will queue an IO request and all.
>>>> (And if it is not so slow then why do we need dax_do_io at all?
>>>> [Rhetorical])
>>>>
>>>> I hate that you overload the semantics of the known and expected
>>>> O_DIRECT flag for special pmem quirks. This is an incompatible
>>>> and unrelated overload of the semantics of O_DIRECT.
>>>
>>> I think it is the opposite situation, it us undoing the premature
>>> overloading of O_DIRECT that went in without performance numbers.
>>
>> We have tons of measurements. It is not hard to imagine the results though,
>> especially the 1000-threads case.
>>
>>> This implementation clarifies that dax_do_io() handles the lack of a
>>> page cache for buffered I/O and O_DIRECT behaves as it nominally would
>>> by sending an I/O to the driver.
>>
>>> It has the benefit of matching the
>>> error semantics of a typical block device where a buffered write could
>>> hit an error filling the page cache, but an O_DIRECT write potentially
>>> triggers the drive to remap the block.
>>>
>>
>> I fail to see how, for writes, the device error semantics regarding
>> remapping of blocks are any different between buffered and direct IO.
>> As far as the block device is concerned it is the exact same code path.
>> All the big differences are higher up in the VFS.
>>
>> And ... so you are willing to sacrifice the 99% hot path for the sake
>> of the 1% error path? And piggyback on poor O_DIRECT?
>>
>> Again, there are tons of O_DIRECT apps out there; why are you forcing
>> them to change if they want true pmem performance?
> 
> This isn't forcing them to change.  This is the path of least surprise
> as error semantics are identical to a typical block device.  Yes, an
> application can go faster by switching to the "buffered" / dax_do_io()
> path it can go even faster to switch to mmap() I/O and use DAX
> directly.  If we can later optimize the O_DIRECT path to bring its
> performance more in line with dax_do_io(), great, but the
> implementation should be correct first and optimized later.
> 

Why does it need to be either/or? Why not both?
Also, I disagree: if you are correct and dax_do_io is bad and needs fixing,
then you have broken applications. Because in the current model:

read => -EIO, write-buffered, sync()
gives you the same error semantics as: read => -EIO, write-direct-io

In fact this is what the delete-then-restore-from-backup model does today.
Who said it uses / must use direct IO? Actually I think it does not.

Two things I can think of which are better:
[1]
Why not go deeper into the dax io loops, and for any failed WRITE
page call bdev_rw_page() to let pmem.c clear / relocate
the error page? (A sketch follows [2] below.)

So reads return -EIO - which is what you wanted, no?
Writes get a memory error and retry with bdev_rw_page() to let the bdev
relocate / clear the error - which is what you wanted, no?

In the partial-page WRITE case on bad sectors we can carefully
read-modify-write sector-by-sector and zero out the bad sectors that could
not be read; what else? (Or enhance the bdev_rw_page() API.)

[2]
Only switch to the slow O_DIRECT path in the presence of errors, like you
wanted. But I still hate that you overload O_DIRECT with error semantics
that do not exist today; see above.
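
A rough, untested sketch of [1] against the current dax_io() WRITE loop.
Only bdev_rw_page(), copy_from_iter(), and copy_from_iter_pmem() are real
APIs here; sector_range_has_badblocks() and scratch_page are made up for
the illustration:

	/* per mapped chunk inside the WRITE side of dax_io() */
	if (unlikely(sector_range_has_badblocks(bdev, dax.sector,
						map_len >> 9))) {
		/* stage the user data and push just this page through
		 * the driver: pmem's ->rw_page(WRITE) clears / remaps
		 * the known-bad sectors for us
		 */
		copy_from_iter(page_address(scratch_page), map_len, iter);
		rc = bdev_rw_page(bdev, dax.sector, scratch_page, WRITE);
		if (rc)
			return rc;
	} else {
		len = copy_from_iter_pmem(dax.addr, map_len, iter);
	}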

Thanks
Boaz



Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-02 Thread Boaz Harrosh
On 05/02/2016 07:01 PM, Dan Williams wrote:
> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh  wrote:
>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>> handle media errors, and thus cannot provide a recovery path that can
>>> send a write through the driver to clear errors.
>>>
>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>> direct_IO path instead of DAX.
>>>
>>
>> Really? What are your thinking here?
>>
>> What about all the current users of O_DIRECT, you have just made them
>> 4 times slower and "less concurrent*" then "buffred io" users. Since
>> direct_IO path will queue an IO request and all.
>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>
>> I hate it that you overload the semantics of a known and expected
>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>> and unrelated overload of the semantics of O_DIRECT.
> 
> I think it is the opposite situation; it is undoing the premature
> overloading of O_DIRECT that went in without performance numbers.

We have tons of measurements. It is not hard to imagine the results though,
especially the 1000-threads case.

> This implementation clarifies that dax_do_io() handles the lack of a
> page cache for buffered I/O and O_DIRECT behaves as it nominally would
> by sending an I/O to the driver.  

> It has the benefit of matching the
> error semantics of a typical block device where a buffered write could
> hit an error filling the page cache, but an O_DIRECT write potentially
> triggers the drive to remap the block.
> 

I fail to see how, for writes, the device error semantics regarding remapping
of blocks are any different between buffered and direct IO. As far as the
block device is concerned it is the exact same code path. All the big
differences are higher up in the VFS.

And ... so you are willing to sacrifice the 99% hot path for the sake of the
1% error path? And piggyback on poor O_DIRECT?

Again, there are tons of O_DIRECT apps out there; why are you forcing them to
change if they want true pmem performance?

I still believe dax_do_io() can be made more resilient to errors, and can
clear errors on writes. Off I go digging in old patches ...

Cheers
Boaz



Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-02 Thread Boaz Harrosh
On 05/02/2016 06:51 PM, Vishal Verma wrote:
> On Mon, 2016-05-02 at 18:41 +0300, Boaz Harrosh wrote:
>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>
>>> All IO in a dax filesystem used to go through dax_do_io, which
>>> cannot
>>> handle media errors, and thus cannot provide a recovery path that
>>> can
>>> send a write through the driver to clear errors.
>>>
>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the
>>> IO
>>> path for DAX filesystems, use the same direct_IO path for both DAX
>>> and
>>> direct_io iocbs, but use the flags to identify when we are in
>>> O_DIRECT
>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the
>>> conventional
>>> direct_IO path instead of DAX.
>>>
>> Really? What is your thinking here?
>>
>> What about all the current users of O_DIRECT? You have just made them
>> 4 times slower and "less concurrent"* than "buffered io" users, since
>> the direct_IO path will queue an IO request and all.
>> (And if it is not so slow then why do we need dax_do_io at all?
>> [Rhetorical])
>>
>> I hate that you overload the semantics of the known and expected
>> O_DIRECT flag for special pmem quirks. This is an incompatible
>> and unrelated overload of the semantics of O_DIRECT.
> 
> We overloaded O_DIRECT a long time ago when we made DAX piggyback on
> the same path:
> 
> static inline bool io_is_direct(struct file *filp)
> {
>   return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> }
> 

No, as far as the user is concerned we have not. The O_DIRECT user
is still getting all the semantics he wants, i.e. no syncs, no
memory-cache usage, no copies ...

Only with DAX the buffered IO path is the same, since with pmem it is
faster. Then why not? The basic contract with the user did not break.

The above was just an implementation detail to easily navigate
through the Linux VFS IO stack and make the least amount of changes
in every FS that wanted to support DAX. (And dax_do_io is much
more like direct_IO than like page-cache IO.)

> Yes O_DIRECT on a DAX mounted file system will now be slower, but -
> 
>>
>>>
>>> This allows us a recovery path in the form of opening the file with
>>> O_DIRECT and writing to it with the usual O_DIRECT semantics
>>> (sector
>>> alignment restrictions).
>>>
>> I understand that you want sector-aligned IO, right? For the
>> clearing of errors. But I hate that you forced all O_DIRECT IO
>> to be slow for this.
>> Can you not make dax_do_io handle media errors? At least for the
>> parts of the IO that are aligned.
>> (And your recovery-path application above can use only aligned
>>  IO to make sure.)
>>
>> Please look for another solution. Even a special
>> IOCTL_DAX_CLEAR_ERROR.
> 
>  - see all the versions of this series prior to this one, where we try
> to do a fallback...
> 

And?

So now all O_DIRECT apps go 4 times slower. I will have a look, but if
it is really so bad then please consider an IOCTL or syscall. Or a special
O_DAX_ERRORS flag ...

Please do not trash all the O_DIRECT users; they are the more important
clients, like DBs and VMs.

Thanks
Boaz

>>
>> [*"less concurrent" because of the queuing done in bdev. Note how
>>   pmem is not even multi-queue, and even if it was it will be much
>>   slower then DAX because of the code depth and all the locks and
>> task
>>   switches done in the block layer. In DAX the final memcpy is done
>> directly
>>   on the user-mode thread]
>>
>> Thanks
>> Boaz
>>
> 



Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-02 Thread Boaz Harrosh
On 04/29/2016 12:16 AM, Vishal Verma wrote:
> All IO in a dax filesystem used to go through dax_do_io, which cannot
> handle media errors, and thus cannot provide a recovery path that can
> send a write through the driver to clear errors.
> 
> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
> path for DAX filesystems, use the same direct_IO path for both DAX and
> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
> direct_IO path instead of DAX.
> 

Really? What is your thinking here?

What about all the current users of O_DIRECT? You have just made them
4 times slower and "less concurrent"* than "buffered io" users, since
the direct_IO path will queue an IO request and all.
(And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])

I hate that you overload the semantics of the known and expected
O_DIRECT flag for special pmem quirks. This is an incompatible
and unrelated overload of the semantics of O_DIRECT.

> This allows us a recovery path in the form of opening the file with
> O_DIRECT and writing to it with the usual O_DIRECT semantics (sector
> alignment restrictions).
> 

I understand that you want sector-aligned IO, right? For the
clearing of errors. But I hate that you forced all O_DIRECT IO
to be slow for this.
Can you not make dax_do_io handle media errors? At least for the
parts of the IO that are aligned.
(And your recovery-path application above can use only aligned
 IO to make sure.)

Please look for another solution. Even a special IOCTL_DAX_CLEAR_ERROR.

[*"less concurrent" because of the queuing done in bdev. Note how
  pmem is not even multi-queue, and even if it was it will be much
  slower then DAX because of the code depth and all the locks and task
  switches done in the block layer. In DAX the final memcpy is done directly
  on the user-mode thread]

Thanks
Boaz

> Cc: Matthew Wilcox 
> Cc: Dan Williams 
> Cc: Ross Zwisler 
> Cc: Jeff Moyer 
> Cc: Christoph Hellwig 
> Cc: Dave Chinner 
> Cc: Jan Kara 
> Cc: Jens Axboe 
> Cc: Al Viro 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/block/loop.c |  2 +-
>  fs/block_dev.c   | 17 +
>  fs/ext2/inode.c  | 16 
>  fs/ext4/file.c   |  2 +-
>  fs/ext4/inode.c  | 19 +--
>  fs/xfs/xfs_aops.c| 20 +---
>  fs/xfs/xfs_file.c|  4 ++--
>  include/linux/fs.h   | 15 ---
>  mm/filemap.c |  4 ++--
>  9 files changed, 69 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 80cf8ad..c0a24c3 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -568,7 +568,7 @@ struct switch_request {
>  
>  static inline void loop_update_dio(struct loop_device *lo)
>  {
> - __loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
> + __loop_update_dio(lo, (lo->lo_backing_file->f_flags & O_DIRECT) |
>   lo->use_dio);
>  }
>  
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 79defba..97a1f5f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -167,12 +167,21 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, loff_t offset)
>   struct file *file = iocb->ki_filp;
>   struct inode *inode = bdev_file_inode(file);
>  
> - if (IS_DAX(inode))
> + if (iocb_is_direct(iocb))
> + return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
> + offset, blkdev_get_block, NULL,
> + NULL, DIO_SKIP_DIO_COUNT);
> + else if (iocb_is_dax(iocb))
>   return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
>   NULL, DIO_SKIP_DIO_COUNT);
> - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> - blkdev_get_block, NULL, NULL,
> - DIO_SKIP_DIO_COUNT);
> + else {
> + /*
> +  * If we're in the direct_IO path, either the IOCB_DIRECT or
> +  * IOCB_DAX flags must be set.
> +  */
> + WARN_ONCE(1, "Kernel Bug with iocb flags\n");
> + return -ENXIO;
> + }
>  }
>  
>  int __sync_blockdev(struct block_device *bdev, int wait)
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 35f2b0bf..45f2b51 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -861,12 +861,20 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, loff_t offset)
>   size_t count = iov_iter_count(iter);
>   ssize_t ret;
>  
> - if (IS_DAX(inode))
> - ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
> - DIO_LOCKING);
> - else
> + if (iocb_is_direct(iocb))
>   ret = blockdev_direct_IO(iocb, inode, iter, offset,
>    ext2_get_block);

Re: [PATCH 1/3] ext4: Add alignment check for DAX mount

2016-05-02 Thread Boaz Harrosh
On 05/01/2016 08:31 PM, Christoph Hellwig wrote:
> On Fri, Apr 29, 2016 at 02:39:33PM -0600, Toshi Kani wrote:
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 304c712..90a8670 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -3421,6 +3421,12 @@ static int ext4_fill_super(struct super_block *sb, 
>> void *data, int silent)
>>  "error: unsupported blocksize for dax");
>>  goto failed_mount;
>>  }
>> +if (sb->s_bdev->bd_part->start_sect % (PAGE_SIZE / 512) ||
>> +sb->s_bdev->bd_part->nr_sects % (PAGE_SIZE / 512)) {
>> +ext4_msg(sb, KERN_ERR,
>> +"error: unaligned partition for dax");
>> +goto failed_mount;
>> +}
>>  if (!sb->s_bdev->bd_disk->fops->direct_access) {
>>  ext4_msg(sb, KERN_ERR,
>>  "error: device does not support dax");
> 
> Factor your new checks and the ->direct_access into a new helper. It
> should take the block device as file systems might have multiple
> underlying devices.

This already exists as part of bdev_direct_access().

All the code needs to do is call bdev_direct_access() on sector ZERO and
see if it is successful; if the alignment is broken it will fail. (See the
sketch below.)

In fact the FS does not even need the second
if (!...bd_disk->fops->direct_access)
check either, because it is also done inside bdev_direct_access(). The only
check needed is the call to bdev_direct_access(); the generic layer already
does all the needed checks.
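
I.e. roughly this, untested, using the current blk_dax_ctl calling
convention of bdev_direct_access(); the helper name is made up:

	static bool sb_dax_supported(struct super_block *sb)
	{
		struct blk_dax_ctl dax = {
			.sector = 0,
			.size = PAGE_SIZE,
		};

		/* returns -EINVAL on a misaligned partition and
		 * -EOPNOTSUPP when there is no ->direct_access at all
		 */
		return bdev_direct_access(sb->s_bdev, &dax) >= 0;
	}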

Cheers
Boaz



Re: [PATCH 1/3] ext4: Add alignment check for DAX mount

2016-05-01 Thread Boaz Harrosh
On 04/29/2016 11:39 PM, Toshi Kani wrote:
> When a partition is not aligned by 4KB, mount -o dax succeeds,
> but any read/write access to the filesystem fails, except for
> metadata update.
> 
> Add alignment check to ext4_fill_super() when -o dax is specified.
> 
> Reported-by: Micah Parrish 
> Signed-off-by: Toshi Kani 
> Cc: "Theodore Ts'o" 
> Cc: Andreas Dilger 
> Cc: Jan Kara 
> Cc: Dan Williams 
> Cc: Ross Zwisler 
> ---
>  fs/ext4/super.c |6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 304c712..90a8670 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3421,6 +3421,12 @@ static int ext4_fill_super(struct super_block *sb, 
> void *data, int silent)
>   "error: unsupported blocksize for dax");
>   goto failed_mount;
>   }
> + if (sb->s_bdev->bd_part->start_sect % (PAGE_SIZE / 512) ||
> + sb->s_bdev->bd_part->nr_sects % (PAGE_SIZE / 512)) {

Can you please not do it like this? For me this is a layering violation;
only the device should know what its limits are.

I would prefer that you just try bdev_direct_access() on sector 0 and, if
it fails, fail here. This way you let the device decide what it needs
and if/how to support unaligned partitions.
(For example, it could do so by shrinking its size.)

Thanks
Boaz

> + ext4_msg(sb, KERN_ERR,
> + "error: unaligned partition for dax");
> + goto failed_mount;
> + }
>   if (!sb->s_bdev->bd_disk->fops->direct_access) {
>   ext4_msg(sb, KERN_ERR,
>   "error: device does not support dax");
> 



Re: [PATCH 31/71] exofs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

2016-03-21 Thread Boaz Harrosh
On 03/20/2016 08:40 PM, Kirill A. Shutemov wrote:
> PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago
> with promise that one day it will be possible to implement page cache with
> bigger chunks than PAGE_SIZE.
> 
> This promise never materialized. And unlikely will.
> 
> We have many places where PAGE_CACHE_SIZE assumed to be equal to
> PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_*
> or PAGE_* constant should be used in a particular case, especially on the
> border between fs and mm.
> 
> Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
> breakage to be doable.
> 
> Let's stop pretending that pages in page cache are special. They are not.
> 
> The changes are pretty straight-forward:
> 
>  - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
> 
>  - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
> 
>  - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
> 
>  - page_cache_get() -> get_page();
> 
>  - page_cache_release() -> put_page();
> 
> Signed-off-by: Kirill A. Shutemov 
> Cc: Boaz Harrosh 

Acked-by: Boaz Harrosh 

Could you please push this through some other maintainer, perhaps
the vfs tree?

Thank you Kirill
Boaz

> Cc: Benny Halevy 
> ---
>  fs/exofs/dir.c   | 30 +++---
>  fs/exofs/inode.c | 34 +-
>  fs/exofs/namei.c |  4 ++--
>  3 files changed, 34 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
> index e5bb2abf77f9..547b93cbea63 100644
> --- a/fs/exofs/dir.c
> +++ b/fs/exofs/dir.c
> @@ -41,16 +41,16 @@ static inline unsigned exofs_chunk_size(struct inode 
> *inode)
>  static inline void exofs_put_page(struct page *page)
>  {
>   kunmap(page);
> - page_cache_release(page);
> + put_page(page);
>  }
>  
>  static unsigned exofs_last_byte(struct inode *inode, unsigned long page_nr)
>  {
>   loff_t last_byte = inode->i_size;
>  
> - last_byte -= page_nr << PAGE_CACHE_SHIFT;
> - if (last_byte > PAGE_CACHE_SIZE)
> - last_byte = PAGE_CACHE_SIZE;
> + last_byte -= page_nr << PAGE_SHIFT;
> + if (last_byte > PAGE_SIZE)
> + last_byte = PAGE_SIZE;
>   return last_byte;
>  }
>  
> @@ -85,13 +85,13 @@ static void exofs_check_page(struct page *page)
>   unsigned chunk_size = exofs_chunk_size(dir);
>   char *kaddr = page_address(page);
>   unsigned offs, rec_len;
> - unsigned limit = PAGE_CACHE_SIZE;
> + unsigned limit = PAGE_SIZE;
>   struct exofs_dir_entry *p;
>   char *error;
>  
>   /* if the page is the last one in the directory */
> - if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) {
> - limit = dir->i_size & ~PAGE_CACHE_MASK;
> + if ((dir->i_size >> PAGE_SHIFT) == page->index) {
> + limit = dir->i_size & ~PAGE_MASK;
>   if (limit & (chunk_size - 1))
>   goto Ebadsize;
>   if (!limit)
> @@ -138,7 +138,7 @@ bad_entry:
>   EXOFS_ERR(
>   "ERROR [exofs_check_page]: bad entry in directory(0x%lx): %s - "
>   "offset=%lu, inode=0x%llu, rec_len=%d, name_len=%d\n",
> - dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs,
> + dir->i_ino, error, (page->index<<PAGE_SHIFT)+offs,
>   _LLU(le64_to_cpu(p->inode_no)),
>   rec_len, p->name_len);
>   goto fail;
> @@ -147,7 +147,7 @@ Eend:
>   EXOFS_ERR("ERROR [exofs_check_page]: "
>   "entry in directory(0x%lx) spans the page boundary"
>   "offset=%lu, inode=0x%llx\n",
> - dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
> + dir->i_ino, (page->index<<PAGE_SHIFT)+offs,
>   _LLU(le64_to_cpu(p->inode_no)));
>  fail:
>   SetPageChecked(page);
> @@ -237,8 +237,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
>  {
>   loff_t pos = ctx->pos;
>   struct inode *inode = file_inode(file);
> - unsigned int offset = pos & ~PAGE_CACHE_MASK;
> - unsigned long n = pos >> PAGE_CACHE_SHIFT;
> + unsigned int offset = pos & ~PAGE_MASK;
> + unsigned long n = pos >> PAGE_SHIFT;
>   unsigned long npages = dir_pages(inode);
>   unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
>   int need_revalidate = (file->f_version != inode->i_version);
> @@ -254,7 +254,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
>   if (IS_ERR(page)) {
>   EXOFS_ERR("ERROR: bad page in directory(0x%lx)\n",
>   inode->i_ino);

Re: [PATCH] pmem: don't allocate unused major device number

2016-03-20 Thread Boaz Harrosh
On 03/09/2016 12:21 AM, NeilBrown wrote:
> 
> When alloc_disk(0) or alloc_disk-node(0, XX) is used, the ->major
> number is completely ignored:  all devices are allocated with a
> major of BLOCK_EXT_MAJOR.
> 
> So there is no point allocating pmem_major.
> 
> Signed-off-by: NeilBrown 

Tested-by: Boaz Harrosh 

Yes, I sent in the exact same patch several times. This is not the
only driver that has this "wasted code", BTW.

I hope it will finally be removed. Thanks Neil
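
[For the record, the mechanics Neil describes, roughly as they read in
block/genhd.c -- quoted from memory, so check the tree. When the disk is
allocated with alloc_disk(0), the devt comes from the extended range and
the driver's own major never enters the picture:

	/* blk_alloc_devt(), abridged */
	if (part->partno < disk->minors) {
		*devt = MKDEV(disk->major, disk->first_minor + part->partno);
		return 0;
	}
	/* disk->minors == 0: always take the on-demand path */
	*devt = MKDEV(BLOCK_EXT_MAJOR, blk_mangle_minor(idx));
]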
Boaz

> ---
>  drivers/nvdimm/pmem.c | 19 +--
>  1 file changed, 1 insertion(+), 18 deletions(-)
> 
> Hi Dan et al,
>  I was recently educating myself about the behavior of alloc_disk(0).
>  As I understand it, the ->major is ignored and all device numbers for all
>  partitions (including '0') are allocated on demand with major number of
>  BLOCK_EXT_MAJOR.
> 
>  So I was a little surprised to find that pmem.c allocated a major
>  number which is never used - historical anomaly I suspect.
>  I was a bit more surprised at the comment in:
> 
>   Commit: 9f53f9fa4ad1 ("libnvdimm, pmem: add libnvdimm support to the pmem 
> driver")
> 
>  "The minor numbers are also more predictable by passing 0 to alloc_disk()."
> 
>  How can they possibly be more predictable given that they are allocated
>  on-demand?  Maybe discovery order is very predictable???
> 
>  In any case, I propose this patch but cannot test it (beyond compiling)
>  as I don't have relevant hardware.  And maybe some user-space code greps
>  /proc/devices for "pmem" to determine if "pmem" is compiled in (though
>  I sincerely hope not).
>  So I cannot be certain that this patch won't break anything, but am
>  hoping that if you like it you might test it.
> 
>  If it does prove acceptable, then similar changes would be appropriate
>  for btt.c and blk.c.   And drivers/memstick/core/ms_block.c and
>  drivers/nvme/host/core.c. (gotta stamp out this cargo cult)
> 
>  drivers/lightnvm/core.c is the only driver which uses alloc_disk(0) and
>  doesn't provide a 'major' number. :-(
> 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 8d0b54670184..ec7e9e6a768e 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -47,8 +47,6 @@ struct pmem_device {
>   struct badblocksbb;
>  };
>  
> -static int pmem_major;
> -
>  static bool is_bad_pmem(struct badblocks *bb, sector_t sector, unsigned int 
> len)
>  {
>   if (bb->count) {
> @@ -228,8 +226,6 @@ static int pmem_attach_disk(struct device *dev,
>   return -ENOMEM;
>   }
>  
> - disk->major = pmem_major;
> - disk->first_minor   = 0;
>   disk->fops  = &pmem_fops;
>   disk->private_data  = pmem;
>   disk->queue = pmem->pmem_queue;
> @@ -502,26 +498,13 @@ static struct nd_device_driver nd_pmem_driver = {
>  
>  static int __init pmem_init(void)
>  {
> - int error;
> -
> - pmem_major = register_blkdev(0, "pmem");
> - if (pmem_major < 0)
> - return pmem_major;
> -
> - error = nd_driver_register(&nd_pmem_driver);
> - if (error) {
> - unregister_blkdev(pmem_major, "pmem");
> - return error;
> - }
> -
> - return 0;
> + return nd_driver_register(&nd_pmem_driver);
>  }
>  module_init(pmem_init);
>  
>  static void pmem_exit(void)
>  {
>   driver_unregister(&nd_pmem_driver.drv);
> - unregister_blkdev(pmem_major, "pmem");
>  }
>  module_exit(pmem_exit);
>  
> 



Re: [PATCH v2] osd: remove deadcode

2016-02-24 Thread Boaz Harrosh
On 02/24/2016 01:21 PM, Sudip Mukherjee wrote:
> The variable is_ver1 is always true and so OSD_CAP_LEN can never be
> used.
> Reported by Coverity.
> 
> Signed-off-by: Sudip Mukherjee 

Acked-by: Boaz Harrosh 

Thanks
> ---
> 
> v2: Joe Perches asked to mention the tool used in the commit log.
> 
>  drivers/scsi/osd/osd_initiator.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/osd/osd_initiator.c 
> b/drivers/scsi/osd/osd_initiator.c
> index d8a2b51..3b11aad 100644
> --- a/drivers/scsi/osd/osd_initiator.c
> +++ b/drivers/scsi/osd/osd_initiator.c
> @@ -2006,9 +2006,8 @@ EXPORT_SYMBOL(osd_sec_init_nosec_doall_caps);
>   */
>  void osd_set_caps(struct osd_cdb *cdb, const void *caps)
>  {
> - bool is_ver1 = true;
>   /* NOTE: They start at same address */
> - memcpy(&cdb->v1.caps, caps, is_ver1 ? OSDv1_CAP_LEN : OSD_CAP_LEN);
> + memcpy(&cdb->v1.caps, caps, OSDv1_CAP_LEN);
>  }
>  
>  bool osd_is_sec_alldata(struct osd_security_parameters *sec_parms __unused)
> 



Re: Another proposal for DAX fault locking

2016-02-14 Thread Boaz Harrosh
On 02/11/2016 12:38 PM, Jan Kara wrote:
> On Wed 10-02-16 19:38:21, Boaz Harrosh wrote:
>> On 02/09/2016 07:24 PM, Jan Kara wrote:
>>> Hello,
>>>
<>
>>>
>>> DAX will have an array of mutexes (the array can be made per device but
>>> initially a global one should be OK). We will use mutexes in the array as a
>>> replacement for page lock - we will use hashfn(mapping, index) to get
>>> particular mutex protecting our offset in the mapping. On fault / page
>>> mkwrite, we'll grab the mutex similarly to page lock and release it once we
>>> are done updating page tables. This deals with races in [1]. When flushing
>>> caches we grab the mutex before clearing writeable bit in page tables
>>> and clearing dirty bit in the radix tree and drop it after we have flushed
>>> caches for the pfn. This deals with races in [2].
>>>
>>> Thoughts?
>>>
>>
>> You could also use one of the radix-tree's special-bits as a bit lock.
>> So no need for any extra allocations.
> 
> Yes and I've suggested that once as well. But since we need sleeping
> locks, you need some wait queues somewhere as well. So some allocations are
> going to be needed anyway. 

They are already sleeping locks and there are all the proper "wait queues"
in place. I'm talking about:

	lock:
		err = wait_on_bit_lock(&some_long, SOME_BIT_LOCK, ...);
	unlock:
		WARN_ON(!test_and_clear_bit(SOME_BIT_LOCK, &some_long));
		wake_up_bit(&some_long, SOME_BIT_LOCK);

> And mutexes have much better properties than

Just saying that page locks are implemented exactly this way these days,
so it has the performance and characteristics we already know.
(You are replacing page locks, no?)

> bit-locks so I prefer mutexes over cramming bit locks into radix tree. Plus
> you'd have to be careful so that someone doesn't remove the bit from the
> radix tree while you are working with it.
> 

Sure! "Need to be careful" is our middle name.

That said, it is your call. Thank you for working on this. Your plan sounds
very good as well, and is very much needed for DAX's mmap performance
right now.
[Maybe one small enhancement: allocate an array of mutexes per NUMA node
 and access the proper array through numa_node_id(). A minimal sketch of
 the hashed-lock idea follows.]
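
A minimal sketch of the proposal as I read it; the table size and the hash
are placeholders, and the mutexes would be mutex_init()ed at boot:

	#define DAX_LOCK_BITS	8
	static struct mutex dax_lock_tbl[1 << DAX_LOCK_BITS];

	static struct mutex *dax_lock(struct address_space *mapping,
				      pgoff_t index)
	{
		unsigned long h = hash_long((unsigned long)mapping ^ index,
					    DAX_LOCK_BITS);
		return &dax_lock_tbl[h];
	}

	/* fault / mkwrite and the cache-flush path both do: */
	mutex_lock(dax_lock(mapping, index));
	/* ... update PTEs / radix-tree dirty tag, flush caches ... */
	mutex_unlock(dax_lock(mapping, index));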

>   Honza
> 

Thanks
Boaz



Re: Another proposal for DAX fault locking

2016-02-10 Thread Boaz Harrosh
On 02/09/2016 07:24 PM, Jan Kara wrote:
> Hello,
> 
> I was thinking about current issues with DAX fault locking [1] (data
> corruption due to racing faults allocating blocks) and also races which
> currently don't allow us to clear dirty tags in the radix tree due to races
> between faults and cache flushing [2]. Both of these exist because we don't
> have an equivalent of page lock available for DAX. While we have a
> reasonable solution available for problem [1], so far I'm not aware of a
> decent solution for [2]. After briefly discussing the issue with Mel he had
> a bright idea that we could used hashed locks to deal with [2] (and I think
> we can solve [1] with them as well). So my proposal looks as follows:
> 
> DAX will have an array of mutexes (the array can be made per device but
> initially a global one should be OK). We will use mutexes in the array as a
> replacement for page lock - we will use hashfn(mapping, index) to get
> particular mutex protecting our offset in the mapping. On fault / page
> mkwrite, we'll grab the mutex similarly to page lock and release it once we
> are done updating page tables. This deals with races in [1]. When flushing
> caches we grab the mutex before clearing writeable bit in page tables
> and clearing dirty bit in the radix tree and drop it after we have flushed
> caches for the pfn. This deals with races in [2].
> 
> Thoughts?
> 

You could also use one of the radix-tree's special-bits as a bit lock.
So no need for any extra allocations.

[latest page-lock is a bit-lock so performance is the same]

Thanks
Boaz




Re: [PATCH] [SCSI] osd: fix signed char versus %02x issue

2015-12-08 Thread Boaz Harrosh
On 12/08/2015 04:25 PM, Rasmus Villemoes wrote:
> If char is signed and one of these bytes happens to have a value
> outside the ascii range, the corresponding output will consist of
> "ffffff" followed by the two hex chars that were actually
> intended. One way to fix it would be to change the casts to (u8*) aka
> (unsigned char*), but it is much simpler (and generates smaller code)
> to use the %ph extension which was created for such short hexdumps.
> 

Ha, real cool. Thanks, I hated that crap.

Acked-by: Boaz Harrosh 
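
[For anyone grepping the archive later -- what the %ph family does, with
illustrative output in the comments:

	u8 clock[6] = { 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc };

	pr_info("CLOCK [0x%6phN]\n", clock); /* "CLOCK [0x123456789abc]" */
	pr_info("CLOCK [%6ph]\n", clock);    /* "CLOCK [12 34 56 78 9a bc]" */
]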

> Signed-off-by: Rasmus Villemoes 
> ---
>  drivers/scsi/osd/osd_initiator.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/osd/osd_initiator.c 
> b/drivers/scsi/osd/osd_initiator.c
> index 0cccd6033feb..d8a2b5185f56 100644
> --- a/drivers/scsi/osd/osd_initiator.c
> +++ b/drivers/scsi/osd/osd_initiator.c
> @@ -170,10 +170,7 @@ static int _osd_get_print_system_info(struct osd_dev *od,
>  
>   /* FIXME: Where are the time utilities */
>   pFirst = get_attrs[a++].val_ptr;
> - OSD_INFO("CLOCK  [0x%02x%02x%02x%02x%02x%02x]\n",
> - ((char *)pFirst)[0], ((char *)pFirst)[1],
> - ((char *)pFirst)[2], ((char *)pFirst)[3],
> - ((char *)pFirst)[4], ((char *)pFirst)[5]);
> + OSD_INFO("CLOCK  [0x%6phN]\n", pFirst);
>  
>   if (a < nelem) { /* IBM-OSD-SIM bug, Might not have it */
>   unsigned len = get_attrs[a].len;
> 



Re: [PATCH] mm, dax: fix DAX deadlocks (COW fault)

2015-11-17 Thread Boaz Harrosh
On 11/16/2015 08:34 PM, Ross Zwisler wrote:
> On Mon, Nov 16, 2015 at 10:15:56AM -0800, Dan Williams wrote:
>> On Mon, Nov 16, 2015 at 4:09 AM, Yigal Korman  wrote:
>>> DAX handling of COW faults has wrong locking sequence:
>>> dax_fault does i_mmap_lock_read
>>> do_cow_fault does i_mmap_unlock_write
>>>
>>> Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
>>> commit[3].
>>>
>>> Original COW locking logic was introduced by Matthew here[4].
>>>
>>> This should be applied to v4.3 as well.
>>>
>>> [1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
>>> [2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
>>> [3] 843172978bb9 dax: fix race between simultaneous faults
>>> [4] 2e4cdab0584f mm: allow page fault handlers to perform the COW
>>>
>>> Signed-off-by: Yigal Korman 
>>>
>>> Cc: Stable Tree 
>>> Cc: Boaz Harrosh 
>>> Cc: Ross Zwisler 
>>> Cc: Alexander Viro 
>>> Cc: Dan Williams 
>>> Cc: Dave Chinner 
>>> Cc: Jan Kara 
>>> Cc: "Kirill A. Shutemov" 
>>> Cc: Matthew Wilcox 
>>> ---
>>>  mm/memory.c | 8 
>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index c716913..e5071af 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3015,9 +3015,9 @@ static int do_cow_fault(struct mm_struct *mm, struct 
>>> vm_area_struct *vma,
>>> } else {
>>> /*
>>>  * The fault handler has no page to lock, so it 
>>> holds
>>> -* i_mmap_lock for write to protect against 
>>> truncate.
>>> +* i_mmap_lock for read to protect against truncate.
>>>  */
>>> -   i_mmap_unlock_write(vma->vm_file->f_mapping);
>>> +   i_mmap_unlock_read(vma->vm_file->f_mapping);
>>> }
>>> goto uncharge_out;
>>> }
>>> @@ -3031,9 +3031,9 @@ static int do_cow_fault(struct mm_struct *mm, struct 
>>> vm_area_struct *vma,
>>> } else {
>>> /*
>>>  * The fault handler has no page to lock, so it holds
>>> -* i_mmap_lock for write to protect against truncate.
>>> +* i_mmap_lock for read to protect against truncate.
>>>  */
>>> -   i_mmap_unlock_write(vma->vm_file->f_mapping);
>>> +   i_mmap_unlock_read(vma->vm_file->f_mapping);
>>> }
>>> return ret;
>>>  uncharge_out:
>>
>> Looks good to me.  I'll include this with some other DAX fixes I have 
>> pending.
> 
> Looks good to me as well.  Thanks for catching this.
> 

Yes. None of the xfstests catch this. It needs a private-mapping mmap in
combination with other activity on the file at the same time, which the gcc
linker does. We have a test that git-clones a kernel tree and runs make,
which catches this in the make phase. For some reason on ext4 it crashes
reliably, but on xfs half of the runs go through; go figure.

Thanks Yigal for the fast fix
Boaz



Re: [PATCH] sound: depend on ZONE_DMA

2015-11-16 Thread Boaz Harrosh
On 11/16/2015 11:28 AM, Takashi Iwai wrote:
<>
>>
>> Hi Greg
>>
>> Please include the mainline patch:
>>  [2db1a57] ALSA: pci: depend on ZONE_DMA (by Dan Williams)
>>
>> To the stable tree for v4.3.X Kernel.
>>
>> This patch is needed for proper operation of the 4.3 pmem.ko driver. Long
>> story, but without this patch persistent memory will not be able to work
>> with the new 4.3 page-struct support, which is needed if we want to do
>> RDMA and/or IO directly to persistent memory.
>> [This is to do with the new ZONE_DEVICE and having too many zones if
>> ZONE_DMA is enabled.]
> 
> Well, it's not exactly true: you just need to deselect some drivers to
> allow CONFIG_ZONE_DMA to be disabled.  It's merely another side of
> coin.
> 
> I don't mind including this to 4.3, though, but just want to correct
> the statement.
> 

This is true. Sorry for not being clear. We already have a script that we
give to clients to enable ZONE_DEVICE on 4.3-based systems. It is currently
simple, based on this patch. Without this patch it would need to be more
complicated.

Thanks Takashi
Boaz

> 
> Takashi
> 



Re: [PATCH] sound: depend on ZONE_DMA

2015-11-16 Thread Boaz Harrosh
On 11/16/2015 09:40 AM, Takashi Iwai wrote:
> On Sun, 15 Nov 2015 11:53:11 +0100,
> Boaz Harrosh wrote:
>>
>> On 11/12/2015 10:38 PM, Takashi Iwai wrote:
>>> On Thu, 12 Nov 2015 21:13:57 +0100,
>>> Dan Williams wrote:
>>>>
>>>> There are several sound drivers that 'select ZONE_DMA'.  This is
>>>> backwards as ZONE_DMA is an architecture capability exported to drivers.
>>>> Switch the polarity of the dependency to disable these drivers when the
>>>> architecture does not support ZONE_DMA.  This was discovered in the
>>>> context of testing/enabling devm_memremap_pages() which depends on
>>>> ZONE_DEVICE.  ZONE_DEVICE in turn depends on !ZONE_DMA.
>>>
>>> Makes sense.  I applied it now, thanks.
>>>
>>
>> Please add:
>> CC: Stable Tree 
> 
> Sorry, too late, already merged.
> 
> 
> Takashi
> 

Hi Greg

Please include the mainline patch:
[2db1a57] ALSA: pci: depend on ZONE_DMA (by Dan Williams)

To the stable tree for v4.3.X Kernel.

This patch is needed for proper operation of the 4.3 pmem.ko driver. Long
story, but without this patch persistent memory will not be able to work
with the new 4.3 page-struct support, which is needed if we want to do RDMA
and/or IO directly to persistent memory.
[This is to do with the new ZONE_DEVICE and having too many zones if
ZONE_DMA is enabled; see the Kconfig fragment below.]
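
[The dependency chain, roughly as it reads in the 4.3 mm/Kconfig -- quoted
from memory, so double-check the tree:

	config ZONE_DEVICE
		bool "Device memory (pmem, etc...) hotplug support"
		depends on !ZONE_DMA
		depends on MEMORY_HOTPLUG
		depends on MEMORY_HOTREMOVE
		depends on X86_64

So any driver that still does "select ZONE_DMA" forces ZONE_DEVICE off for
the whole config.]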

Thanks
Boaz



Re: [PATCH] sound: depend on ZONE_DMA

2015-11-15 Thread Boaz Harrosh
On 11/12/2015 10:38 PM, Takashi Iwai wrote:
> On Thu, 12 Nov 2015 21:13:57 +0100,
> Dan Williams wrote:
>>
>> There are several sound drivers that 'select ZONE_DMA'.  This is
>> backwards as ZONE_DMA is an architecture capability exported to drivers.
>> Switch the polarity of the dependency to disable these drivers when the
>> architecture does not support ZONE_DMA.  This was discovered in the
>> context of testing/enabling devm_memremap_pages() which depends on
>> ZONE_DEVICE.  ZONE_DEVICE in turn depends on !ZONE_DMA.
> 
> Makes sense.  I applied it now, thanks.
> 

Please add:
CC: Stable Tree 

Thanks
Boaz

> 
> Takashi
> 
>>
>> Cc: Jaroslav Kysela 
>> Cc: Takashi Iwai 
>> Cc: 
>> Reported-by: Jeff Moyer 
>> Signed-off-by: Dan Williams 
>> ---
>>  sound/pci/Kconfig |   24 
>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/sound/pci/Kconfig b/sound/pci/Kconfig
>> index edfc1b8d553e..656ce39bddbc 100644
>> --- a/sound/pci/Kconfig
>> +++ b/sound/pci/Kconfig
>> @@ -25,7 +25,7 @@ config SND_ALS300
>>  select SND_PCM
>>  select SND_AC97_CODEC
>>  select SND_OPL3_LIB
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say 'Y' or 'M' to include support for Avance Logic ALS300/ALS300+
>>  
>> @@ -50,7 +50,7 @@ config SND_ALI5451
>>  tristate "ALi M5451 PCI Audio Controller"
>>  select SND_MPU401_UART
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for the integrated AC97 sound
>>device on motherboards using the ALi M5451 Audio Controller
>> @@ -155,7 +155,7 @@ config SND_AZT3328
>>  select SND_PCM
>>  select SND_RAWMIDI
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for Aztech AZF3328 (PCI168)
>>soundcards.
>> @@ -463,7 +463,7 @@ config SND_EMU10K1
>>  select SND_HWDEP
>>  select SND_RAWMIDI
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y to include support for Sound Blaster PCI 512, Live!,
>>Audigy and E-mu APS (partially supported) soundcards.
>> @@ -479,7 +479,7 @@ config SND_EMU10K1X
>>  tristate "Emu10k1X (Dell OEM Version)"
>>  select SND_AC97_CODEC
>>  select SND_RAWMIDI
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for the Dell OEM version of the
>>Sound Blaster Live!.
>> @@ -513,7 +513,7 @@ config SND_ES1938
>>  select SND_OPL3_LIB
>>  select SND_MPU401_UART
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for soundcards based on ESS Solo-1
>>(ES1938, ES1946, ES1969) chips.
>> @@ -525,7 +525,7 @@ config SND_ES1968
>>  tristate "ESS ES1968/1978 (Maestro-1/2/2E)"
>>  select SND_MPU401_UART
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for soundcards based on ESS Maestro
>>1/2/2E chips.
>> @@ -612,7 +612,7 @@ config SND_ICE1712
>>  select SND_MPU401_UART
>>  select SND_AC97_CODEC
>>  select BITREVERSE
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for soundcards based on the
>>ICE1712 (Envy24) chip.
>> @@ -700,7 +700,7 @@ config SND_LX6464ES
>>  config SND_MAESTRO3
>>  tristate "ESS Allegro/Maestro3"
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for soundcards based on ESS Maestro 3
>>(Allegro) chips.
>> @@ -806,7 +806,7 @@ config SND_SIS7019
>>  tristate "SiS 7019 Audio Accelerator"
>>  depends on X86_32
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for the SiS 7019 Audio Accelerator.
>>  
>> @@ -818,7 +818,7 @@ config SND_SONICVIBES
>>  select SND_OPL3_LIB
>>  select SND_MPU401_UART
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for soundcards based on the S3
>>SonicVibes chip.
>> @@ -830,7 +830,7 @@ config SND_TRIDENT
>>  tristate "Trident 4D-Wave DX/NX; SiS 7018"
>>  select SND_MPU401_UART
>>  select SND_AC97_CODEC
>> -select ZONE_DMA
>> +depends on ZONE_DMA
>>  help
>>Say Y here to include support for soundcards based on Trident
>>4D-Wave DX/NX or SiS 7018 chips.
>>
>>

Re: [PATCH] sound: depend on ZONE_DMA

2015-11-15 Thread Boaz Harrosh
On 11/12/2015 10:13 PM, Dan Williams wrote:
> There are several sound drivers that 'select ZONE_DMA'.  This is
> backwards as ZONE_DMA is an architecture capability exported to drivers.
> Switch the polarity of the dependency to disable these drivers when the
> architecture does not support ZONE_DMA.  This was discovered in the
> context of testing/enabling devm_memremap_pages() which depends on
> ZONE_DEVICE.  ZONE_DEVICE in turn depends on !ZONE_DMA.
> 
> Cc: Jaroslav Kysela 
> Cc: Takashi Iwai 
> Cc: 
> Reported-by: Jeff Moyer 
> Signed-off-by: Dan Williams 

Yes sorry about that. I had the exact patch in my original 4.3-rc1
tree but forgot to send it.

Reviewed-by: Boaz Harrosh 

You will need stable@ for 4.3 as well.

Thanks
Boaz

> ---
>  sound/pci/Kconfig |   24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/sound/pci/Kconfig b/sound/pci/Kconfig
> index edfc1b8d553e..656ce39bddbc 100644
> --- a/sound/pci/Kconfig
> +++ b/sound/pci/Kconfig
> @@ -25,7 +25,7 @@ config SND_ALS300
>   select SND_PCM
>   select SND_AC97_CODEC
>   select SND_OPL3_LIB
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say 'Y' or 'M' to include support for Avance Logic ALS300/ALS300+
>  
> @@ -50,7 +50,7 @@ config SND_ALI5451
>   tristate "ALi M5451 PCI Audio Controller"
>   select SND_MPU401_UART
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for the integrated AC97 sound
> device on motherboards using the ALi M5451 Audio Controller
> @@ -155,7 +155,7 @@ config SND_AZT3328
>   select SND_PCM
>   select SND_RAWMIDI
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for Aztech AZF3328 (PCI168)
> soundcards.
> @@ -463,7 +463,7 @@ config SND_EMU10K1
>   select SND_HWDEP
>   select SND_RAWMIDI
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y to include support for Sound Blaster PCI 512, Live!,
> Audigy and E-mu APS (partially supported) soundcards.
> @@ -479,7 +479,7 @@ config SND_EMU10K1X
>   tristate "Emu10k1X (Dell OEM Version)"
>   select SND_AC97_CODEC
>   select SND_RAWMIDI
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for the Dell OEM version of the
> Sound Blaster Live!.
> @@ -513,7 +513,7 @@ config SND_ES1938
>   select SND_OPL3_LIB
>   select SND_MPU401_UART
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for soundcards based on ESS Solo-1
> (ES1938, ES1946, ES1969) chips.
> @@ -525,7 +525,7 @@ config SND_ES1968
>   tristate "ESS ES1968/1978 (Maestro-1/2/2E)"
>   select SND_MPU401_UART
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for soundcards based on ESS Maestro
> 1/2/2E chips.
> @@ -612,7 +612,7 @@ config SND_ICE1712
>   select SND_MPU401_UART
>   select SND_AC97_CODEC
>   select BITREVERSE
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for soundcards based on the
> ICE1712 (Envy24) chip.
> @@ -700,7 +700,7 @@ config SND_LX6464ES
>  config SND_MAESTRO3
>   tristate "ESS Allegro/Maestro3"
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for soundcards based on ESS Maestro 3
> (Allegro) chips.
> @@ -806,7 +806,7 @@ config SND_SIS7019
>   tristate "SiS 7019 Audio Accelerator"
>   depends on X86_32
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for the SiS 7019 Audio Accelerator.
>  
> @@ -818,7 +818,7 @@ config SND_SONICVIBES
>   select SND_OPL3_LIB
>   select SND_MPU401_UART
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA
>   help
> Say Y here to include support for soundcards based on the S3
> SonicVibes chip.
> @@ -830,7 +830,7 @@ config SND_TRIDENT
>   tristate "Trident 4D-Wave DX/NX; SiS 7018"
>   select SND_MPU401_UART
>   select SND_AC97_CODEC
> - select ZONE_DMA
> + depends on ZONE_DMA

Re: [PATCH] osd fs: __r4w_get_page rely on PageUptodate for uptodate

2015-11-03 Thread Boaz Harrosh
On 11/03/2015 04:49 AM, Hugh Dickins wrote:
> On Mon, 2 Nov 2015, Boaz Harrosh wrote:
>> On 11/02/2015 01:39 AM, Hugh Dickins wrote:
>> <>
>>>> This patch is not correct!
>>>
>>> I think you have actually confirmed that the patch is correct:
>>> why bother to test PageDirty or PageWriteback when PageUptodate
>>> already tells you what you need?
>>>
>>> Or do these filesystems do something unusual with PageUptodate
>>> when PageDirty is set?  I didn't find it.
>>>
>>
>> This is kind of delicate stuff. It took me a while to get it right
>> when I did it. I don't remember all the details.
>>
>> But consider this option:
> 
> Thanks, yes, it helps to have a concrete example in front of us.
> 
>>
>> exofs_write_begin on a full PAGE_CACHE_SIZE, the page is instantiated
>> new in page-cache is that PageUptodate(page) then? I thought not.
> 
> Right, PageUptodate must not be set until the page has been filled with
> the correct data.  Nor is PageDirty or PageWriteback set at this point,
> actually.
> 
> Once page is filled with the correct data, either exofs_write_end()
> (which uses simple_write_end()) or (internally) exofs_commit_chunk()
> is called.
> 
>> (exofs does not set that)
> 
> It's simple_write_end() or exofs_commit_chunk() which SetPageUptodate
> in this case.  And after that each calls set_page_dirty(), which does
> the SetPageDirty, before unlocking the page which was supplied locked
> by exofs_write_begin().
> 
> So I don't see where the page is PageDirty without being PageUptodate.
> 
>>
>> Now that page I do not want to read in. The latest data is in memory.
>> (Same when this page is in writeback, dirty-bit is cleared)
> 
> Understood, but that's what PageUptodate is for.
> 
> (Quite what happens if there's a write error is not so clear: I think
> that typically PageError gets set and PageUptodate cleared, to read
> back in from disk what's actually there - but lose the data we wanted
> to write; but I can understand different filesystems making different
> choices there, and didn't study exofs's choice.)
> 
>>
>> So for sure, if a page is dirty or in writeback then we surely do not need a read.
>> Only if it is neither do we need to consider the PageUptodate(page) state.
> 
> PageUptodate is the proper flag to check, to ask if the page contains
> the correct data: there is no need to consider Dirty or Writeback.
> 
>>
>> Do you think the code is actually wrong as is?
> 
> Not that I know of: just a little too complicated and confusing.
> 
> But becomes slightly wrong if my simplification to page migration
> goes through, since that introduces an instant when PageDirty is set
> before the new page contains the correct data and is marked Uptodate.
> Hence my patch.
> 
>>
>> BTW: Very similar code is in fs/nfs/objlayout/objio_osd.c::__r4w_get_page
> 
> Indeed, the patch makes the same adjustment to that code too.
> 

OK thanks. Let me set up and test your patch. Is on top of 4.3 good?
I'll send you a tested-by once I'm done.

Boaz

>>
>>> Thanks,
>>> Hugh
>>>
>> <>
>>
>> Thanks
>> Boaz
> 



Re: [PATCH] osd fs: __r4w_get_page rely on PageUptodate for uptodate

2015-11-02 Thread Boaz Harrosh
On 11/02/2015 01:39 AM, Hugh Dickins wrote:
<>
>> This patch is not correct!
> 
> I think you have actually confirmed that the patch is correct:
> why bother to test PageDirty or PageWriteback when PageUptodate
> already tells you what you need?
> 
> Or do these filesystems do something unusual with PageUptodate
> when PageDirty is set?  I didn't find it.
> 

This is kind of delicate stuff. It took me a while to get it right
when I did it. I don't remember all the details.

But consider this option:

exofs_write_begin on a full PAGE_CACHE_SIZE, the page is instantiated
new in page-cache is that PageUptodate(page) then? I thought not.
(exofs does not set that)

Now that page I do not want to read in. The latest data is in memory.
(Same when this page is in writeback, dirty-bit is cleared)

So for sure, if a page is dirty or in writeback then we surely do not need a read.
Only if it is neither do we need to consider the PageUptodate(page) state.

Do you think the code is actually wrong as is?

BTW: Very similar code is in fs/nfs/objlayout/objio_osd.c::__r4w_get_page

> Thanks,
> Hugh
> 
<>

Thanks
Boaz



Re: [PATCH] osd fs: __r4w_get_page rely on PageUptodate for uptodate

2015-11-01 Thread Boaz Harrosh
On 10/29/2015 08:43 PM, Hugh Dickins wrote:
> Patch "mm: migrate dirty page without clear_page_dirty_for_io etc",
> presently staged in mmotm and linux-next, simplifies the migration of
> a PageDirty pagecache page: one stat needs moving from zone to zone
> and that's about all.
> 
> It's convenient and safest for it to shift the PageDirty bit from old
> page to new, just before updating the zone stats: before copying data
> and marking the new PageUptodate.  This is all done while both pages
> are isolated and locked, just as before; and just as before, there's
> a moment when the new page is visible in the radix_tree, but not yet
> PageUptodate.  What's new is that it may now be briefly visible as
> PageDirty before it is PageUptodate.
> 
> When I scoured the tree to see if this could cause a problem anywhere,
> the only places I found were in two similar functions __r4w_get_page():
> which look up a page with find_get_page() (not using page lock), then
> claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.
> 
> I'm not sure whether that was right before, but now it might be wrong
> (on rare occasions): only claim the page is uptodate if PageUptodate.
> Or perhaps the page in question could never be migratable anyway?
> 

Hi Sir Hugh

I'm sorry, I admit the code is clear as mud, but your patch below is wrong.

The *uptodate return from __r4w_get_page is not really "up-to-date" at all;
actually it means "do we need to read the page from storage?". Writable/dirty
pages we do not read from storage, but use the newest data in memory.

r4w means read-for-write, which is when we need to bring in the full stripe to
re-calculate raid5/6 (when only a partial stripe is written).
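
For clarity, the decision both __r4w_get_page() implementations make is just
this (a sketch using the same page-flag tests as the existing code; "needs no
read" is what that *uptodate out-parameter really means):

#include <linux/mm.h>

/*
 * Sketch: "uptodate" in __r4w_get_page() really answers "can the RAID
 * engine use this in-memory page as-is, without reading it from
 * storage first?".
 */
static bool r4w_page_needs_no_read(struct page *page)
{
	/* Dirty or in writeback: memory already holds the newest data. */
	if (PageDirty(page) || PageWriteback(page))
		return true;

	/* Otherwise only an uptodate page can be used without a read. */
	return PageUptodate(page);
}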

The scenario below of: "briefly visible as PageDirty before it is PageUptodate"
is fine in this case because in both cases we do not need to read the page.

Thanks for looking
Boaz

> Signed-off-by: Hugh Dickins 

This patch is not correct!

> ---
> 
>  fs/exofs/inode.c |5 +
>  fs/nfs/objlayout/objio_osd.c |5 +
>  2 files changed, 2 insertions(+), 8 deletions(-)
> 
> --- 4.3-next/fs/exofs/inode.c 2015-08-30 11:34:09.0 -0700
> +++ linux/fs/exofs/inode.c2015-10-28 16:55:18.795554294 -0700
> @@ -592,10 +592,7 @@ static struct page *__r4w_get_page(void
>   }
>   unlock_page(page);
>   }
> - if (PageDirty(page) || PageWriteback(page))
> - *uptodate = true;
> - else
> - *uptodate = PageUptodate(page);
> + *uptodate = PageUptodate(page);
>   EXOFS_DBGMSG2("index=0x%lx uptodate=%d\n", index, *uptodate);
>   return page;
>   } else {
> --- 4.3-next/fs/nfs/objlayout/objio_osd.c 2015-10-21 18:35:07.620645439 
> -0700
> +++ linux/fs/nfs/objlayout/objio_osd.c2015-10-28 16:53:55.083686639 
> -0700
> @@ -476,10 +476,7 @@ static struct page *__r4w_get_page(void
>   }
>   unlock_page(page);
>   }
> - if (PageDirty(page) || PageWriteback(page))
> - *uptodate = true;
> - else
> - *uptodate = PageUptodate(page);
> + *uptodate = PageUptodate(page);
>   dprintk("%s: index=0x%lx uptodate=%d\n", __func__, index, *uptodate);
>   return page;
>  }
> 



Re: [PATCH 05/17] fs/exofs: remove unnecessary new_valid_dev check

2015-09-30 Thread Boaz Harrosh
On 09/29/2015 06:46 PM, Yaowei Bai wrote:
> On Tue, Sep 29, 2015 at 04:47:30PM +0300, Boaz Harrosh wrote:
>> On 09/28/2015 04:43 PM, Yaowei Bai wrote:
>>> As new_valid_dev always returns 1, so !new_valid_dev check is not
>>> needed, remove it.
>>>
>>> Signed-off-by: Yaowei Bai 
>>
>> Acked-by: Boaz Harrosh 
> 
> Thanks.
> 
>>
>> Please submit this through some general tree like the vfs or the mm tree
> 
> This is my first fs-specific patch, so I think you mean I should cc the vfs
> or mm tree?
> 

Yes, you should ask Al (or Morton) to take the whole patchset through their
tree. Then once all users of new_valid_dev() are gone it can be removed
in that one tree.

Thanks
Boaz

>>
>> Thanks
>> Boaz
>>
>>> ---
>>>  fs/exofs/namei.c | 3 ---
>>>  1 file changed, 3 deletions(-)
>>>
>>> diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
>>> index 09a6bb1..994e078 100644
>>> --- a/fs/exofs/namei.c
>>> +++ b/fs/exofs/namei.c
>>> @@ -80,9 +80,6 @@ static int exofs_mknod(struct inode *dir, struct dentry 
>>> *dentry, umode_t mode,
>>> struct inode *inode;
>>> int err;
>>>  
>>> -   if (!new_valid_dev(rdev))
>>> -   return -EINVAL;
>>> -
>>> inode = exofs_new_inode(dir, mode);
>>> err = PTR_ERR(inode);
>>> if (!IS_ERR(inode)) {
>>>
> 
> 



Re: [PATCH 05/17] fs/exofs: remove unnecessary new_valid_dev check

2015-09-29 Thread Boaz Harrosh
On 09/28/2015 04:43 PM, Yaowei Bai wrote:
> As new_valid_dev always returns 1, so !new_valid_dev check is not
> needed, remove it.
> 
> Signed-off-by: Yaowei Bai 

Acked-by: Boaz Harrosh 

Please submit this through some general tree like the vfs or the mm tree

Thanks
Boaz

> ---
>  fs/exofs/namei.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
> index 09a6bb1..994e078 100644
> --- a/fs/exofs/namei.c
> +++ b/fs/exofs/namei.c
> @@ -80,9 +80,6 @@ static int exofs_mknod(struct inode *dir, struct dentry 
> *dentry, umode_t mode,
>   struct inode *inode;
>   int err;
>  
> - if (!new_valid_dev(rdev))
> - return -EINVAL;
> -
>   inode = exofs_new_inode(dir, mode);
>   err = PTR_ERR(inode);
>   if (!IS_ERR(inode)) {
> 



Re: [PATCH] dax: fix deadlock in __dax_fault

2015-09-24 Thread Boaz Harrosh
On 09/24/2015 05:52 AM, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
>> Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
>> that was causing us to deadlock:
>>
>> 1) enter __dax_fault()
>> 2) page = find_get_page() gives us a page, so skip
>>  i_mmap_lock_write(mapping)
>> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
>>  passes, enter this block
>> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
>>  i_mmap_unlock_write(mapping);
>>  return dax_load_hole(mapping, page, vmf);
>>
>> This causes us to up_write() a semaphore that we weren't holding.
>>
>> The up_write() on a semaphore we didn't down_write() happens twice in
>> a row, and then the next time we try and i_mmap_lock_write(), we hang.
>>
>> Signed-off-by: Ross Zwisler 
>> Reported-by: Dave Chinner 
>> ---
>>  fs/dax.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 7ae6df7..df1b0ac 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct 
>> vm_fault *vmf,
>>  if (error)
>>  goto unlock;
>>  } else {
>> -i_mmap_unlock_write(mapping);
>> +if (!page)
>> +i_mmap_unlock_write(mapping);
>>  return dax_load_hole(mapping, page, vmf);
>>  }
>>  }
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
>   page = find_get_page(mapping, vmf->pgoff)
>   if (page) {
>   
>   } else {
>   i_mmap_lock_write(mapping);
>   }
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
> 
>   if (vmf->cow_page) {
>   .
>   if (!page) {
>   /* can fall through */
>   }
>   return VM_FAULT_LOCKED;
>   }
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
>   /* Check we didn't race with a read fault installing a new page */
> if (!page && major)
> page = find_lock_page(mapping, vmf->pgoff);
> 
>   if (page) {
>   /* mapping invalidation  */
>   }
>   .
> 
>   if (!page)
>   i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is
> 

Hi hi

I hate this VM_FAULT_LOCKED + !page which means i_mmap_lock. I still
think it solves nothing and that we've done a really, really bad job.

If we *easily* involve the FS in the locking here (which BTW I think XFS
already does), then all this i_mmap_lock can be avoided.

Please remind me again, what race is it supposed to avoid? I get confused.

> Cheers,
> Dave.
> 

Thanks
Boaz



Re: [PATCH v2] dax: fix NULL pointer in __dax_pmd_fault()

2015-09-24 Thread Boaz Harrosh
On 09/23/2015 12:04 PM, Dave Chinner wrote:
> On Tue, Sep 22, 2015 at 08:00:29PM -0700, Dan Williams wrote:
<>
>> The kaddr is coming from the devm_memremap() in the pmem driver that
>> gets unmapped after the device is released by the driver.
> 
> Perhaps the better solution is to not tear down the block device
> until all active references have gone away? i.e. unbind puts the
> device into a persistent error state and forces all active mappings
> to refault. Hence all future accesses error out and then when the
> user unmounts the unhappy filesystem the last reference to the
> blockdev goes away and the mappings can be torn down safely...
> 

Me too

> Cheers,
> 
> Dave.
> 



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 07:19 PM, Dave Hansen wrote:
> On 09/02/2015 09:00 AM, Boaz Harrosh wrote:
>>>> We are going to have 2-socket systems with 6TB of persistent memory in
>>>> them.  I think it's important to design this mechanism so that it scales
>>>> to memory sizes like that and supports large mmap()s.
>>>>
>>>> I'm not sure the application you've seen thus far are very
>>>> representative of what we want to design for.
>>>>
>> We have a patch pending to introduce a new mmap flag that pmem aware
>> applications can set to eliminate any kind of flushing. MMAP_PMEM_AWARE.
> 
> Great!  Do you have a link so that I can review it and compare it to
> Ross's approach?
> 

Ha? I have not seen a new mmap flag from Ross, but I have been off lately,
so it is quite possible that I missed it.

Could you send me a link?

(BTW, I have not released my patch yet; I'll cc you once it's done.)

Thanks
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 10:04 PM, Ross Zwisler wrote:
> On Tue, Sep 01, 2015 at 03:18:41PM +0300, Boaz Harrosh wrote:
<>
>> Apps expect all these to work:
>> 1. open mmap m-write msync ... close
>> 2. open mmap m-write fsync ... close
>> 3. open mmap m-write unmap ... fsync close
>>
>> 4. open mmap m-write sync ...
> 
> So basically you made close have an implicit fsync?  What about the flow that
> looks like this:
> 
> 5. open mmap close m-write
> 

What? No, close means unmap, because you need a file* attached to your vma.

And you misunderstood me: the vm_ops->close is the *unmap* operation, not
the file::close() operation.

I meant memory-cl_flush on unmap before the vma goes away.

> This guy definitely needs an msync/fsync at the end to make sure that the
> m-write becomes durable.  
> 

Exactly; it is done at unmap time.

> Also, the CLOSE(2) man page specifically says that a flush does not occur at
> close:
>   A successful close does not guarantee that the data has been
>   successfully  saved  to  disk,  as  the  kernel defers  writes.   It
>   is not common for a filesystem to flush the buffers when the stream is
>   closed.  If you need to be sure that the data is physically stored,
>   use fsync(2).  (It will depend on the disk  hardware  at this point.)
> 
> I don't think that adding an implicit fsync to close is the right solution -
> we just need to get msync and fsync correctly working.
> 

So the above is not relevant, and we are doing that: taking care of cpu-cache
flushing.
This is not disk-flushing; on a long memcpy from usermode most of the data is
already durable, it is only the leftover margins. Like dax_io in the kernel,
dax implies direct_io always; all we are trying to do is have the least
performance hit in memory-cache-flushing.

It has nothing to do with the text above.

>> The first 3 are supported with the above, because what happens is that at [3]
>> the fsync actually happens on unmap, so the explicit fsync is redundant in that
>> case.
>>
>> The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes
>> per sb to iterate on and call inode-sync on. This causes problems mostly in
>> freeze, because with the actual [3] scenario the file will eventually be closed
>> and persistent, but only after the call to sync returns.
>>
>> It's on my TODO to fix [3] based on instructions from Dave.
>> The mmap call will put the inode on the list and the dax_vm_close will
>> remove it. One of the regular dirty lists should be used, as suggested by
>> Dave.
> 
> I believe in the above two paragraphs you meant [4], so the 
> 
> 4. open mmap m-write sync ...
> 
> case needs to be fixed so that we can detect DAX-dirty inodes?
> 

Yes, I'll be working on sync (the DAX-dirty inode list) soon, but it needs a
working fsync to be in place first (e.g. dax_fsync(inode)).

Thanks
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 06:39 PM, Dave Hansen wrote:
> On 09/02/2015 08:18 AM, Boaz Harrosh wrote:
>> On 09/02/2015 05:23 PM, Dave Hansen wrote:
>>>> I'd be curious what the cost is in practice.  Do you have any actual
>>>> numbers of the cost of doing it this way?
>>>>
>>>> Even if the instruction is a "noop", I'd really expect the overhead to
>>>> really add up for a tens-of-gigabytes mapping, no matter how much the
>>>> CPU optimizes it.
>> What tens-of-gigabytes mapping? I have yet to encounter an application
>> that does that. Our tests show that usually the mmaps are small.
> 
> We are going to have 2-socket systems with 6TB of persistent memory in
> them.  I think it's important to design this mechanism so that it scales
> to memory sizes like that and supports large mmap()s.
> 
> I'm not sure the application you've seen thus far are very
> representative of what we want to design for.
> 

We have a patch pending that introduces a new mmap flag which pmem-aware
applications can set to eliminate any kind of flushing: MMAP_PMEM_AWARE.

This is good for the likes of libnvdimm, which does one large mmap of
all 6T and does not want the clflush penalty on unmap.
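
For illustration, a userspace sketch of how such an application would opt
out. The flag name is from our pending patch; the numeric value is made up,
since nothing is merged yet:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical value; the real one would come from the uapi headers
 * once (if) the MMAP_PMEM_AWARE patch lands. */
#define MMAP_PMEM_AWARE	0x800000

static void *map_pmem_file(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);
	void *p;

	if (fd < 0)
		return NULL;
	/* The app promises to flush its own stores (clwb etc.), so the
	 * kernel may skip all cache flushing on its behalf. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MMAP_PMEM_AWARE, fd, 0);
	close(fd);
	return p == MAP_FAILED ? NULL : p;
}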

>> I can send you micro-benchmark results of mmap vs direct-io random
>> writes. Our code will jump over holes in the file BTW, but I'll ask to also
>> run it with falloc so that all blocks are allocated.
> 
> I'm really just more curious about actual clflush performance on large
> ranges.  I'm curious how good the CPU is at optimizing it.
> 

Again, our test does not do this, because it will only flush the written
extents of the file. The most we have in one machine is 64G of pmem, so even
on a very large mmap the most that can be there is 64G of data, and actually
modifying 64G of data will be much slower than the added clflush on each
cache-line.

Thanks
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 05:23 PM, Dave Hansen wrote:
<>
> I'd be curious what the cost is in practice.  Do you have any actual
> numbers of the cost of doing it this way?
> 
> Even if the instruction is a "noop", I'd really expect the overhead to
> really add up for a tens-of-gigabytes mapping, no matter how much the
> CPU optimizes it.

What tens-of-gigabytes mapping? I have yet to encounter an application
that does that. Our tests show that usually the mmaps are small.

I can send you micro-benchmark results of mmap vs direct-io random
writes. Our code will jump over holes in the file BTW, but I'll ask to also
run it with falloc so that all blocks are allocated.

Give me a few days to collect this.

I guess one optimization we should do is jump over holes and zero-extents.
This will help the case of a mostly sparse, very big file.

Thanks
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 12:47 PM, Kirill A. Shutemov wrote:
<>
> 
> I don't insist on applying the patch. And I worry about false-positives.
> 

Thanks, yes
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 08:17 AM, Dave Chinner wrote:
> On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote:
>> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
>>> Which means applications that should "just work" without
>>> modification on DAX are now subtly broken and don't actually
>>> guarantee data is safe after a crash. That's a pretty nasty
>>> landmine, and goes against *everything* we've claimed about using
>>> DAX with existing applications.
>>>
>>> That's wrong, and needs fixing.
>>
>> I agree that we need to fix fsync as well, and that the fsync solution could
>> be used to implement msync if we choose to go that route.  I think we might
>> want to consider keeping the msync and fsync implementations separate, 
>> though,
>> for two reasons.
>>
>> 1) The current msync implementation is much more efficient than what will be
>> needed for fsync.  Fsync will need to call into the filesystem, traverse all
>> the blocks, get kernel virtual addresses from those and then call
>> wb_cache_pmem() on those kernel addresses.  I think this is a necessary evil
>> for fsync since you don't have a VMA, but for msync we do and we can just
>> flush using the user addresses without any fs lookups.
> 
> Yet you're ignoring the fact that flushing the entire range of the
> relevant VMAs may not be very efficient. It may be a very
> large mapping with only a few pages that need flushing from the
> cache, but you still iterate the mappings flushing GB ranges from
> the cache at a time.
> 

So actually you are wrong about this. We have a working system, and as part
of our testing rig we do performance measurements, constantly. Our random
mmap 4k-writes test performs very well and is on par with the
random-direct-write implementation, even though on every unmap we do a
VMA->start/end cl_flushing.

The cl_flush operation is a no-op if the cacheline is not dirty, and a
memory-bus storm for all the CLs that are dirty. So the only cost
is the iteration from vma->start to vma->end, i += 64.

Let us please agree that we should do the correct thing for now, and let
the complaints about slowness roll in later. You will find that my
proposed solution is not so slow.
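
For reference, the unmap-time flush I keep describing is roughly this, as a
sketch (x86-flavored: clflush() and wmb() are the existing kernel primitives,
and a 64-byte cacheline is assumed):

/* Walk the VMA's user range one cacheline at a time, in the context
 * of the unmapping process. clflush is a no-op on clean lines, so
 * the cost is dominated by the iteration itself. */
static void flush_vma_range(unsigned long start, unsigned long end)
{
	unsigned long p;

	for (p = start & ~63UL; p < end; p += 64)
		clflush((volatile void *)p);
	wmb();	/* order the flushes before we declare persistence */
}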

> We don't need struct pages to track page dirty state. We already
> have a method for doing this that does not rely on having a struct
> page and can be used for efficient lookup of exact dirty ranges. i.e
> the per-page dirty tag that is kept in the mapping radix tree. It's
> fine grained, and extremely efficient in terms of lookups to find
> dirty pages.
> 

In fact you will find that this solution is actually slower, because
you need an extra lock on every major page-fault and you need to
keep the radix-tree populated. Today with dax the radix-tree is
completely empty; it is only ever used if one reads holes. But we
found that this is not common at all. Usually mmap applications read
what is really there.

So the extra work per page will be more than actually doing the fast
no-op cl_flush.

> Indeed, the mapping tree tags were specifically designed to avoid
> this "fsync doesn't know what range to flush" problem for normal
> files. That same problem still exists here for msync - these patches
> are just hitting it with a goddamn massive hammer "because it is
> easy" rather than attempting to do the flushing efficiently.
> 

You come from the disk world, where every extra synced block is a huge waste.
This is memory; it is not what you are used to.

Again, we have benchmarks, and mmap works very, very well, including that
contraption of cl_flushing vma->start..vma->end.

I'll try and open up some time to send a rough draft. My boss will kill me,
but he'll survive.

Thanks
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 06:19 AM, Ross Zwisler wrote:
> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
>> Which means applications that should "just work" without
>> modification on DAX are now subtly broken and don't actually
>> guarantee data is safe after a crash. That's a pretty nasty
>> landmine, and goes against *everything* we've claimed about using
>> DAX with existing applications.
>>
>> That's wrong, and needs fixing.
> 
> I agree that we need to fix fsync as well, and that the fsync solution could
> be used to implement msync if we choose to go that route.  I think we might
> want to consider keeping the msync and fsync implementations separate, though,
> for two reasons.
> 
> 1) The current msync implementation is much more efficient than what will be
> needed for fsync.  Fsync will need to call into the filesystem, traverse all
> the blocks, get kernel virtual addresses from those and then call
> wb_cache_pmem() on those kernel addresses.  

I was thinking about this some more, and no, this is not what we need to do,
because of the virtually-cached ARCHs. And what we do for those systems
will also work for physically-cached ARCHs.

What we need to do is dig into the mapping structure and pick up the current
VMA on the call to fsync, then just flush that one through its virtual
address (since it is current in the context of the fsync syscall).

And of course we need to do as I wrote: we must call fsync on
vm_operations->close before the VMA mapping goes away. Then an fsync after
unmap is a no-op.
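
As a rough sketch of that idea (not mainline code; assume an x86-style
clflush() cacheline primitive):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/fs.h>

/* At fsync time we are in the mapping process's context, so the user
 * virtual addresses of the file's VMAs are valid, and flushing through
 * them is correct even on virtually-indexed caches. */
static void dax_flush_mapped_vmas(struct address_space *mapping)
{
	struct vm_area_struct *vma;
	unsigned long p;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
		if (!(vma->vm_flags & VM_SHARED))
			continue;
		for (p = vma->vm_start; p < vma->vm_end; p += 64)
			clflush((volatile void *)p); /* no-op on clean lines */
	}
	i_mmap_unlock_read(mapping);
	wmb();
}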

> I think this is a necessary evil
> for fsync since you don't have a VMA, but for msync we do and we can just
> flush using the user addresses without any fs lookups.
> 

Right, see above.

> 2) I believe that the near-term fsync code will rely on struct pages for
> PMEM, which I believe are possible but optional as of Dan's last patch set:
> 
> https://lkml.org/lkml/2015/8/25/841
> 
> I believe that this means that if we don't have struct pages for PMEM (becuase
> ZONE_DEVICE et al. are turned off) fsync won't work.  I'd be nice not to lose
> msync as well.

Please see above, it can be made to work. Actually what we do is the
kernel-pointer-traversal thing, plus the fsync-on-unmap. And it works; we
have heavy persistence testing and it is all very good.

So no, without pages it can all work very well. There is only the sync
problem, which I intend to fix soon; it is only a matter of keeping a
dax-dirty inode list per sb.

So no, this is not an excuse.

Cheers
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 12:37 PM, Boaz Harrosh wrote:
>>  
>> +   /*
>> +* Make sure that for VM_MIXEDMAP VMA has both
>> +* vm_ops->page_mkwrite and vm_ops->pfn_mkwrite or has none.
>> +*/
>> +   if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) 
>> &&
>> +   vma->vm_flags & VM_MIXEDMAP) {
>> +   VM_BUG_ON_VMA(!vma->vm_ops->page_mkwrite, vma);
>> +   VM_BUG_ON_VMA(!vma->vm_ops->pfn_mkwrite, vma);
> 
> BTW: page_mkwrite is used for reading of holes that put zero-pages in the
> radix tree. One can just map a single global zero-page in pfn-mode for that.
> 
> Kirill, hi. Please don't make these BUG_ONs, it's counter-productive, believe me.
> Please make them WARN_ON_ONCE(); it is not a crashing bug to work like this.
> (Actually it is not a bug at all in some cases, but we can relax that when a
> user comes up.)
> 
> Thanks
> Boaz
> 

On second thought, I do not like this patch. This is what we have xfstests
for; the fact is that test 080 catches this. For me this is enough.

An FS developer should test his code, and worst case we help him on the ML,
like we did in this case.

Thanks
Boaz

>> +   }
>> addr = vma->vm_start;
>> vm_flags = vma->vm_flags;
>> } else if (vm_flags & VM_SHARED) {
>>
> 



Re: [PATCH] dax, pmem: add support for msync

2015-09-02 Thread Boaz Harrosh
On 09/02/2015 12:13 PM, Kirill A. Shutemov wrote:
> On Wed, Sep 02, 2015 at 08:49:22AM +1000, Dave Chinner wrote:
>> On Tue, Sep 01, 2015 at 01:08:04PM +0300, Kirill A. Shutemov wrote:
>>> On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
 On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
 Even for DAX, msync has to call vfs_fsync_range() for the filesystem to 
 commit
 the backing store allocations to stable storage, so there's not
 getting around the fact msync is the wrong place to be flushing
 DAX mappings to persistent storage.
>>>
>>> Why?
>>> IIUC, msync() doesn't have any requirements wrt metadata, right?
>>
>> Of course it does. If the backing store allocation has not been
>> committed, then after a crash there will be a hole in file and
>> so it will read as zeroes regardless of what data was written and
>> flushed.
> 
> Any reason why backing store allocation cannot be committed on *_mkwrite?
> 
 I pointed this out almost 6 months ago (i.e. that fsync was broken)
 anf hinted at how to solve it. Fix fsync, and msync gets fixed for
 free:

 https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html

 I've also reported to Willy that DAX write page faults don't work
 correctly, either. xfstests generic/080 exposes this: a read
 from a page followed immediately by a write to that page does not
 result in ->page_mkwrite being called on the write and so
 backing store is not allocated for the page, nor are the timestamps
 for the file updated. This will also result in fsync (and msync)
 not working properly.
>>>
>>> Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
>>
>> I didn't know that had been committed. I don't recall seeing a pull
>> request with that in it
> 
> It went though -mm tree.
> 
>> none of the XFS DAX patches conflicted
>> against it and there's been no runtime errors. I'll fix it up.
>>
>> As such, shouldn't there be a check in the VM (in ->mmap callers)
>> that if we have the vma is returned with VM_MIXEDMODE enabled that
>> ->pfn_mkwrite is not NULL?  It's now clear to me that any filesystem
>> that sets VM_MIXEDMODE needs to support both page_mkwrite and
>> pfn_mkwrite, and such a check would have caught this immediately...
> 
> I guess it's "both or none" case. We have VM_MIXEDMAP users who don't care
> about *_mkwrite.
> 
> I'm not yet sure it would be always correct, but something like this will
> catch the XFS case, without false-positive on other stuff in my KVM setup:
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3f78bceefe5a..f2e29a541e14 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1645,6 +1645,15 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
> vma->vm_ops = &dummy_ops;
> }
>  
> +   /*
> +* Make sure that for VM_MIXEDMAP VMA has both
> +* vm_ops->page_mkwrite and vm_ops->pfn_mkwrite or has none.
> +*/
> +   if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) &&
> +   vma->vm_flags & VM_MIXEDMAP) {
> +   VM_BUG_ON_VMA(!vma->vm_ops->page_mkwrite, vma);
> +   VM_BUG_ON_VMA(!vma->vm_ops->pfn_mkwrite, vma);

BTW: page_mkwrite is used for reading of holes that put zero-pages in the
radix tree. One can just map a single global zero-page in pfn-mode for that.

Kirill, hi. Please don't make these BUG_ONs, it's counter-productive, believe me.
Please make them WARN_ON_ONCE(); it is not a crashing bug to work like this.
(Actually it is not a bug at all in some cases, but we can relax that when a
user comes up.)
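
i.e. something like this instead of the VM_BUG_ON_VMA()s, with the same
condition as your hunk, just non-fatal:

	/*
	 * Warn once and keep going; a missing pfn_mkwrite (or
	 * page_mkwrite) on a VM_MIXEDMAP vma is survivable.
	 */
	if ((vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite) &&
	    vma->vm_flags & VM_MIXEDMAP) {
		WARN_ON_ONCE(!vma->vm_ops->page_mkwrite);
		WARN_ON_ONCE(!vma->vm_ops->pfn_mkwrite);
	}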

Thanks
Boaz

> +   }
> addr = vma->vm_start;
> vm_flags = vma->vm_flags;
> } else if (vm_flags & VM_SHARED) {
> 



Re: [PATCH] dax, pmem: add support for msync

2015-09-01 Thread Boaz Harrosh
On 08/31/2015 09:59 PM, Ross Zwisler wrote:
> For DAX msync we just need to flush the given range using
> wb_cache_pmem(), which is now a public part of the PMEM API.
> 
> The inclusion of  in fs/dax.c was done to make checkpatch
> happy.  Previously it was complaining about a bunch of undeclared
> functions that could be made static.
> 
> Signed-off-by: Ross Zwisler 
> ---
> This patch is based on libnvdimm-for-next from our NVDIMM tree:
> 
> https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git/
> 
> with some DAX patches on top.  The baseline tree can be found here:
> 
> https://github.com/01org/prd/tree/dax_msync
> ---
>  arch/x86/include/asm/pmem.h | 13 +++--
>  fs/dax.c| 17 +
>  include/linux/dax.h |  1 +
>  include/linux/pmem.h| 22 +-
>  mm/msync.c  | 10 +-
>  5 files changed, 55 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index d8ce3ec..85c07b2 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
>  }
>  
>  /**
> - * __arch_wb_cache_pmem - write back a cache range with CLWB
> - * @vaddr:   virtual start address
> + * arch_wb_cache_pmem - write back a cache range with CLWB
> + * @addr:virtual start address
>   * @size:number of bytes to write back
>   *
>   * Write back a cache range using the CLWB (cache line write back)
>   * instruction.  This function requires explicit ordering with an
> - * arch_wmb_pmem() call.  This API is internal to the x86 PMEM 
> implementation.
> + * arch_wmb_pmem() call.
>   */
> -static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
> +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
>  {
>   u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
>   unsigned long clflush_mask = x86_clflush_size - 1;
> + void *vaddr = (void __force *)addr;
>   void *vend = vaddr + size;
>   void *p;
>  
> @@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem 
> *addr, size_t bytes,
>   len = copy_from_iter_nocache(vaddr, bytes, i);
>  
>   if (__iter_needs_pmem_wb(i))
> - __arch_wb_cache_pmem(vaddr, bytes);
> + arch_wb_cache_pmem(addr, bytes);
>  
>   return len;
>  }
> @@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, 
> size_t size)
>   else
>   memset(vaddr, 0, size);
>  
> - __arch_wb_cache_pmem(vaddr, size);
> + arch_wb_cache_pmem(addr, size);
>  }
>  
>  static inline bool __arch_has_wmb_pmem(void)
> diff --git a/fs/dax.c b/fs/dax.c
> index fbe18b8..ed6aec1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -25,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, 
> get_block_t get_block)
>   return dax_zero_page_range(inode, from, length, get_block);
>  }
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
> +
> +void dax_sync_range(unsigned long addr, size_t len)
> +{
> + while (len) {
> + size_t chunk_len = min_t(size_t, SZ_1G, len);
> +

Where does the SZ_1G come from? Is it because you want to do cond_resched()
every 1G bytes, so as not to get stuck for a long time?

It took me a while to catch. At first I thought it might be due to
wb_cache_pmem() limitations. Would you put a comment in the next iteration?

> + wb_cache_pmem((void __pmem *)addr, chunk_len);
> + len -= chunk_len;
> + addr += chunk_len;
> + if (len)
> + cond_resched();
> + }
> + wmb_pmem();
> +}
> +EXPORT_SYMBOL_GPL(dax_sync_range);
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b415e52..504b33f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -14,6 +14,7 @@ int dax_fault(struct vm_area_struct *, struct vm_fault *, 
> get_block_t,
>   dax_iodone_t);
>  int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
>   dax_iodone_t);
> +void dax_sync_range(unsigned long addr, size_t len);
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>   unsigned int flags, get_block_t, dax_iodone_t);
> diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> index 85f810b3..aa29ebb 100644
> --- a/include/linux/pmem.h
> +++ b/include/linux/pmem.h
> @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, 
> size_t size)
>  {
>   BUG();

See below

>  }
> +
> +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
> +{
> + BUG();

There is a clflush_cache_range() defined for generic use. On ADR sys

Re: [PATCH] dax, pmem: add support for msync

2015-09-01 Thread Boaz Harrosh
On 09/01/2015 10:06 AM, Christoph Hellwig wrote:
> On Tue, Sep 01, 2015 at 09:38:03AM +1000, Dave Chinner wrote:
>> On Mon, Aug 31, 2015 at 12:59:44PM -0600, Ross Zwisler wrote:
>>> For DAX msync we just need to flush the given range using
>>> wb_cache_pmem(), which is now a public part of the PMEM API.
>>
>> This is wrong, because it still leaves fsync() broken on dax.
>>
>> Flushing dirty data to stable storage is the responsibility of the
>> writeback infrastructure, not the VMA/mm infrasrtucture. For non-dax
>> configurations, msync defers all that to vfs_fsync_range(), because
>> it has to be implemented there for fsync() to work.
>>
>> Even for DAX, msync has to call vfs_fsync_range() for the filesystem to 
>> commit
>> the backing store allocations to stable storage, so there's not
>> getting around the fact msync is the wrong place to be flushing
>> DAX mappings to persistent storage.
> 
> DAX does call ->fsync before and after this patch.  And with all
> the recent fixes we take care to ensure data is written though the
> cache for everything but mmap-access.  With this patch from Ross
> we ensure msync writes back the cache before calling ->fsync so that
> the filesystem can then do it's work like converting unwritten extents.
> 
> The only downside is that previously on Linux you could always use
> fsync as a replaement for msymc, which isn't true anymore for DAX.
> 

Hi Christoph

So the approach we took was a bit different, exactly to solve these problems,
and also to not over-flush too much. Here is what we did:

* At vm_operations_struct we also override the .close vector (say, call it
  dax_vm_close).

* At dax_vm_close(), on writable files, call ->fsync(, vma->vm_start, vma->vm_end,).
  (We have an inode flag that says whether the file was actually dirtied, but
   even if not, that is not so bad: a file was opened for write, mmapped, but
   never actually modified. There are not a lot of these, and the do-nothing
   cl_flushing is very fast.)

* At ->fsync() do the actual cl_flush for all cases, but only iff
	if (mapping_mapped(inode->i_mapping) == 0)
		return 0;

  This is because data written not through mmap is already persistent and we
  do not need the cl_flushing. (A minimal sketch of this scheme is below.)
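
A minimal sketch of the scheme, with a made-up fsync name and the actual
cl_flush body elided; the range passed to fsync is the vma range, as
described above:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Called when the last mapping of the VMA goes away. */
static void dax_vm_close(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_SHARED)
		vfs_fsync_range(vma->vm_file, vma->vm_start, vma->vm_end, 0);
}

/* The fs ->fsync() short-circuit described in the last bullet. */
static int my_dax_fsync(struct file *file, loff_t start, loff_t end,
			int datasync)
{
	struct inode *inode = file_inode(file);

	/* Non-mmap DAX writes went through the cache already. */
	if (mapping_mapped(inode->i_mapping) == 0)
		return 0;

	/* ... cl_flush the file's written extents here ... */
	return 0;
}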

Apps expect all these to work:
1. open mmap m-write msync ... close
2. open mmap m-write fsync ... close
3. open mmap m-write unmap ... fsync close

4. open mmap m-write sync ...

The first 3 are supported with the above, because what happens is that at [3]
the fsync actually happens on unmap, so the explicit fsync is redundant in that
case.

The only broken scenario is [3]. We do not have a list of "dax-dirty" inodes
per sb to iterate on and call inode-sync on. This causes problems mostly in
freeze, because with the actual [3] scenario the file will eventually be closed
and persistent, but only after the call to sync returns.

It's on my TODO to fix [3] based on instructions from Dave.
The mmap call will put the inode on the list and the dax_vm_close will
remove it. One of the regular dirty lists should be used, as suggested by
Dave.
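
Sketched with made-up names; the real thing would reuse one of the existing
inode dirty lists, as Dave suggested:

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Would live in the per-sb fs info. */
struct dax_dirty_track {
	spinlock_t	 lock;
	struct list_head inodes;	/* inodes with live DAX mmaps */
};

/* mmap time: remember the inode so sync(2)/freeze can find it. */
static void dax_track_inode(struct dax_dirty_track *t, struct list_head *entry)
{
	spin_lock(&t->lock);
	list_add_tail(entry, &t->inodes);
	spin_unlock(&t->lock);
}

/* dax_vm_close() time: flush, then forget the inode. */
static void dax_untrack_inode(struct dax_dirty_track *t, struct list_head *entry)
{
	spin_lock(&t->lock);
	list_del_init(entry);
	spin_unlock(&t->lock);
}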

> But given that we need the virtual address to write back the cache
> I can't see how to do this differently given that clwb() needs the
> user virtual address to flush the cache.

On Intel, or any system that has physically-indexed caching, this is not
a problem: you just iterate over all the get_block() results for the range
and flush the kernel's virt_addr of each block; this is easy.
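
As a sketch of that iteration (the kaddr lookup is elided; in 4.3 it would
come from bdev_direct_access(), and clflush_cache_range() would do the flush):

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* On physically-indexed caches, resolve each file block via the fs's
 * get_block and flush its kernel alias. */
static int dax_flush_blocks(struct inode *inode, sector_t first,
			    sector_t last, get_block_t get_block)
{
	sector_t blk;

	for (blk = first; blk <= last; blk++) {
		struct buffer_head bh = { .b_size = 1 << inode->i_blkbits };
		int err = get_block(inode, blk, &bh, 0);

		if (err)
			return err;
		if (!buffer_mapped(&bh))
			continue;	/* hole: nothing to flush */
		/* kaddr = bdev_direct_access(bh.b_bdev, ...); */
		/* clflush_cache_range(kaddr, bh.b_size); */
	}
	return 0;
}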

With ARCHs with per-VM caching you need to go through the i_mapping VMA list
and flush like that. I guess there is a way to schedule yourself into the
context of a process VMA somehow.
I'm not sure how to solve this split; perhaps two generic functions,
selected through the ARCH.

Just my $0.017
Boaz



Re: [PATCH] dax, pmem: add support for msync

2015-09-01 Thread Boaz Harrosh
On 09/01/2015 01:08 PM, Kirill A. Shutemov wrote:
<>
> 
> Is that because XFS doesn't provide vm_ops->pfn_mkwrite?
> 

Right, that would explain it, because I sent that patch exactly to solve
this problem. I haven't looked at the latest code for a while, but I should
check out the latest and make a patch for xfs if it is indeed missing.

Thanks
Boaz



Re: [PATCH] mm, dax: VMA with vm_ops->pfn_mkwrite wants to be write-notified

2015-09-01 Thread Boaz Harrosh
On 09/01/2015 02:10 PM, Kirill A. Shutemov wrote:
> On Tue, Sep 01, 2015 at 01:54:26PM +0300, Boaz Harrosh wrote:
>> On 09/01/2015 01:22 PM, Kirill A. Shutemov wrote:
>>> For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
>>> vm_ops->page_mkwrite to notify about write access. This means we want
>>> vma->vm_page_prot to be write-protected if the VMA provides this vm_ops.
>>>
>>
>> Hi Kirill
>>
>> I will test with this right away and ACK on this.
>>
>> Hmm, so are you saying we might be missing some buffer modifications right
>> now?
>>
>> What would be a theoretical scenario that would cause these missed events?
> 
> On writable mapping with vm_ops->pfn_mkwrite, but without
> vm_ops->page_mkwrite: read fault followed by write access to the pfn.
> Writable pte will be set up on read fault and write fault will not be
> generated.
> 
> I found it examining Dave's complain on generic/080:
> 
> http://lkml.kernel.org/g/20150831233803.GO3902@dastard
> 
> Although I don't think it's the reason.
> 
>> I would like to put a test in our test rigs that should fail today and that
>> this patch fixes.
>>
>> [In our system every modified pmem block is also RDMAed to a remote
>>  pmem for HA, a missed modification will make the two copies unsynced]
> 
> It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
> and page_mkwrite.
> 

Ah, right, we have both as well, and so should xfs I think (because of the
zero-pages thing; in fact any dax.c user should).

Thanks, this explains why we could not see any such breakage.

Acked-by: Boaz Harrosh 



Re: [PATCH] mm, dax: VMA with vm_ops->pfn_mkwrite wants to be write-notified

2015-09-01 Thread Boaz Harrosh
On 09/01/2015 01:22 PM, Kirill A. Shutemov wrote:
> For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
> vm_ops->page_mkwrite to notify about write access. This means we want
> vma->vm_page_prot to be write-protected if the VMA provides this vm_ops.
> 

Hi Kirill

I will test with this right away and ACK on this.

Hmm, so are you saying we might be missing some buffer modifications right now?

What would be a theoretical scenario that would cause these missed events?
I would like to put a test in our test rigs that should fail today and that
this patch fixes.

[In our system every modified pmem block is also RDMAed to a remote
 pmem for HA, a missed modification will make the two copies unsynced]

Thanks for catching this
Boaz

> Signed-off-by: Kirill A. Shutemov 
> Cc: Yigal Korman 
> Cc: Boaz Harrosh 
> Cc: Matthew Wilcox 
> Cc: Jan Kara 
> Cc: Dave Chinner 
> ---
>  mm/mmap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index df6d5f07035b..3f78bceefe5a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1498,7 +1498,8 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
>   return 0;
>  
>   /* The backer wishes to know when pages are first written to? */
> - if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> + if (vma->vm_ops &&
> + (vma->vm_ops->page_mkwrite || vma->vm_ops->pfn_mkwrite))
>   return 1;
>  
>   /* The open routine did something to the protections that pgprot_modify
> 


