Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation

2020-11-24 Thread h...@infradead.org
On Mon, Nov 23, 2020 at 08:44:12PM +, Edgecombe, Rick P wrote:
> Well, there were two reasons:
> 1. Non-standard naming for the PAGE_FOO flags. For example,
> PAGE_KERNEL_ROX vs PAGE_KERNEL_READ_EXEC. This could be unified. I
> think it's just riscv that breaks the conventions. Others are just
> missing some.

We need to standardize those anyway.  I've done that for a few
PAGE_* constants already but as you see there is more work to do.

> But I thought that using those pgprot flags was still sort of overloading
> the meaning of pgprot. My understanding was that it is supposed to hold
> the actual bits set in the PTE. For example large pages or TLB hints
> (like PAGE_KERNEL_EXEC_CONT) could set or unset extra bits, so asking
> for PAGE_KERNEL_EXEC wouldn't necessarily mean "set these bits in all
> of the PTEs", it could mean something more like "infer what I want from
> these bits and do that".
> 
> x86's cpa will also avoid changing NX if it is not supported, so if the
> caller asked for PAGE_KERNEL->PAGE_KERNEL_EXEC in perm_change() it
> should not necessarily bother setting all of the PAGE_KERNEL_EXEC bits
> in the actual PTEs. Asking for PERM_RW->PERM_RWX on the other hand,
> would let the implementation do whatever it needs to set the memory
> executable, like set_memory_x() does. It should work either way but
> seems like the expectations would be a little clearer with the PERM_
> flags.

Ok, maybe that is an argument, and we should use the new flags more
broadly.
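
To make the argument concrete, a rough sketch of the difference (the
PERM_* names follow the RFC; the values and the translation helper are
assumptions for illustration only):

typedef u8 virtual_perm;
#define PERM_R		1
#define PERM_W		2
#define PERM_X		4
#define PERM_RWX	(PERM_R | PERM_W | PERM_X)

/*
 * A PERM_* mask only states intent; the arch code picks the PTE bits
 * and may e.g. skip toggling NX on hardware without it.  A pgprot_t,
 * in contrast, nominally is the PTE bits themselves.
 */
static pgprot_t perm_to_pgprot(virtual_perm perm)
{
	switch (perm) {
	case PERM_RWX:
		return PAGE_KERNEL_EXEC;
	case PERM_R | PERM_X:
		return PAGE_KERNEL_ROX;
	case PERM_R | PERM_W:
		return PAGE_KERNEL;
	default:
		return PAGE_KERNEL_RO;
	}
}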

> Could easily wrap this one, but just to clarify, do you mean lines over
> 80 chars? There were already some over 80 in vmalloc before the move to
> 100 chars, so figured it was ok to stretch out now.

CodingStyle still says 80 characters unless you have an exception where
a longer line improves readability.  The quoted code absolutely does
not fit the definition of an exception or improve readability.


Re: [RFC PATCH 1/9] cxl/acpi: Add an acpi_cxl module for the CXL interconnect

2020-11-10 Thread h...@infradead.org
On Wed, Nov 11, 2020 at 07:30:34AM +, Verma, Vishal L wrote:
> Hi Christoph,
> 
> I thought 100 col. lines were acceptable now.

Quote from the coding style document:

"The preferred limit on the length of a single line is 80 columns.

Statements longer than 80 columns should be broken into sensible chunks,
unless exceeding 80 columns significantly increases readability and does
not hide information."

So yes, they are acceptable as an exception.  Not for crap like this.


Re: [dm-devel] [PATCH 0/2] block layer filter and block device snapshot module

2020-10-23 Thread h...@infradead.org
On Fri, Oct 23, 2020 at 12:31:05PM +0200, Hannes Reinecke wrote:
> My thoughts went more into the direction of hooking into ->submit_bio,
> seeing that it's a NULL pointer for most (all?) block drivers.
> 
> But sure, I'll check how the interposer approach would turn out.

submit_bio is owned by the underlying device, and for a good reason it
is stored in a const struct..
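
For reference, since the 5.9 move of ->submit_bio into the disk's
operations the method sits in a table like the following (a sketch,
driver names invented), which is why a filter cannot patch it at
runtime:

static blk_qc_t example_submit_bio(struct bio *bio);

static const struct block_device_operations example_fops = {
	.owner		= THIS_MODULE,
	.submit_bio	= example_submit_bio,	/* const, not patchable */
};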


Re: [PATCH 0/2] block layer filter and block device snapshot module

2020-10-23 Thread h...@infradead.org
On Thu, Oct 22, 2020 at 01:54:16PM -0400, Mike Snitzer wrote:
> On Thu, Oct 22, 2020 at 11:14 AM Darrick J. Wong
> > Stupid question: Why don't you change the block layer to make it
> > possible to insert device mapper devices after the blockdev has been set
> > up?
> 
> Not a stupid question.  Definitely something that us DM developers
> have wanted to do for a while.  Devil is in the details but it is the
> right way forward.
> 

Yes, I think that is the right thing to do.  And I don't think it should
be all that hard.  All we'd need in the I/O path is something like the
pseudo-patch below, which will allow the interposer driver to resubmit
bios using submit_bio_noacct as long as the driver sets BIO_INTERPOSED.

diff --git a/block/blk-core.c b/block/blk-core.c
index ac00d2fa4eb48d..3f6f1eb565e0a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1051,6 +1051,9 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
+	if (blk_has_interposer(bio->bi_disk) &&
+	    !(bio->bi_flags & BIO_INTERPOSED))
+		return __submit_bio_interposed(bio);
 	if (!bio->bi_disk->fops->submit_bio)
 		return __submit_bio_noacct_mq(bio);
 	return __submit_bio_noacct(bio);
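
To complete the picture, __submit_bio_interposed could look roughly
like this (hypothetical, following the pseudo-patch's flag convention;
the ->interposer hook is an assumption):

static blk_qc_t __submit_bio_interposed(struct bio *bio)
{
	/* mark the bio so the check above lets it pass on resubmission */
	bio->bi_flags |= BIO_INTERPOSED;
	return bio->bi_disk->interposer->submit_bio(bio);
}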


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-09-08 Thread h...@infradead.org
On Mon, Sep 07, 2020 at 12:31:42PM +0530, Kanchan Joshi wrote:
> But there are use-cases which benefit from supporting zone-append on
> raw block-dev path.
> Certain user-space log-structured/cow FS/DB will use the device that
> way. Aerospike is one example.
> Pass-through is synchronous, and we lose the ability to use io-uring.

So use zonefs, which is designed exactly for that use case.


Re: [PATCH] PCI/ASPM: Enable ASPM for links under VMD domain

2020-08-29 Thread h...@infradead.org
On Thu, Aug 27, 2020 at 05:49:40PM +, Limonciello, Mario wrote:
> Can you further elaborate what exactly you're wanting here?  VMD
> enable/disable is something that is configured in firmware setup as the
> firmware does the early configuration for the silicon related to it.  So
> it's up to the OEM whether to offer the knob to an end user.
> 
> At least for Dell this setting also does export to sysfs and can be
> turned on/off around a reboot cycle via this:
> https://patchwork.kernel.org/patch/11693231/.
> 
> As was mentioned earlier in this thread VMD is likely to be defaulting
> to "on" for many machines with the upcoming silicon.  Making it work
> well on Linux is preferable to again having to change firmware settings
> between operating systems like the NVME remapping thing from earlier
> silicon required.

And the right answer is to turn it off, but we really need to do that
at runtime, and not over a reboot cycle..


Re: [PATCH] PCI/ASPM: Enable ASPM for links under VMD domain

2020-08-29 Thread h...@infradead.org
On Thu, Aug 27, 2020 at 02:33:56PM -0700, Dan Williams wrote:
> > Just a few benefits and there are other users with unique use cases:
> > 1. Passthrough of the endpoint to OSes which don't natively support
> > hotplug can enable hotplug for that OS using the guest VMD driver
> > 2. Some hypervisors have a limit on the number of devices that can be
> > passed through. VMD endpoint is a single device that expands to many.
> > 3. Expansion of possible bus numbers beyond 256 by using other
> > segments.
> > 4. Custom RAID LED patterns driven by ledctl
> >
> > I'm not trying to market this. Just pointing out that this isn't
> > "bringing zero actual benefits" to many users.
> >
> 
> The initial intent of the VMD driver was to allow Linux to find and
> initialize devices behind a VMD configuration where VMD was required
> for a non-Linux OS. For Linux, if full native PCI-E is an available
> configuration option I think it makes sense to recommend Linux users
> to flip that knob rather than continue to wrestle with the caveats of
> the VMD driver. Where that knob isn't possible / available VMD can be
> a fallback, but full native PCI-E is what Linux wants in the end.

Agreed.  And the thing we need for this to really work is to turn VMD
off without a reboot when we find it.  That would solve all the problems
we have, and also allow an amazing kernel hacker like Derrick to do
more productive work than coming up with one VMD workaround after
another.


Re: [PATCH] PCI/ASPM: Enable ASPM for links under VMD domain

2020-08-27 Thread h...@infradead.org
On Thu, Aug 27, 2020 at 04:45:53PM +, Derrick, Jonathan wrote:
> Just a few benefits and there are other users with unique use cases:
> 1. Passthrough of the endpoint to OSes which don't natively support
> hotplug can enable hotplug for that OS using the guest VMD driver

Or they could just write a hotplug driver, which would be more useful
than writing a VMD driver.

> 2. Some hypervisors have a limit on the number of devices that can be
> passed through. VMD endpoint is a single device that expands to many.

Or you just fix the hypervisor.  Never mind that this is so much
less likely than wanting to pass an individual device or VF to a guest,
which VMD makes impossible (at least without tons of hacks specifically
for it).

> 3. Expansion of possible bus numbers beyond 256 by using other
> segments.

Which we can trivially do with PCI domains.

> 4. Custom RAID LED patterns driven by ledctl

Which you can also do by any other vendor specific way.

> 
> I'm not trying to market this. Just pointing out that this isn't
> "bringing zero actual benefits" to many users.

Which of those are a benefit to a Linux user?  Seriously, I really don't
care if Intel wants to sell VMD as a value add to those that have
a perceived or in rare cases even real need.  Just let Linux opt out
of it instead of needing special quirks all over.


Re: [PATCH] PCI/ASPM: Enable ASPM for links under VMD domain

2020-08-27 Thread h...@infradead.org
On Thu, Aug 27, 2020 at 04:13:44PM +, Derrick, Jonathan wrote:
> On Thu, 2020-08-27 at 06:34 +0000, h...@infradead.org wrote:
> > On Wed, Aug 26, 2020 at 09:43:27PM +, Derrick, Jonathan wrote:
> > > Feel free to review my set to disable the MSI remapping which will
> > > make
> > > it perform as well as direct-attached:
> > > 
> > > https://patchwork.kernel.org/project/linux-pci/list/?series=325681
> > 
> > So that then we have to deal with your schemes to make individual
> > device direct assignment work in a convoluted way?
> 
> That's not the intent of that patchset -at all-. It was to address the
> performance bottlenecks with VMD that you constantly complain about. 

I know.  But once we fix that bottleneck there is the next issue to
fix, and then the next to tackle.  While at the same time VMD brings
zero actual benefits.

> > Please just give us
> > a disable knob for VMD, which solves _all_ these problems without
> > adding
> > any.
> 
> I don't see the purpose of this line of discussion. VMD has been in the
> kernel for 5 years. We are constantly working on better support.

Please just work with the platform people to allow the host to disable
VMD.  That is the only really useful value add here.


Re: [PATCH] PCI/ASPM: Enable ASPM for links under VMD domain

2020-08-27 Thread h...@infradead.org
On Wed, Aug 26, 2020 at 09:43:27PM +, Derrick, Jonathan wrote:
> Feel free to review my set to disable the MSI remapping which will make
> it perform as well as direct-attached:
> 
> https://patchwork.kernel.org/project/linux-pci/list/?series=325681

So that then we have to deal with your schemes to make individual
device direct assignment work in a convoluted way?  Please just give us
a disable knob for VMD, which solves _all_ these problems without adding
any.


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-08-14 Thread h...@infradead.org
On Fri, Aug 14, 2020 at 08:27:13AM +, Damien Le Moal wrote:
> > 
> > O_APPEND pretty much implies out of order, as there is no way for an
> > application to know which thread wins the race to write the next chunk.
> 
> Yes and no. If the application threads do not synchronize their calls to
> io_submit(), then yes indeed, things can get out of order. But if the
> application threads are synchronized, then the offset set for each append AIO
> will be in sequence of submission, so the user will not see its writes
> completing at different write offsets than these implied offsets.

Nothing guarantees any kind of ordering for two separate io_submit calls.
The kernel may delay one of them for any reason.

Now if you are doing two fully synchronous write calls on an
O_APPEND fd, yes they are serialized.  But using Zone Append won't
change that.


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-08-14 Thread h...@infradead.org
On Wed, Aug 05, 2020 at 07:35:28AM +, Damien Le Moal wrote:
> > the write pointer.  The only interesting addition is that we also want
> > to report where we wrote.  So I'd rather have RWF_REPORT_OFFSET or so.
> 
> That works for me. But that rules out having the same interface for raw
> block devices since O_APPEND has no meaning in that case. So for raw
> block devices, it will have to be through zonefs. That works for me, and
> I think it was your idea all along. Can you confirm please ?

Yes.  I don't think raw syscall level access to the zone append
primitive makes sense.  Either use zonefs for a file-like API, or
use the NVMe pass through interface for 100% raw access.

> >  - take the exclusive per-inode (zone) lock and just issue either normal
> >writes or zone append at your choice, relying on the lock to
> >serialize other writers.  For the async case this means we need a
> >lock that can be released in a different context than it was acquired,
> >which is a little ugly but can be done.
> 
> Yes, that would be possible. But likely, this will also need calls to
> inode_dio_wait() to avoid ending up with a mix of regular write and
> zone append writes in flight (which likely would result in the regular
> write failing as the zone append writes would go straight to the device
> without waiting for the zone write lock like regular writes do).

inode_dio_wait is a really bad implementation of almost a lock.  I've
started some work that I need to finish to just replace it with proper
non-owner rwsems (or even the range locks Dave has been looking into).

> 
> This all sounds sensible to me. One last point though, specific to
> zonefs: if the user opens a zone file with O_APPEND, I do want to have
> that necessarily mean "use zone append". And same for the
> "RWF_REPORT_OFFSET". The point here is that both O_APPEND and
> RWF_REPORT_OFFSET can be used with both regular writes and zone append
> writes, but none of them actually clearly specify if the
> application/user tolerates writing data to disk in a different order
> than the issuing order... So another flag to indicate "atomic
> out-of-order writes" (== zone append) ?

O_APPEND pretty much implies out of order, as there is no way for an
application to know which thread wins the race to write the next chunk.


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread h...@infradead.org
And FYI, this is what I'd do for a hacky aio-only prototype (untested):


diff --git a/fs/aio.c b/fs/aio.c
index 91e7cc4a9f179b..42b1934e38758b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1438,7 +1438,10 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
 	}
 
 	iocb->ki_res.res = res;
-	iocb->ki_res.res2 = res2;
+	if ((kiocb->ki_flags & IOCB_REPORT_OFFSET) && res > 0)
+		iocb->ki_res.res2 = kiocb->ki_pos - res;
+	else
+		iocb->ki_res.res2 = res2;
 	iocb_put(iocb);
 }
 
@@ -1452,6 +1455,8 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 	req->ki_flags = iocb_flags(req->ki_filp);
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
 		req->ki_flags |= IOCB_EVENTFD;
+	if (iocb->aio_flags & IOCB_FLAG_REPORT_OFFSET)
+		req->ki_flags |= IOCB_REPORT_OFFSET;
 	req->ki_hint = ki_hint_validate(file_write_hint(req->ki_filp));
 	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
 		/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5abba86107d86..522b0a3437d420 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -316,6 +316,7 @@ enum rw_hint {
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
 #define IOCB_NOIO		(1 << 9)
+#define IOCB_REPORT_OFFSET	(1 << 10)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 8387e0af0f768a..e4313d7aa3b7e7 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -55,6 +55,7 @@ enum {
  */
 #define IOCB_FLAG_RESFD		(1 << 0)
 #define IOCB_FLAG_IOPRIO	(1 << 1)
+#define IOCB_FLAG_REPORT_OFFSET	(1 << 2)
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
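
For illustration, userspace consumption of the prototype could look
like this; the flag value and the res2 convention come from the sketch
above, and raw syscall wrappers stand in for a libaio update:

#include <linux/aio_abi.h>
#include <linux/fs.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

#define IOCB_FLAG_REPORT_OFFSET	(1 << 2)	/* from the sketch above */

static long append_write(aio_context_t ctx, int fd, const void *buf,
			 size_t len, long long *written_off)
{
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = len;
	cb.aio_rw_flags = RWF_APPEND;		/* append to the write pointer */
	cb.aio_flags = IOCB_FLAG_REPORT_OFFSET;	/* report where we landed */

	if (syscall(SYS_io_submit, ctx, 1, cbs) != 1)
		return -1;
	if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1)
		return -1;
	*written_off = ev.res2;	/* ki_pos - res, per aio_complete_rw above */
	return ev.res;
}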


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread h...@infradead.org
On Fri, Jul 31, 2020 at 10:16:49AM +, Damien Le Moal wrote:
> > 
> > Let's keep semantics and implementation separate.  For the case
> > where we report the actual offset we need a size limitation and no
> > short writes.
> 
> OK. So the name of the flag confused me. The flag name should reflect "Do zone
> append and report written offset", right ?

Well, we already have O_APPEND, which is the equivalent to append to
the write pointer.  The only interesting addition is that we also want
to report where we wrote.  So I'd rather have RWF_REPORT_OFFSET or so.

> Just to clarify, here was my thinking for zonefs:
> 1) file open with O_APPEND/aio has RWF_APPEND: then it is OK to assume that 
> the
> application did not set the aio offset since APPEND means offset==file size. 
> In
> that case, do zone append and report back the written offset.

Yes.

> 2) file open without O_APPEND/aio does not have RWF_APPEND: the
> application specified an aio offset and we must respect it, write in
> that exact same order, so use regular writes.

Yes.

> 
> For regular file systems, with case (1) condition, the FS uses
> whatever it wants for the implementation, and reports back the written
> offset, which will always be the file size at the time the aio was
> issued.

Yes. 

> 
> Your method with a new flag to switch between (1) and (2) is OK with
> me, but the "no short writes" may be difficult to achieve in a regular
> FS, no ? I do not think current FSes have such guarantees... Especially
> in the case of buffered async writes I think.

I'll have to check what Jens recently changed with io_uring, but
traditionally the only short write case for a normal file system is the
artificial INT_MAX limit for the write size.

> > Anything with those semantics can be implemented using Zone Append
> > trivially in zonefs, and we don't even need the exclusive lock in that
> > case.  But even without that flag anything that has an exclusive lock can
> > at least in theory be implemented using Zone Append, it just needs
> > support for submitting another request from the I/O completion handler
> > of the first.  I just don't think it is worth it - with the exclusive
> > lock we do have access to the zone serialized so a normal write works
> > just fine.  Both for the sync and async case.
> 
> We did switch to have zonefs do append writes in the sync case,
> always. Hmmm... Not sure anymore it was such a good idea.

It might be a good idea as long as we have the heavy handed scheduler
locking.  But if we don't have that there doesn't seem to be much of
a reason for zone append.

> OK. Makes sense. That said, taking Naohiro's work on btrfs as an
> example, zone append is used for every data write, no matter if it is
> O_APPEND/RWF_APPEND or not. The size limitation for zone append writes
> is not needed at all by applications. Maximum extent size is aligned to
> the max append write size internally, and if the application issued a
> larger write, it loops over multiple extents, similarly to any regular
> write may do (if there is overwrite etc...).

True, we probably don't need the limit for a normal file system.

> For the regular FS case, my thinking on the semantic really was: if
> asked to do so, return the written offset for a RWF_APPEND aios. And I
> think that implementing that does not depend in any way on what the FS
> does internally.

Exactly.

> 
> But I think I am starting to see the picture you are drawing here:
> 1) Introduce a fcntl() to get "maximum size for atomic append writes"
> 2) Introduce an aio flag specifying "Do atomic append write and report written
> offset"

I think we just need the 'report written offset flag', in fact even for
zonefs with the right locking we can handle unlimited write sizes, just
with lower performance.  E.g.

1) check if the write size is larger than the zone append limit

if no:

 - take the shared lock  and issue a zone append, done

if yes:

 - take the exclusive per-inode (zone) lock and just issue either normal
   writes or zone append at your choice, relying on the lock to
   serialize other writers.  For the async case this means we need a
   lock that can be released in a different context than it was acquired,
   which is a little ugly but can be done.
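
As a sketch, assuming zonefs-style per-inode state (all names here are
illustrative, this is not actual zonefs code):

static ssize_t zonefs_write(struct inode *inode, struct iov_iter *from)
{
	struct zonefs_inode_info *zi = ZONEFS_I(inode);
	ssize_t ret;

	if (iov_iter_count(from) <= zi->i_max_zone_append_size) {
		/* fits in one zone append: atomic, shared lock suffices */
		down_read(&zi->i_zone_lock);
		ret = zonefs_issue_zone_append(inode, from);
		up_read(&zi->i_zone_lock);
	} else {
		/* too large: exclude all other writers for the duration */
		down_write(&zi->i_zone_lock);
		ret = zonefs_issue_regular_writes(inode, from);
		up_write(&zi->i_zone_lock);
	}
	return ret;
}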

> And the implementation is actually completely free to use zone append
> writes or regular writes regardless of the "Do atomic append write and
> report written offset" being used or not.

Yes, that was my point about keeping the semantics and implementation
separate.


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread h...@infradead.org
On Fri, Jul 31, 2020 at 09:34:50AM +, Damien Le Moal wrote:
> Sync writes are done under the inode lock, so there cannot be other
> writers at the same time. And for the sync case, since the actual
> written offset is necessarily equal to the file size before the write,
> there is no need to report it (there is no system call that can report
> that anyway). For this sync case, the only change that the use of zone
> append introduces compared to regular writes is the potential for more
> short writes.
> 
> Adding a flag for "report the actual offset for appending writes" is
> fine with me, but do you also mean to use this flag for driving zone
> append write vs regular writes in zonefs ?

Let's keep semantics and implementation separate.  For the case
where we report the actual offset we need a size limitation and no
short writes.

Anything with those semantics can be implemented using Zone Append
trivially in zonefs, and we don't even need the exclusive lock in that
case.  But even without that flag anything that has an exclusive lock can
at least in theory be implemented using Zone Append, it just needs
support for submitting another request from the I/O completion handler
of the first.  I just don't think it is worth it - with the exclusive
lock we do have access to the zone serialized so a normal write works
just fine.  Both for the sync and async case.

> The fcntl or ioctl for getting the max atomic write size would be fine too.
> Given that zonefs is very close to the underlying zoned drive, I was assuming
> that the application can simply consult the device sysfs zone_append_max_bytes
> queue attribute.

For zonefs we can, yes.  But in many ways that is a lot more cumbersome
than having an API that works on the fd you want to write on.

> For regular file systems, this value would be used internally only. I
> do not really see how it can be useful to applications. Furthermore,
> the file system may have a hard time giving that information to the
> application depending on its underlying storage configuration (e.g.
> erasure coding/declustered RAID).

File systems might have all kinds of limits of their own (e.g. extent
sizes).  And a good API that just works everywhere and is properly
documented is much better than heaps of cargo culted crap all over
applications.


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread h...@infradead.org
On Fri, Jul 31, 2020 at 08:14:22AM +, Damien Le Moal wrote:
> 
> > This was one of the reason why we chose to isolate the operation by a
> > different IOCB flag and not by IOCB_APPEND alone.
> 
> For zonefs, the plan is:
> * For the sync write case, zone append is always used.
> * For the async write case, if we see IOCB_APPEND, then zone append BIOs are
> used. If not, regular write BIOs are used.
> 
> Simple enough I think. No need for a new flag.

Simple, but wrong.  Sync vs async really doesn't matter, even sync
writes will have problems if there are other writers.  We need a flag
for "report the actual offset for appending writes", and based on that
flag we need to not allow short writes (or split extents for real
file systems).  We also need a fcntl or ioctl to report this max atomic
write size so that applications can rely on it.
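
Purely to illustrate the shape of that interface (the command name and
value are invented, nothing like it exists):

#define F_GET_APPEND_LIMIT	(F_LINUX_SPECIFIC_BASE + 32)	/* invented */

	long limit = fcntl(fd, F_GET_APPEND_LIMIT);
	if (limit > 0)
		/* writes up to limit bytes append atomically, no short writes */
		chunk = chunk < (size_t)limit ? chunk : (size_t)limit;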


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread h...@infradead.org
On Fri, Jul 31, 2020 at 06:42:10AM +, Damien Le Moal wrote:
> > - We may not be able to use RWF_APPEND, and need exposing a new
> > type/flag (RWF_INDIRECT_OFFSET etc.) user-space. Not sure if this
> > sounds outrageous, but is it OK to have uring-only flag which can be
> > combined with RWF_APPEND?
> 
> Why ? Where is the problem ? O_APPEND/RWF_APPEND is currently
> meaningless for raw block device accesses. We could certainly define a
> meaning for these in the context of zoned block devices.

We can't just add a meaning for O_APPEND on block devices now,
as it was previously silently ignored.  I also really don't think any
of these semantics even fit the block device to start with.  If you
want to work on raw zones use zonefs, that's what it exists for.


Re: [v6 PATCH] RISC-V: Remove unsupported isa string info print

2019-10-07 Thread h...@infradead.org
On Wed, Oct 02, 2019 at 06:28:59AM +, Atish Patra wrote:
> On Wed, 2019-10-02 at 09:53 +0800, Alan Kao wrote:
> > On Tue, Oct 01, 2019 at 03:10:16AM -0700, h...@infradead.org wrote:
> > > On Tue, Oct 01, 2019 at 08:22:37AM +, Atish Patra wrote:
> > > > riscv_of_processor_hartid() seems to be a better candidate. We
> > > > already check if "rv" is present in isa string or not. I will
> > > > extend that to check for rv64i or rv32i. Is that okay ?
> > > 
> > > I'd rather lift the checks out of that into a function that is
> > > called exactly once per hart on boot (and future cpu hotplug).
> > 
> @Christoph
> Do you mean to lift the checks for "rv" as well from
> riscv_of_processor_hartid, or leave that as it is?

Sounds good to me (as a separate patch).  Again it makes much more
sense to validate this once early at boot time rather than in a
function that can be called many times during run time.


Re: [v6 PATCH] RISC-V: Remove unsupported isa string info print

2019-10-01 Thread h...@infradead.org
On Tue, Oct 01, 2019 at 08:22:37AM +, Atish Patra wrote:
> riscv_of_processor_hartid() seems to be a better candidate. We
> already check if "rv" is present in isa string or not. I will extend
> that to check for rv64i or rv32i. Is that okay ?

I'd rather lift the checks out of that into a function that is called
exactly once per hart on boot (and future cpu hotplug).
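
Something like the following, as a sketch of validating once at hart
bringup (illustrative only, not actual kernel code):

static int __init riscv_hart_isa_supported(struct device_node *node)
{
	const char *isa;

	if (of_property_read_string(node, "riscv,isa", &isa))
		return -ENODEV;
	/* accept only the base ISA matching this kernel build */
	if (IS_ENABLED(CONFIG_64BIT))
		return strncmp(isa, "rv64i", 5) ? -ENODEV : 0;
	return strncmp(isa, "rv32i", 5) ? -ENODEV : 0;
}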


Re: [RFC PATCH 0/2] Add support for SBI version to 0.2

2019-09-16 Thread h...@infradead.org
On Fri, Sep 13, 2019 at 08:54:27AM -0700, Palmer Dabbelt wrote:
> On Tue, Sep 3, 2019 at 12:38 AM h...@infradead.org  wrote:
> 
> > On Fri, Aug 30, 2019 at 11:13:25PM +, Atish Patra wrote:
> > > If I understood you clearly, you want to call it legacy in the spec and
> > > just say v0.1 extensions.
> > > 
> > > The whole idea of marking them as legacy extensions to indicate that it
> > > would be obsolete in the future.
> > > 
> > > But I am not too worried about the semantics here. So I am fine with
> > > just changing the text to v0.1 if that avoids confusion.
> >
> > So my main problem is that we are lumping all the "legacy" extensions
> > together.  While some of them are simply a bad idea and shouldn't
> > really be implemented for anything new ever, others like the sfence.vma
> > and ipi ones are needed until we have hardware support to avoid them
> > and possibly forever for virtualization.
> >
> > So either we use different markers of legacy for them, or we at least
> > define new extensions that replace them at the same time.  What I
> > want to avoid is the possibility of an implementation using the really
> > legacy bits and new extensions at the same time.
> >
> 
> Nominally we've got to replace these as well because we didn't include
> the length of the hart mask. 

Well, let's do that as part of defining the first real post-0.1
SBI then, and don't bother defining the old ones as legacy at all.

Just two different specs that don't interact except that we reserve
extension space in the new one for the old one so that one SBI spec
can implement both.


Re: [v5 PATCH] RISC-V: Fix unsupported isa string info.

2019-09-10 Thread h...@infradead.org
On Fri, Sep 06, 2019 at 11:27:57PM +, Atish Patra wrote:
> > Agreed. Maybe something like this ?
> > 
> > Let's say f/d is enabled in kernel but cpu doesn't support it.
> > "unsupported isa" will only appear if there are any unsupported isa.
> > 
> > processor   : 3
> > hart: 4
> > isa : rv64imac
> > unsupported isa : fd
> > mmu : sv39
> > uarch   : sifive,u54-mc
> > 
> > Maybe I am just trying to over-optimize one corner case :) :).
> > /proc/cpuinfo should just print all the isa string. That's it.
> > 
> 
> Ping ?

Yes, I agree with the "dumb" reporting of all capabilities.


Re: [RFC PATCH 0/2] Add support for SBI version to 0.2

2019-09-03 Thread h...@infradead.org
On Fri, Aug 30, 2019 at 11:13:25PM +, Atish Patra wrote:
> If I understood you clearly, you want to call it legacy in the spec and
> just say v0.1 extensions.
> 
> The whole idea of marking them as legacy extensions to indicate that it
> would be obsolete in the future.
> 
> But I am not too worried about the semantics here. So I am fine with
> just changing the text to v0.1 if that avoids confusion.

So my main problem is that we are lumping all the "legacy" extensions
together.  While some of them are simply a bad idea and shouldn't
really be implemented for anything new ever, others like the sfence.vma
and ipi ones are needed until we have hardware support to avoid them
and possibly forever for virtualization.

So either we use different markers of legacy for them, or we at least
define new extensions that replace them at the same time.  What I
want to avoid is the possibility of an implementation using the really
legacy bits and new extensions at the same time.


Re: [RFC PATCH 0/2] Add support for SBI version to 0.2

2019-08-29 Thread h...@infradead.org
On Tue, Aug 27, 2019 at 10:19:42PM +, Atish Patra wrote:
> I did not understand this part. All the legacy SBI calls are defined as
> a separate extension ID not single extension. How did it break the
> backward compatibility ?

Yes, sorry, I misread this.  The way it is defined is rather
non-intuitive, but actually backwards compatible.

> I think the confusion is because of legacy renaming. They are not
> single legacy extension. They are all separate extensions. The spec
> just called all those extensions as collectively as legacy. So I just
> tried to make the patch sync with the spec.
> 
> If that's the source of confusion, I can rename it to sbi_0.1_x in
> stead of legacy.

I think we actually need to fix the spec instead, even if it is just the
naming and not the mechanism.

> >  (1) actually board specific and have no place in a cpu abstraction
> >  layer: getchar/putchar, these should just never be advertised in
> >  a non-legacy setup, and the drivers using them should not probe
> >  on a sbi 0.2+ system
> 
> In that case, we have to update the drivers (earlycon-riscv-sbi &
> hvc_riscv_sbi) in kernel as well. Once these patches are merged, nobody
> will be able to use earlycon=sbi feature in mainline kernel.
> 
> Personally, I am fine with it. But there were some interest during
> RISC-V workshop in keeping these for now for easy debugging and early
> bringup.

The getchar/putchar calls unfortunately are fundamentally flawed, as
they mean the sbi can still access the console after the host has taken
it over using its own drivers.  Which will lead to bugs sooner or later.

And if you can bring up a console driver in opensbi it should be just
as trivial to bring up the kernel version.


Re: [RFC PATCH 1/2] RISC-V: Mark existing SBI as legacy SBI.

2019-08-29 Thread h...@infradead.org
On Tue, Aug 27, 2019 at 08:37:27PM +, Atish Patra wrote:
> That would split the implementation between C file & assembly file for
> no good reason.
> 
> How about moving everything in sbi.c and just write everything inline
> assembly there.

Well, if we implement it in pure assembly that would be the entire
implementation, wouldn't it?


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-21 Thread h...@infradead.org
Btw, for the next version it also might make sense to do one
optimization at a time.  E.g. the empty cpumask one as the
first patch, the local cpu directly one next, and the threshold
based full flush as a third.


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-21 Thread h...@infradead.org
On Tue, Aug 20, 2019 at 08:29:47PM +, Atish Patra wrote:
> Sounds good to me. Christoph already has mm/tlbflush.c in his mmu
> series. I will rebase on top of it.

It wasn't really intended for the nommu series but for the native
clint prototype.  But the nommu series grew so many cleanups and
micro-optimizations by now that I think I'll split those into
another prep series and will include a version of this.


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-21 Thread h...@infradead.org
On Wed, Aug 21, 2019 at 09:22:48AM +0530, Anup Patel wrote:
> I agree that IPI mechanism should be standardized for RISC-V but I
> don't support the idea of mandating CLINT as part of the UNIX
> platform spec. For example, the AndesTech SOC does not use CLINT
> instead they have PLMT for per-HART timer and PLICSW for per-HART
> IPIs.

The point is not really mandating a CLINT as we know it right now.  The
point is to mandate one way to issue IPIs from S-mode to S-mode,
one way to read the time counter and one way to write the timer
threshold.


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-20 Thread h...@infradead.org
On Wed, Aug 21, 2019 at 09:29:22AM +0800, Alan Kao wrote:
> IMHO, this approach should be avoided because CLINT is compatible to but
>  not mandatory in the privileged spec.  In other words, it is possible that
> a Linux-capable RISC-V platform does not contain a CLINT component but
> rely on some other mechanism to deal with SW/timer interrupts.

Hi Alan,

at this point the above is just a prototype showing the performance
improvement if we can inject IPIs and timer interrupts directly from
S-mode and delivered directly to S-mode.  It is based on a copy of
the clint IPI block currently used by SiFive, qemu, Ariane and Kendryte.

If the experiment works out (which I think it does), I'd like to
define interfaces for the unix platform spec to make something like
this available.  My current plan for that is to have one DT node each
for the IPI registers, the timer cmp and the time val registers
as MMIO regions.  This would fit the current clint block but also
allow other register layouts.  Is that something you'd be fine with?
If not do you have another proposal?  (note that eventually the
discussion should move to the unix platform spec list, but now that
I have you here we can at least brainstorm a bit).


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-20 Thread h...@infradead.org
On Tue, Aug 20, 2019 at 08:28:36PM +, Atish Patra wrote:
> > http://git.infradead.org/users/hch/riscv.git/commitdiff/ea4067ae61e20fcfcf46a6f6bd1cc25710ce3afe
> 
> This does seem a lot cleaner to me. We can reuse some of the code for
> this patch as well. Based on NATIVE_CLINT configuration, it will send
> an IPI or SBI call.
> 
> I can rebase my patch on top of yours and I can send it together or you
> can include in your series.
> 
> Let me know your preference.

I think the native clint for S-mode will need more discussion, so you
should not wait for it.


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-20 Thread h...@infradead.org
On Tue, Aug 20, 2019 at 08:42:19AM +, Atish Patra wrote:
> cmask NULL is a pretty common case and we would be unnecessarily
> executing a bunch of instructions every time while not saving much. Kernel
> still have to make an SBI call and OpenSBI is doing a local flush
> anyways.
> 
> Looking at the code again, I think we can just use cpumask_weight and
> do local tlb flush only if local cpu is the only cpu present. 
> 
> Otherwise, it will just fall through and call sbi_remote_sfence_vma().

Maybe it is just time to split the different cases at a higher level.
The idea to multiplex everything onto a single function always seemed
odd to me.

FYI, here is what I do for the IPI based tlbflush for the native S-mode
clint prototype, which seems much easier to understand:

http://git.infradead.org/users/hch/riscv.git/commitdiff/ea4067ae61e20fcfcf46a6f6bd1cc25710ce3afe


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-20 Thread h...@infradead.org
On Tue, Aug 20, 2019 at 09:14:58AM +0200, Andreas Schwab wrote:
> On Aug 19 2019, "h...@infradead.org"  wrote:
> 
> > This looks a little odd to me and assumes we never pass a size smaller
> > than PAGE_SIZE.  While that is probably true, why not something like:
> >
> > if (size < PAGE_SIZE && size != -1)
> 
> ITYM size <= PAGE_SIZE.  And since size is unsigned it cannot be == -1
> at the same time.

Yes, the <= was obvious, that's what you get for hacking up a demo
patch on the plane.  And true for the -1.  That being said I find the
-1 convention rather annoying, a ULONG_MAX in the callers would be
a lot more obvious.
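
With both fixes folded in, the demo snippet would read:

	if (size <= PAGE_SIZE && size != ULONG_MAX)
		local_flush_tlb_page(start);
	else
		local_flush_tlb_all();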


Re: [v2 PATCH] RISC-V: Optimize tlb flush path.

2019-08-19 Thread h...@infradead.org
On Mon, Aug 19, 2019 at 05:47:35PM -0700, Atish Patra wrote:
> In RISC-V, tlb flush happens via SBI which is expensive.
> If the target cpumask contains a local hartid, some cost
> can be saved by issuing a local tlb flush as we do that
> in OpenSBI anyways. There is also no need of SBI call if
> cpumask is empty.
> 
> Do a local flush first if current cpu is present in cpumask.
> Invoke SBI call only if target cpumask contains any cpus
> other than local cpu.

Btw, you can use up your 70-ish chars per line for commit logs..

> + if (cpumask_test_cpu(cpuid, cmask)) {
> + /* Save trap cost by issuing a local tlb flush here */
> + if ((start == 0 && size == -1) || (size > PAGE_SIZE))
> + local_flush_tlb_all();
> + else if (size == PAGE_SIZE)
> + local_flush_tlb_page(start);
> + }

This looks a little odd to me and assumes we never pass a size smaller
than PAGE_SIZE.  While that is probably true, why not something like:

if (size < PAGE_SIZE && size != -1)
local_flush_tlb_page(start);
else
local_flush_tlb_all();

?


Re: [PATCH] RISC-V: Issue a local tlb flush if possible.

2019-08-19 Thread h...@infradead.org
On Mon, Aug 19, 2019 at 08:39:02PM +0530, Anup Patel wrote:
> If we were using ASID then yes we don't need to flush anything
> but currently we don't use ASID due to lack of HW support and
> HW can certainly do speculative page table walks so flushing
> local TLB when MM mask is empty might help.
> 
> This is just my theory and we need to stress test more.

Well, when we context switch away from a mm we always flush the
local tlb.  So either the mm_struct has never been scheduled in,
or we already did a local_tlb_flush when we context switched it out.


Re: [PATCH] RISC-V: Issue a local tlb flush if possible.

2019-08-19 Thread h...@infradead.org
On Thu, Aug 15, 2019 at 08:37:04PM +, Atish Patra wrote:
> We get ton of them. Here is the stack dump.

Looks like we might not need to flush anything at all here as the
mm_struct was never scheduled to run on any cpu?


Re: [v5 PATCH] RISC-V: Fix unsupported isa string info.

2019-08-18 Thread h...@infradead.org
On Fri, Aug 16, 2019 at 07:21:52PM +, Atish Patra wrote:
> > > + if (isa[0] != '\0') {
> > > + /* Add remaining isa strings */
> > > + for (e = isa; *e != '\0'; ++e) {
> > > +#if !defined(CONFIG_VIRTUALIZATION)
> > > + if (e[0] != 'h')
> > > +#endif
> > > + seq_write(f, e, 1);
> > > + }
> > > + }
> > 
> > This one I don't get.  Why do we want to check CONFIG_VIRTUALIZATION?
> > 
> 
> If CONFIG_VIRTUALIZATION is not enabled, it shouldn't print that
> hypervisor extension "h" in isa extensions.

CONFIG_VIRTUALIZATION doesn't change anything in the kernel's
capabilities, it just enables other config options.  But more
importantly the 'h' extension is only relevant for S-mode software
anyway.

> This is just an information to the userspace that some of the mandatory
> ISA extensions ("mafdcsu") are not supported in kernel which may lead
> to undesirable results.

I think we need to sit down decide what the purpose of /proc/cpuinfo
is.  IIRC on other architectures it just prints what the hardware
supports, not what you can actually make use of.  How else would you
find out that you'd need to enable more kernel options to fully
utilize the hardware?

Also printing this warning to the kernel log when someone reads the
procfs file is very strange.


Re: [PATCH] RISC-V: Issue a local tlb flush if possible.

2019-08-13 Thread h...@infradead.org
On Tue, Aug 13, 2019 at 12:15:15AM +, Atish Patra wrote:
> I thought that if it receives an empty cpumask, then doing at least a
> local flush is the expected behavior. Are you saying that we should
> just skip all and return ? 

How could we ever receive an empty cpu mask?  I think it could only
be empty without the current cpu.  At least that is my reading of
the callers and a few other implementations.

> > if (!cpumask || cpumask_test_cpu(cpu, cpumask)) {
> > if ((start == 0 && size == -1) || size > PAGE_SIZE)
> > local_flush_tlb_all();
> > else if (size == PAGE_SIZE)
> > local_flush_tlb_page(start);
> > cpumask_clear_cpu(cpuid, cpumask);
> 
> This will modify the original cpumask which is not correct. The clear
> part has to be done on hmask to avoid a copy.

Indeed.  But looking at the x86 tlbflush implementation it seems like we
could use cpumask_any_but() to avoid having to modify the passed in
cpumask.
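
Roughly like this, using the riscv helpers of that era (a sketch only,
not the merged code):

static void __flush_tlb_range(const struct cpumask *cmask,
			      unsigned long start, unsigned long size)
{
	unsigned int cpu = get_cpu();

	if (!cmask)
		cmask = cpu_online_mask;

	if (cpumask_test_cpu(cpu, cmask)) {
		/* flush locally instead of trapping into the SBI */
		if ((start == 0 && size == -1) || size > PAGE_SIZE)
			local_flush_tlb_all();
		else
			local_flush_tlb_page(start);
	}

	/* cpumask_any_but() finds a remote cpu without modifying cmask */
	if (cpumask_any_but(cmask, cpu) < nr_cpu_ids) {
		struct cpumask hmask;

		riscv_cpuid_to_hartid_mask(cmask, &hmask);
		cpumask_clear_cpu(cpuid_to_hartid_map(cpu), &hmask);
		sbi_remote_sfence_vma(cpumask_bits(&hmask), start, size);
	}
	put_cpu();
}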


Re: [Xen-devel] [RFC] virtio_ring: check dma_mem for xen_domain

2019-01-28 Thread h...@infradead.org
On Fri, Jan 25, 2019 at 09:45:26AM +, Peng Fan wrote:
> Just have a question, 
> 
> Since vmalloc_to_page is ok for cma area, no need to take cma and per device
> cma into consideration right? 

The CMA area itself is a physical memory region.  If it is a non-highmem
region you can call virt_to_page on the virtual addresses for it.  If
it is in highmem it doesn't even have a kernel virtual address by
default.

> we only need to implement a piece of code to handle a per device
> specific region using RESERVEDMEM_OF_DECLARE, just like:
> RESERVEDMEM_OF_DECLARE(rpmsg-dma, "rpmsg-dma-pool", rmem_rpmsg_dma_setup);
> And implement the device_init callback and build a map between page and
> phys. Then in the rpmsg driver, the scatter list could use the page
> structure, no need for vmalloc_to_page for per device dma.
> 
> Is this the right way?

I think this should work fine.  If you have the cycles for it I'd
actually love to be able to have generic CMA DT glue for non DMA API
driver allocations, as there obviously is a need for it.  So basically
the same as above, just added to kernel/cma.c as a generic API.
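
Roughly along these lines; the rmem_rpmsg_* names come from the quoted
proposal, and the callback body is only a placeholder:

#include <linux/of_reserved_mem.h>

static int rmem_rpmsg_dma_device_init(struct reserved_mem *rmem,
				      struct device *dev)
{
	/* hand rmem->base / rmem->size to the rpmsg driver here and
	 * build the page <-> phys mapping it needs */
	return 0;
}

static const struct reserved_mem_ops rmem_rpmsg_dma_ops = {
	.device_init	= rmem_rpmsg_dma_device_init,
};

static int __init rmem_rpmsg_dma_setup(struct reserved_mem *rmem)
{
	rmem->ops = &rmem_rpmsg_dma_ops;
	return 0;
}
RESERVEDMEM_OF_DECLARE(rpmsg_dma, "rpmsg-dma-pool", rmem_rpmsg_dma_setup);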


Re: [Xen-devel] [RFC] virtio_ring: check dma_mem for xen_domain

2019-01-23 Thread h...@infradead.org
On Wed, Jan 23, 2019 at 01:04:33PM -0800, Stefano Stabellini wrote:
> If vring_use_dma_api is actually supposed to return true when
> dma_dev->dma_mem is set, then both Peng's patch and the patch I wrote
> are not fixing the real issue here.
> 
> I don't know enough about remoteproc to know where the problem actually
> lies though.

The problem is the following:

Devices can declare a specific memory region that they want to use when
the driver calls dma_alloc_coherent for the device, this is done using
the shared-dma-pool DT attribute, which comes in two variants that
would be a little too much to explain here.

remoteproc makes use of that because apparently the device can
only communicate using that region.  But it then feeds back memory
obtained with dma_alloc_coherent into the virtio code.  For that
it calls vmalloc_to_page on the dma_alloc_coherent, which is a huge
no-go for the ĐMA API and only worked accidentally on a few platform,
and apparently arm64 just changed a few internals that made it stop
working for remoteproc.

The right answer is to not use the DMA API to allocate memory from
a device-specific region, but to tie the driver directly into the
DT reserved memory API in a way that allows it to easily obtain
a struct device for it.

This is orthogonal to another issue, and that is that hardware
virtio devices really always need to use the DMA API, otherwise
we'll bypass such features as the device specific DMA pools,
DMA offsets, cache flushing, etc, etc.


Re: [Xen-devel] [RFC] virtio_ring: check dma_mem for xen_domain

2019-01-22 Thread h...@infradead.org
On Tue, Jan 22, 2019 at 11:59:31AM -0800, Stefano Stabellini wrote:
> > if (!virtio_has_iommu_quirk(vdev))
> > return true;
> >  
> > @@ -260,7 +262,7 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
> >  * the DMA API if we're a Xen guest, which at least allows
> >  * all of the sensible Xen configurations to work correctly.
> >  */
> > -   if (xen_domain())
> > +   if (xen_domain() && !dma_dev->dma_mem)
> > return true;
> >  
> > return false;
> 
> I can see you spotted a real issue, but this is not the right fix. We
> just need something a bit more flexible than xen_domain(): there are
> many kinds of Xen domains on different architectures, we basically want
> to enable this (return true from vring_use_dma_api) only when the xen
> swiotlb is meant to be used. Does the appended patch fix the issue you
> have?

The problem generally is the other way around - if dma_dev->dma_mem
is set the device description in the device tree explicitly requires
using this memory, so we must _always_ use the DMA API.

The problem is just that the rproc driver abuses the DMA API
in horrible ways.


Re: [RFC] virtio_ring: check dma_mem for xen_domain

2019-01-21 Thread h...@infradead.org
On Mon, Jan 21, 2019 at 04:51:57AM +, Peng Fan wrote:
> on i.MX8QM, M4_1 is communicating with DomU using rpmsg with a fixed
> address as the dma mem buffer which is predefined.
> 
> Without this patch, the flow is:
> vring_map_one_sg -> vring_use_dma_api
>  -> dma_map_page
>  -> __swiotlb_map_page
>   -> swiotlb_map_page
>   -> __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
> However we are using per device dma area for rpmsg, phys_to_virt
> could not return a correct virtual address for virtual address in
> vmalloc area. Then kernel panic.

And that is the right thing to do.  You must not call dma_map_* on
memory that was allocated from dma_alloc_*.

We actually have another thread which appears to be for this same issue.


Re: [PATCH v4] RISC-V: defconfig: Enable Generic PCIE by default

2019-01-15 Thread h...@infradead.org
On Tue, Jan 15, 2019 at 05:45:34PM +, Alistair Francis wrote:
> > Alistair, is there a ready-made qemu machine with riscv + pcie?
> > That would be really useful for testing..
> 
> Yep, if you use master QEMU (it isn't in a release yet) the virt
> machine has PCIe support.

Cool.  Time to play around with the latest qemu tree..


Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-08 Thread h...@infradead.org
On Thu, May 05, 2016 at 09:39:14PM +, Verma, Vishal L wrote:
> How is it any 'less direct'? All it does now is follow the blockdev
> O_DIRECT path. There still isn't any page cache involved..

It's still more overhead than the plain DAX I/O path.


Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io

2016-05-08 Thread h...@infradead.org
On Thu, May 05, 2016 at 09:45:07PM +, Verma, Vishal L wrote:
> I'm not sure I completely understand how this will work? Can you explain
> a bit? Would we have to export rw_bytes up to layers above the pmem
> driver? Where does get_user_pages come in?

A DAX filesystem can directly use the nvdimm layer the same way btt
does, what's the problem?

Re get_user_pages my idea was to simply use that to lock down the user
pages so that we can call rw_bytes on it.  How else would you do it?  Do
a kmalloc, copy_from_user and then another memcpy?
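
As a rough sketch of that idea with 2016-era interfaces (the
nvdimm_read/write_bytes helpers wrap the namespace ->rw_bytes; assumes
the user buffer is page-aligned and fits in one page, error handling
and loops elided):

static int dax_rw_user_page(struct nd_namespace_common *ndns,
			    void __user *ubuf, resource_size_t dev_off,
			    size_t len, bool is_write)
{
	struct page *page;
	void *kaddr;
	int rc;

	/* pin the user page so it cannot move while rw_bytes runs */
	if (get_user_pages_fast((unsigned long)ubuf, 1, !is_write, &page) != 1)
		return -EFAULT;

	kaddr = kmap(page);
	if (is_write)
		rc = nvdimm_write_bytes(ndns, dev_off, kaddr, len);
	else
		rc = nvdimm_read_bytes(ndns, dev_off, kaddr, len);
	kunmap(page);
	put_page(page);
	return rc;
}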


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread h...@infradead.org
On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> - Application hits EIO doing dax_IO or load/store io
> 
> - It checks badblocks and discovers it's files have lost data
> 
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
>    * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector

This sounds like a mess.

> I think if we want to keep allowing arbitrary alignments for the
> dax_do_io path, we'd need:
> 1. To represent badblocks at a finer granularity (likely cache lines)
> 2. To allow the driver to do IO to a *block device* at sub-sector
> granularity

It's not a block device if it supports DAX.  It's byte addressable
memory masquerading as a block device.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread h...@infradead.org
On Mon, Apr 25, 2016 at 11:32:08AM -0400, Jeff Moyer wrote:
> > EINVAL is a concern here.  Not due to the right error being reported,
> > but because it means your current scheme is fundamentally broken - we
> > need to support I/O at any alignment for DAX I/O, and not fail due to
> > alignment concerns for a highly specific degraded case.
> >
> > I think this whole series needs to go back to the drawing board as I
> > don't think it can actually rely on using direct I/O as the EIO
> > fallback.
> 
> The only callers of dax_do_io are direct_IO methods.

They are because the DAX I/O path is a mess, but that doesn't mean
the user necessarily specified O_DIRECT on the open.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread h...@infradead.org
On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
> to some allocation failing, and I thought we should return the original
> -EIO in such cases so that the application doesn't lose the information
> that the bad block is actually causing the error.

EINVAL is a concern here.  Not due to the right error being reported,
but because it means your current scheme is fundamentally broken - we
need to support I/O at any alignment for DAX I/O, and not fail due to
alignment concerns for a highly specific degraded case.

I think this whole series needs to go back to the drawing board as I
don't think it can actually rely on using direct I/O as the EIO
fallback.



Re: [PATCH 0/4] Drivers: scsi: storvsc: Fix miscellaneous issues

2014-12-30 Thread h...@infradead.org
On Mon, Dec 29, 2014 at 09:07:59PM +, KY Srinivasan wrote:
> Should I be resending these patches.

I don't need a resend, I need a review for the patches.  Note that for
driver patches I'm also fine with a review from a co worker, as long as
it's a real review not just a rubber stamp.

Talking about process:  for the next submission please only use
"storvsc:  " as the subject prefix, I had to remove "Drivers: scsi: "
which doesn't add useful information, every time so far.


Re: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout

2014-07-18 Thread h...@infradead.org
On Fri, Jul 18, 2014 at 10:12:38AM -0700, h...@infradead.org wrote:
> This is what I plan to put in after it passes basic testing:

And that one was on top of my previous version.  One that applies
against core-for-3.17 below:

---
From 8a79783e5f72ec034a724e16c1f46604bd97bf68 Mon Sep 17 00:00:00 2001
From: "K. Y. Srinivasan" 
Date: Fri, 18 Jul 2014 17:11:27 +0200
Subject: sd: fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O
 timeout

Commit ID: 7e660100d85af860e7ad763202fff717adcdaacd added code to derive the
FLUSH_TIMEOUT from the basic I/O timeout. However, this patch did not use the
basic I/O timeout of the device. Fix this bug.

Signed-off-by: K. Y. Srinivasan 
Reviewed-by: James Bottomley 
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 377a520..9ffb393 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -880,7 +880,7 @@ static int sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
 	cmd->transfersize = 0;
 	cmd->allowed = SD_MAX_RETRIES;
 
-	rq->timeout *= SD_FLUSH_TIMEOUT_MULTIPLIER;
+	rq->timeout = rq->q->rq_timeout * SD_FLUSH_TIMEOUT_MULTIPLIER;
 	return BLKPREP_OK;
 }
 
-- 
1.9.1



Re: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout

2014-07-18 Thread h...@infradead.org
This is what I plan to put in after it passes basic testing:

---
From bb617c9465b839d70ecbbc69002a20ccf5f935bd Mon Sep 17 00:00:00 2001
From: "K. Y. Srinivasan" 
Date: Fri, 18 Jul 2014 19:12:58 +0200
Subject: sd: fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O
 timeout

Commit ID: 7e660100d85af860e7ad763202fff717adcdaacd added code to derive the
FLUSH_TIMEOUT from the basic I/O timeout. However, this patch did not use the
basic I/O timeout of the device. Fix this bug.

Signed-off-by: K. Y. Srinivasan 
Reviewed-by: James Bottomley 
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bef4e78..9ffb393 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -880,7 +880,7 @@ static int sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = 0;
cmd->allowed = SD_MAX_RETRIES;
 
-   rq->timeout = SD_TIMEOUT * SD_FLUSH_TIMEOUT_MULTIPLIER;
+   rq->timeout = rq->q->rq_timeout * SD_FLUSH_TIMEOUT_MULTIPLIER;
return BLKPREP_OK;
 }
 
-- 
1.9.1



Re: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout

2014-07-18 Thread h...@infradead.org
On Fri, Jul 18, 2014 at 04:57:13PM +, James Bottomley wrote:
> Actually, no you didn't.  The difference is in the derivation of the
> timeout.  Christoph's patch is absolute in terms of SD_TIMEOUT; yours is
> relative to the queue timeout setting ... I thought there was a reason
> for preferring the relative version.

Yes, KY's version is better.  It takes the base timeout drivers set
on the request queue into account.
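
To make the distinction concrete, a minimal side-by-side sketch of the
two derivations (both lines are lifted from the competing patches in
this thread; SD_TIMEOUT and SD_FLUSH_TIMEOUT_MULTIPLIER are the
existing sd.c constants):

	/* absolute: ignores whatever base timeout the LLD configured */
	rq->timeout = SD_TIMEOUT * SD_FLUSH_TIMEOUT_MULTIPLIER;

	/* relative: scales the base timeout the driver set on the
	 * queue (e.g. via blk_queue_rq_timeout()), so transports with
	 * unusual base values keep the same scaling */
	rq->timeout = rq->q->rq_timeout * SD_FLUSH_TIMEOUT_MULTIPLIER;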



Re: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout

2014-07-18 Thread Christoph Hellwig (h...@infradead.org)
On Fri, Jul 18, 2014 at 12:51:06AM +, Elliott, Robert (Server Storage) 
wrote:
> SYNCHRONIZE CACHE (16) should be favored over SYNCHRONIZE 
> CACHE (10) unless SYNCHRONIZE CACHE (10) is not supported.

I guess you mean (16) for the last occurrence?  What's the benefit of
using SYNCHRONIZE CACHE (16) if we don't pass an LBA range?
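
For reference, the only difference between the two commands here is the
opcode and the width of the LBA/count fields; with both fields zeroed
either one means "flush the whole cache".  A sketch (SYNCHRONIZE_CACHE
is the scsi.h name for the 10-byte opcode; the 16-byte variant may not
have a define, so the raw opcode is shown):

	cmd->cmnd[0] = SYNCHRONIZE_CACHE;	/* 0x35, 10-byte CDB, 32-bit LBA */
	cmd->cmnd[0] = 0x91;			/* SYNCHRONIZE CACHE (16), 64-bit LBA */

So without an LBA range the (16) variant buys us nothing.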



Re: [PATCH 1/1] [SCSI] Fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout

2014-07-18 Thread Christoph Hellwig (h...@infradead.org)
On Thu, Jul 17, 2014 at 11:53:33PM +, KY Srinivasan wrote:
> I still see this problem. There was talk of fixing it elsewhere.

Well, what we have right now is entirely broken, given that the
block layer doesn't initialize ->timeout on TYPE_FS requests.

We either need to revert that initial commit or apply something like
the attached patch as a quick fix.

From ecbf154d15f4022676219b9ff90e542d1db64392 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Fri, 18 Jul 2014 17:11:27 +0200
Subject: sd: set a non-zero timeout for flush requests

rq->timeout for TYPE_FS commands needs to be initialized by the driver,
so we can't simply multiply it.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 377a520..bef4e78 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -880,7 +880,7 @@ static int sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = 0;
cmd->allowed = SD_MAX_RETRIES;
 
-   rq->timeout *= SD_FLUSH_TIMEOUT_MULTIPLIER;
+   rq->timeout = SD_TIMEOUT * SD_FLUSH_TIMEOUT_MULTIPLIER;
return BLKPREP_OK;
 }
 
-- 
1.9.1
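
As background on the "doesn't initialize ->timeout" point: for TYPE_FS
requests rq->timeout is still zero at prep time, so the old
"rq->timeout *= SD_FLUSH_TIMEOUT_MULTIPLIER" multiplied zero by the
constant, and blk_add_timer() later silently fell back to the
queue-wide default the driver established up front, e.g.:

	/* LLD queue setup: the base value blk_add_timer() falls back
	 * to whenever rq->timeout is left at zero; 30 seconds is just
	 * an illustrative assumption, not any particular driver's value */
	blk_queue_rq_timeout(q, 30 * HZ);

The net effect was that flushes ran with the plain base timeout and the
multiplier was lost entirely.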



Re: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16

2014-07-17 Thread h...@infradead.org
On Wed, Jul 16, 2014 at 03:20:00PM -0400, Martin K. Petersen wrote:
> The block layer can only describe one contiguous block range in a
> request. My copy offload patches introduces the bi_special field that
> allows us to attach additional information to an I/O. I have
> experimented with doing that for discards to overcome the suck of DSM
> TRIM. Splitting and merging requests in MD/DM gets much more cumbersome,
> though.

I had a bunch of prototypes for this years ago that didn't really work
out.  I always made ranged discards something that a driver had to opt
in to.  In my case md and dm never opted in, although mirroring or
multipath could of course handle it fairly easily.

> It also wasn't evident that it was worth the hassle. While UNMAP allows
> us to express large regions, DSM TRIM on the SATA side is limited to 32
> MB per range. So in many cases we end up maxing out the payload capacity
> even with a single contiguous range.

That's mostly because we don't support larger than 512-byte TRIM payloads
yet..
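
(The arithmetic behind those two numbers, assuming the usual 8-byte DSM
range entries and 512-byte logical blocks: each entry is a 48-bit LBA
plus a 16-bit sector count, so one range covers at most 65535 * 512
bytes, i.e. just under 32 MB, and a single-sector payload holds
512 / 8 = 64 such entries.)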



Re: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16

2014-07-16 Thread h...@infradead.org
On Wed, Jul 16, 2014 at 01:47:35PM -0400, Martin K. Petersen wrote:
> There were several SSDs that did not want to support wearing out flash
> by writing gobs of zeroes and only support the UNMAP case.

Given that SSDs usually aren't hard provisioned anyway, that seems like
an odd decision.  But a SAS SSD would be something I'd at least expect
to properly fail these..

> I don't have a problem with a BLIST_PREFER_UNMAP flag or something like
> that. But BLIST_TRY_VPD_PAGES seems more generally useful and it does
> fix the problem at hand. That's why I went that route.

As I said I'm perfectly fine with your patch and I think we'll find more
uses for it.  I'll apply it as soon as I get a second review.


Re: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16

2014-07-16 Thread h...@infradead.org
On Wed, Jul 16, 2014 at 11:44:18AM -0400, Martin K. Petersen wrote:
> There are lots of devices out there that support WRITE SAME(10) or (16)
> without the UNMAP bit. And there are devices that support WRITE SAME w/
> UNMAP functionality but not "regular" WRITE SAME.

Oh, we actually have devices that support WRITE SAME with unmap, but not
without?  That's definitely a little strange.

> no_write_same is there to prevent the REQ_WRITE_SAME use case (for which
> we have really weak heuristics). Your patch overloads no_write_same so
> it also governs a REQ_DISCARD use case.

Yes, and it did this intentionally.  I really wouldn't expect devices
to support WRITE SAME with UNMAP but blow up on a WRITE SAME without
it (and not just simply fail it in an orderly way).

> My proposed black list patch fixes the hyperv discard issue. So I don't
> see why we'd need to overload no_write_same which was meant for an
> entirely different purpose.

It definitely seems odd to default to trying WRITE SAME for unmap
for a device that explicitly tells us that it doesn't support WRITE
SAME.

Note that I'm not against your patch - I suspect forcing us to read
EVPD pages even for devices that claim to be SPC-2 will come in useful
in various scenarios.
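
For reference, such an entry would go into the scsi_static_device_list
table in drivers/scsi/scsi_devinfo.c, along the lines of the sketch
below - the exact vendor/model/revision strings for the hyperv virtual
disk are an assumption on my side:

	{"Msft", "Virtual Disk", "1.0", BLIST_TRY_VPD_PAGES},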


Re: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16

2014-07-16 Thread h...@infradead.org
On Sun, Jul 13, 2014 at 08:58:34AM -0400, Martin K. Petersen wrote:
> > "KY" == KY Srinivasan  writes:
> 
> KY> Windows hosts do support UNMAP and set the field in the
> KY> EVPD. However, since the host advertises SPC-2 compliance, Linux
> KY> does not even query the VPD page.
>  
> >> If we want to enable UNMAP in this case I'd prefer a blacklist entry
> >> over trying UNMAP despite the device not advertising it.
> 
> I agree with that. We could do something like the patch below.
> 
> However, I do think it's a good idea that you guys are looking into
> reporting SPC-3.

KY mentioned that they have a prototype for that now.

Btw, I looked over sd.c a bit more, and I think I understand why they
get the WRITE SAME commands now:

read_capacity_16 calls sd_config_discard(sdkp, SD_LBP_WS16) if the LBPME
bit is set.  At least older SBC drafts left it wide open whether a target
supports WRITE SAME with UNMAP or UNMAP in this case.  So I think we'd
still want a patch to use UNMAP instead of WRITE SAME for this case,
which should also fix hyperv.  Below is the quick hack version of that,
which just checks the host no_write_same flag, as the one on the device
isn't set yet - I guess we need to refactor some of that logic.


diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 87566b5..4480fdf 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2035,7 +2035,10 @@ static int read_capacity_16(struct scsi_disk *sdkp, 
struct scsi_device *sdp,
if (buffer[14] & 0x40) /* LBPRZ */
sdkp->lbprz = 1;
 
-   sd_config_discard(sdkp, SD_LBP_WS16);
+   if (sdp->host->no_write_same)
+   sd_config_discard(sdkp, SD_LBP_UNMAP);
+   else
+   sd_config_discard(sdkp, SD_LBP_WS16);
}
 
sdkp->capacity = lba + 1;
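
For orientation, the surrounding read_capacity_16 logic that this hunk
tweaks looks roughly like the following - a from-memory sketch, not a
verbatim quote of sd.c:

	if (buffer[14] & 0x80) { /* LBPME */
		sdkp->lbpme = 1;

		if (buffer[14] & 0x40) /* LBPRZ */
			sdkp->lbprz = 1;

		sd_config_discard(sdkp, SD_LBP_WS16);
	}

i.e. the mere presence of the LBPME bit, without any EVPD data, is
enough to pick WRITE SAME(16) as the discard method.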



Re: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16

2014-07-11 Thread h...@infradead.org
On Wed, Jul 09, 2014 at 10:27:24PM +, James Bottomley wrote:
> If we fix it at source, why would there be any need to filter?  That's
> the reason the no_write_same flag was introduced.  If we can find and
> fix the bug, it can go back into the stable trees as a bug fix, hence
> nothing should ever emit write_same(10 or 16) and additional driver code
> is redundant (and counter productive, since if this ever breaks again
> you're our best canary).
> 
> This looks like it might be the problem but Martin should confirm (I
> think the problem comes to us from the RC16 code which unconditionally
> sets WS16).

I think the problem is a different one.  If we have the logical
provisioning EVPD it configures what method to use, but if we don't have
one we simply check for a max unmap blocks field, and if that's not
present use WRITE SAME.

The patch checks the no_write_same flag before doing that, for which
we also have to do the write_same setup before the discard setup
in sd_revalidate_disk.

KY: does hyperv support UNMAP?  If so, any idea why it doesn't set
the maximum unmap block count field in the EVPD?

If we want to enable UNMAP in this case I'd prefer a blacklist entry
over trying UNMAP despite the device not advertising it.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ba756b1..fbccfd2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2614,9 +2614,10 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 
if (sdkp->max_unmap_blocks)
sd_config_discard(sdkp, SD_LBP_UNMAP);
-   else
+   else if (!sdkp->device->no_write_same)
sd_config_discard(sdkp, SD_LBP_WS16);
-
+   else
+   sd_config_discard(sdkp, SD_LBP_DISABLE);
} else {/* LBP VPD page tells us what to use */
 
if (sdkp->lbpu && sdkp->max_unmap_blocks)
@@ -2766,6 +2767,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 */
if (sdkp->media_present) {
sd_read_capacity(sdkp, buffer);
+   sd_read_write_same(sdkp, buffer);
 
if (sd_try_extended_inquiry(sdp)) {
sd_read_block_provisioning(sdkp);
@@ -2776,7 +2778,6 @@ static int sd_revalidate_disk(struct gendisk *disk)
sd_read_write_protect_flag(sdkp, buffer);
sd_read_cache_type(sdkp, buffer);
sd_read_app_tag_own(sdkp, buffer);
-   sd_read_write_same(sdkp, buffer);
}
 
sdkp->first_scan = 0;



Re: [PATCH 4/8] Drivers: scsi: storvsc: Filter WRITE_SAME_16

2014-07-10 Thread h...@infradead.org
On Wed, Jul 09, 2014 at 10:36:26PM +, KY Srinivasan wrote:
> Ok; I am concerned about older kernels that do not have no_write_same flag.
> I suppose I can work directly with these Distros and give them a
> choice: either implement the no_write_same flag or filter the command
> in our driver. I will not include this patch when

Exactly.  In general they should be interested in the flag as various
raid controllers react badly to it as well.


