from:"Darrick J. Wong"

Re: [LSF/MM TOPIC] Patch Submission process and Handling Internal Conflict

2018-01-24 Thread Darrick J. Wong

On Wed, Jan 24, 2018 at 01:36:00PM -0800, James Bottomley wrote:
> On Wed, 2018-01-24 at 11:20 -0800, Mike Kravetz wrote:
> > On 01/24/2018 11:05 AM, James Bottomley wrote:
> > > 
> > > I've got two community style topics, which should probably be
> > > discussed in the plenary
> > > 
> > > 1. Patch Submission Process
> > > 
> > > Today we don't have a uniform patch submission process across
> > > Storage, Filesystems and MM.  The question is should we (or at
> > > least should we adhere to some minimal standards).  The standard
> > > we've been trying to hold to in SCSI is one review per accepted
> > > non-trivial patch.  For us, it's useful because it encourages
> > > driver writers to review each other's patches rather than just
> > > posting and then complaining their patch hasn't gone in.  I can
> > > certainly think of a couple of bugs I've had to chase in mm where
> > > the underlying patches would have benefited from review, so I'd
> > > like to discuss making the one review per non-trivial patch our base
> > > minimum standard across the whole of LSF/MM; it would certainly
> > > serve to improve our Reviewed-by statistics.
> > 
> > Well, the mm track at least has some discussion of this last year:
> > https://lwn.net/Articles/718212/
> 
> The pushback in your session was mandating reviews would mean slowing
> patch acceptance or possibly causing the dropping of patches that
> couldn't get reviewed.  Michal did say that XFS didn't have the
> problem, however there not being XFS people in the room, discussion
> stopped there.

I actually /was/ lurking in the session, but a year later I have more
thoughts:

Now that I've been maintainer for more than a year I feel more confident
in actually talking about our review processes, though I can only speak
about my own experiences and hope the other xfs developers chime in if
they choose.

In xfs we are fortunate enough that most of the codebase is at least
one software layer up from the raw hardware, which means that anybody
can build xfs with all kconfig options enabled and use it to try to
create all possible metadata structures, which means that the ability to
review a given patch and try it out isn't restricted to the subset of
people with a particular hardware device.  This means that there aren't
any patches that cannot be reviewed, which is not something I'm so sure
of for the mm layer.

Requiring review on the vast majority of non-maintainer patches that
go into xfs (and xfsprogs) doesn't have the effect of increasing the
time to upstream acceptance, since the fact that it was committed at all
implies that the maintainer probably looked at it.

The dangerous part of course is when the maintainer commits non-trivial
code without a review -- did they look at it, or just commit whatever
made the symptoms go away?  So that's argument #1 for creating a group
norm that yes, everyone should be involved in review on a semi regular
basis.  Certainly if they're also *submitting* patches.

Argument #2 is that encouraging review of everything most likely reduces
the overall time it takes for a feature to mature because that means
that at least one of the regular participants in the group has taken
the time to read and understand how the patches mesh with the existing
systems and will ask questions when they see ill-fitting pieces.  It
definitely reduces code churn from not having to walk back bad patches
and rushed microcode updates.  That said, I've no data to back up this
assertion, merely my observations of the past decade.

My third argument is that the most time consuming part of
maintainership isn't gluing patches onto a git tree and running tests,
it's reviewing the patches.  It's a big help to know that other people
who are more familiar with various subcomponents of xfs review patches
regularly, so I don't feel as much pressure to know all things at all
times, and I worry less about blind spots because we work as a group of
people who don't see every xfs component in exactly the same way.

(Granted it helps that Dave Chinner is a fountain of historical context
indexing...)

That said, I also get really itchy to commit my own patches at times,
especially things that look like trivial one-liners.  However, I find
that nothing in xfs is simple, and moreover the reviewers are
knowledgeable enough that even trivial patches can get reviewed quickly.

For bigger things like new features or large refactorings, there's a
strong need for updating documentation like the disk format
specification, developing a test plan, and integrating new tests into
xfstests.  That's where review is most useful, because it is the
submitter's opportunity to increase everyone's knowledge levels.  It is
also the reviewers' chance to anticipate design problems when it is
easy/cheap to fix them, and for everyone to build confidence about the
code that's going in.

The challenge for everyone, then, is to get together to decide on a
reasonable target for the amount and the

Re: [trivial PATCH] treewide: Align function definition open/close braces

2017-12-18 Thread Darrick J. Wong

> --- a/drivers/message/fusion/mptsas.c
> +++ b/drivers/message/fusion/mptsas.c
> @@ -2968,7 +2968,7 @@ mptsas_exp_repmanufacture_info(MPT_ADAPTER *ioc,
>   mutex_unlock(&ioc->sas_mgmt.mutex);
>  out:
>   return ret;
> - }
> +}
>  
>  static void
>  mptsas_parse_device_info(struct sas_identify *identify,
> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c 
> b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> index 3dd973475125..0ea141ece19e 100644
> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> @@ -603,7 +603,7 @@ static struct uni_table_desc *nx_get_table_desc(const u8 
> *unirom, int section)
>  
>  static int
>  netxen_nic_validate_header(struct netxen_adapter *adapter)
> - {
> +{
>   const u8 *unirom = adapter->fw->data;
>   struct uni_table_desc *directory = (struct uni_table_desc *) &unirom[0];
>   u32 fw_file_size = adapter->fw->size;
> diff --git a/drivers/net/wireless/ath/ath9k/xmit.c 
> b/drivers/net/wireless/ath/ath9k/xmit.c
> index bd438062a6db..baedc7186b10 100644
> --- a/drivers/net/wireless/ath/ath9k/xmit.c
> +++ b/drivers/net/wireless/ath/ath9k/xmit.c
> @@ -196,7 +196,7 @@ ath_tid_pull(struct ath_atx_tid *tid)
>   }
>  
>   return skb;
> - }
> +}
>  
>  static struct sk_buff *ath_tid_dequeue(struct ath_atx_tid *tid)
>  {
> diff --git a/drivers/platform/x86/eeepc-laptop.c 
> b/drivers/platform/x86/eeepc-laptop.c
> index 5a681962899c..4c38904a8a32 100644
> --- a/drivers/platform/x86/eeepc-laptop.c
> +++ b/drivers/platform/x86/eeepc-laptop.c
> @@ -492,7 +492,7 @@ static void eeepc_platform_exit(struct eeepc_laptop 
> *eeepc)
>   * potentially bad time, such as a timer interrupt.
>   */
>  static void tpd_led_update(struct work_struct *work)
> - {
> +{
>   struct eeepc_laptop *eeepc;
>  
>   eeepc = container_of(work, struct eeepc_laptop, tpd_led_work);
> diff --git a/drivers/rtc/rtc-ab-b5ze-s3.c b/drivers/rtc/rtc-ab-b5ze-s3.c
> index a319bf1e49de..ef5c16dfabfa 100644
> --- a/drivers/rtc/rtc-ab-b5ze-s3.c
> +++ b/drivers/rtc/rtc-ab-b5ze-s3.c
> @@ -648,7 +648,7 @@ static int abb5zes3_rtc_set_alarm(struct device *dev, 
> struct rtc_wkalrm *alarm)
>   ret);
>  
>   return ret;
> - }
> +}
>  
>  /* Enable or disable battery low irq generation */
>  static inline int _abb5zes3_rtc_battery_low_irq_enable(struct regmap *regmap,
> diff --git a/drivers/scsi/dpt_i2o.c b/drivers/scsi/dpt_i2o.c
> index fd172b0890d3..a00d822e3142 100644
> --- a/drivers/scsi/dpt_i2o.c
> +++ b/drivers/scsi/dpt_i2o.c
> @@ -3524,7 +3524,7 @@ static int adpt_i2o_systab_send(adpt_hba* pHba)
>  #endif
>  
>   return ret; 
> - }
> +}
>  
>  
>  
> /*
> diff --git a/drivers/scsi/sym53c8xx_2/sym_glue.c 
> b/drivers/scsi/sym53c8xx_2/sym_glue.c
> index 791a2182de53..7320d5fe4cbc 100644
> --- a/drivers/scsi/sym53c8xx_2/sym_glue.c
> +++ b/drivers/scsi/sym53c8xx_2/sym_glue.c
> @@ -1393,7 +1393,7 @@ static struct Scsi_Host *sym_attach(struct 
> scsi_host_template *tpnt, int unit,
>   scsi_host_put(shost);
>  
>   return NULL;
> - }
> +}
>  
>  
>  /*
> diff --git a/fs/locks.c b/fs/locks.c
> index 21b4dfa289ee..d2399d001afe 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -559,7 +559,7 @@ static const struct lock_manager_operations 
> lease_manager_ops = {
>   * Initialize a lease, use the default lock manager operations
>   */
>  static int lease_init(struct file *filp, long type, struct file_lock *fl)
> - {
> +{
>   if (assign_type(fl, type) != 0)
>   return -EINVAL;
>  
> diff --git a/fs/ocfs2/stack_user.c b/fs/ocfs2/stack_user.c
> index dae9eb7c441e..d2fb97b173da 100644
> --- a/fs/ocfs2/stack_user.c
> +++ b/fs/ocfs2/stack_user.c
> @@ -398,7 +398,7 @@ static int ocfs2_control_do_setnode_msg(struct file *file,
>  
>  static int ocfs2_control_do_setversion_msg(struct file *file,
>  struct ocfs2_control_message_setv 
> *msg)
> - {
> +{
>   long major, minor;
>   char *ptr = NULL;
>   struct ocfs2_control_private *p = file->private_data;
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 0da80019a917..217108f765d5 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2401,7 +2401,7 @@ static bool
>  xfs_agf_verify(
>   struct xfs_mount *mp,
>   struct xfs_buf  *bp)
> - {
> +{
>   struct xfs_agf  *agf = XFS_BUF_TO_AGF(bp);
>  
>   if (xfs_

Re: [PATCH 0/3] Improve block device testing coverage

2017-03-31 Thread Darrick J. Wong

On Fri, Mar 31, 2017 at 03:11:28PM +0000, Bart Van Assche wrote:
> On Fri, 2017-03-31 at 13:02 +0300, Dmitry Monakhov wrote:
> > Another good example may be a bug with dirty page cache after blkdiscard
> > https://lkml.org/lkml/2017/3/22/789 . This simple bug results in a crappy
> > fs image if mkfs relies on discard_zeroes_data behaviour.
> > So IMHO basic blkdev test coverage is important for filesystem testing, i.e.
> > important for xfstests.
> 
> Mixing up filesystem tests and block layer / block driver tests in the same
> directory is completely wrong. Block driver developers will be primarily
> interested in the block tests and may want to skip the filesystem tests.
> Filesystem developers will probably run the block tests only once and will
> likely run the filesystem tests repeatedly. Mixing up different kinds of
> tests in the same directory makes it unnecessarily hard to run block and
> filesystem tests separately.

During LSF I had started to wonder if we should just create a new
FSTYP=blockdev fs type with a no-op mkfs & mount.  "_require_fs generic"
could be taught to ignore FSTYP=blockdev; blockdev tests that should
work on all block devices can stay in tests/generic, and blockdev tests
that require specific features or complicated setup can go in
tests/blockdev.

The benefit (for the fs developers, anyway) of having complex block
device setup code helper functions in common/ is that then we can also
start writing tests to see how the fs reacts with more complex storage
setups.  We already have some of that for dm_{thin,flakey,delay,error}.

That way we keep the tests together and make it easy to run them (when
applicable) as part of regular fs testing, and avoid the situation where
bdevtests and xfstests slowly drift apart in terms of behaviors and
command line switches.

The downside ofc is the potential for bloat. :)

(The blockdev fallocate tests fit the fs/block split awkwardly --
they call what is nominally a fs feature on something that isn't itself
a filesystem...)

 Just my 5c.

--D

> 
> Bart.
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BLKZEROOUT not zeroing md dev on VMDK

2016-05-26 Thread Darrick J. Wong

On Wed, May 18, 2016 at 11:39:30PM +0100, Sitsofe Wheeler wrote:
> Hi,
> 
> With Ubuntu's 4.4.0-22-generic kernel and a Fedora 23
> 4.6.0-1.vanilla.knurd.1.fc23.x86_64 kernel I've found that the
> BLKZEROOUT syscall can malfunction and not zero data.
> 
> When BLKZEROOUT is issued to an MD device atop a PVSCSI controller
> supplied VMDK from ESXi 6.0 the call returns immediately and with a zero
> return code. Unfortunately, inspecting the data on the MD device shows
> that it has not been zeroed and is in fact untouched. The easiest way to
> see this behaviour is to boot the VM, create an mdadm device atop
> /dev/sd?, scribble some non-zero value on the disk and then use
> blkdiscard --zeroout /dev/md??? . If you then inspect the MD disk (e.g.
> with hexdump) you will still see the old data and using POSIX_FADV_DONTNEED
> on the MD device doesn't change the outcome.
> 
> The only clue I've seen is that
> /sys/block/sd?/queue/write_same_max_bytes starts out being 33553920 but
> after a WRITE SAME is issued it becomes 0. If the MD device is created
> after write_same_max_bytes has become 0 on the backing disk then
> BLKZEROOUT seems to work correctly.

It's possible that the pvscsi device advertised WRITE SAME, but if the device
sends back ILLEGAL REQUEST then the SCSI disk driver will set
write_same_max_bytes=0.  Subsequent BLKZEROOUT attempts will then issue writes
of zeroes to the drive.
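
For anyone who wants to poke at this from userspace, here is a minimal
sketch (not part of the original report) of issuing BLKZEROOUT and
checking a readback buffer.  The helper names zeroout_range() and
buffer_is_zero() are made up for illustration.  BLKZEROOUT takes a
{start, length} pair in bytes, both multiples of 512.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKZEROOUT */

/*
 * Hypothetical helper: ask the block layer to zero [start, start + len).
 * Both values are in bytes and must be multiples of 512.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int zeroout_range(int fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };

	return ioctl(fd, BLKZEROOUT, range);
}

/* Hypothetical helper: returns 1 if every byte in buf is zero, else 0. */
static int buffer_is_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i])
			return 0;
	return 1;
}
```

A zero return from zeroout_range() only says the ioctl claimed success,
which is exactly the case being reported here, so the readback check is
the interesting part.  Reading back through an O_DIRECT descriptor is
probably the safest way to keep the page cache out of the picture.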

--D

> 
> -- 
> Sitsofe | http://sucs.org/~sits/
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Lsf] LSF/MM Schedule and improving discard support

2016-04-13 Thread Darrick J. Wong

On Wed, Apr 13, 2016 at 09:51:04AM -0700, James Bottomley wrote:
> On Wed, 2016-04-13 at 09:29 -0700, Bart Van Assche wrote:
> > On 04/13/2016 09:21 AM, Martin K. Petersen wrote:
> > > From a filesystem/ioctl perspective, BLKDISCARD is a hint. We
> > > should not be rounding off or aligning anything.
> > 
> > Hello Martin,
> > 
> > Today if a BLKDISCARD ioctl passes a non-aligned start and/or end
> > sector to the kernel then the block layer will submit invalid
> > (non-aligned) REQ_DISCARD requests to the block driver the ioctl applies
> > to. This is not acceptable. Does the above mean that you are
> > proposing to fail such BLKDISCARD ioctls with an error code?
> 
> The answer would be of course not.  discard is a hint so malformed
> discard gets ignored by the device and success is returned because you
> can't oblige devices to obey hints (that's why they're called hints).

Agree.  For blockdev FALLOC_FL_PUNCH_HOLE I think we can simply check for
logical block size ("lbs") alignment and then pass the request to the
device with the understanding that it can do as it pleases.  We asked the
device to try to deallocate blocks, and perhaps it cannot.

Just to be clear, this only applies to zeroing discard; the "discard and who
knows what you can now read back" thing that nobody likes has been temporarily
wired up to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE. :)

> However, the problem of needing a mandatory discard for scrubbing
> blocks is part of the fallocate discussion, I think.

The third fallocate mode (FALLOC_FL_ZERO_RANGE) doesn't fit with the phrase
"mandatory discard for scrubbing blocks", though if one removed "discard" from
that phrase then it would.  The only thing that ZERO_RANGE guarantees is that
subsequent reads return zeroes.  XFS punches the entire range and reallocates
it with unwritten extents; ext4 fills the holes in the range with unwritten
extents and converts real extents to unwritten.  Both also write zeroes to any
part of the range that doesn't align to an FS block.
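
To make the "write zeroes to the unaligned parts" behavior concrete,
here is a rough sketch of how a byte range might be carved into an
unaligned head, a run of whole blocks, and an unaligned tail.  This is
an illustration only, not the XFS or ext4 code; struct zero_split and
split_zero_range() are hypothetical names, and the block size is assumed
to be a power of two.

```c
#include <stdint.h>

/* Hypothetical description of how one zeroing request gets carved up. */
struct zero_split {
	uint64_t head_len;	/* unaligned bytes at the front, zeroed by writing */
	uint64_t body_off;	/* first block-aligned byte */
	uint64_t body_len;	/* whole blocks, eligible for unwritten extents */
	uint64_t tail_len;	/* unaligned bytes at the back, zeroed by writing */
};

/* blksz must be a nonzero power of two. */
static struct zero_split split_zero_range(uint64_t off, uint64_t len,
					  uint64_t blksz)
{
	struct zero_split s = { 0, 0, 0, 0 };
	uint64_t end = off + len;				/* exclusive */
	uint64_t first = (off + blksz - 1) & ~(blksz - 1);	/* round up */
	uint64_t last = end & ~(blksz - 1);			/* round down */

	if (first >= last) {
		/* No whole block inside the range: write zeroes everywhere. */
		s.head_len = len;
		s.body_off = off;
		return s;
	}
	s.head_len = first - off;
	s.body_off = first;
	s.body_len = last - first;
	s.tail_len = end - last;
	return s;
}
```

With 4096-byte blocks, a request for offset 1000 and length 10000 splits
into 3096 bytes of head, one whole block at 4096, and 2808 bytes of tail.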

Yes, I think there are several questions to resolve here for mandatory zeroing
with FALLOC_FL_ZERO_RANGE (summarizing the issues I've come up with so far):

a) Should blockdev fallocate accept byte-granular offset/length arguments, even
if it has to use the page cache to write zeroes to the device?  This is what
file fallocate does today.

b) If blockdev fallocate does impose alignment requirements, should it return
EINVAL to a request that isn't aligned to the logical block size?

c) If a device really really prefers that its requests are aligned to
min_io_size (which can be much larger than the logical block size), should it
reject requests that aren't aligned to min_io?  Or perhaps it should take care
of the alignment problems on its own somehow?

For allocate mode (the thing Mike Snitzer brought up in another thread
yesterday), the alignment problems are much easier because we're allowed to
round the start down and the end up to fit whatever alignment we require.
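
The difference between the two policies can be sketched in a few lines.
The first helper is the strict check that question (b) would turn into
EINVAL; the second is the outward rounding that allocate mode is allowed
to do.  Both names are hypothetical and the alignment is assumed to be a
power of two (logical block size, min_io_size, or whatever we settle on).

```c
#include <stdint.h>

/* Returns 1 if both offset and length are multiples of align, else 0. */
static int range_is_aligned(uint64_t off, uint64_t len, uint64_t align)
{
	return ((off | len) & (align - 1)) == 0;
}

/*
 * Allocate mode: grow [off, off + len) outward so that the start is
 * rounded down and the end is rounded up to the alignment.
 */
static void round_range_outward(uint64_t *off, uint64_t *len, uint64_t align)
{
	uint64_t start = *off & ~(align - 1);
	uint64_t end = (*off + *len + align - 1) & ~(align - 1);

	*off = start;
	*len = end - start;
}
```

Rounding outward is only safe for allocation, since it touches more
space than the caller asked for; doing the same for zeroing or punching
would destroy data outside the requested range.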

Should we promote this to a storage track session at LSF next week?

--D

> 
> James
> 
> 

Please submit specific discussion proposals for the File & Storage miniconf at LPC2015

2015-06-04 Thread Darrick J. Wong

Hi folks,

Well, we made it!  As of yesterday, the File & Storage systems microconf has
been approved for Plumbers!  If you're interested in attending, I highly
recommend that you register[0] immediately, as the earlybird deadline is
tomorrow, June 5th.

We have a solid list of discussion ideas on the wiki page[1], and three hours
in which to conduct those discussions!  If you are interested in leading one of
the three hourlong sessions, it is now time to submit[2] a specific proposal
for consideration.  People selected to be session leaders can have their
registrations changed to the speaker package even after registering.

Proposals needn't be strictly limited to the fourteen bullet points on the wiki
page.  I will try to have the three key discussions lined up by the end of the
month, so please send in proposals!

--Darrick

[0] https://www.regonline.com/register/login.aspx?eventID=1623891&MethodId=0&EventsessionId=
[1] http://wiki.linuxplumbersconf.org/2015:file_and_storage_systems
[2] https://linuxplumbersconf.org/2015/ocw/events/LPC2015/proposals

Re: [dm-devel] Proposal for annotating _unstable_ pages

2015-05-22 Thread Darrick J. Wong

On Thu, May 21, 2015 at 09:21:12PM +0200, Jan Kara wrote:
> On Thu 21-05-15 11:09:55, Kent Overstreet wrote:
> > On Thu, May 21, 2015 at 06:54:53PM +0200, Jan Kara wrote:
> > > On Wed 20-05-15 18:04:40, Kent Overstreet wrote:
> > > > > Yeah.  I never figured out a sane way to migrate pages and keep everything
> > > > > else happy.  Daniel Phillips is having a go at page forking for tux3; let's
> > > > > see if the questions about that get resolved.
> > > >
> > > > That would be great, we need something.
> > > >
> > > > I'd also be really curious what btrfs is doing today - is it just bouncing
> > > > everything internally, or did they come up with something more clever?
> > >
> > > Btrfs is just waiting for IO to complete.
> > >
> > > > > > Also, there's probably always going to be situations where we're reading or
> > > > > > writing to pages user space can stomp on (dio) - IMO we need to add a bio flag
> > > > > > to annotate this - if you need this to be stable you have to bounce it.
> > > > > > Otherwise either filesystems/block drivers are going to be stuck bouncing
> > > > > > everything, or it'll just (continue to be) buggy.
> > > > >
> > > > > Well, for now there's BIO_SNAP_STABLE that forces the block layer to bounce it,
> > > > > but right now ext3 is the last user of it, and afaict btrfs is the only other
> > > > > FS that takes care of stable pages on its own.
> > > >
> > > > I have no idea what BIO_SNAP_STABLE was supposed to be for, but I don't see how
> > > > it's useful for anything sane.
> > >
> > > It's for the case where lower layer requests it needs stable pages but
> > > upper layer isn't able to provide them (as is the case of ext3). Then block
> > > layer bounces the data for the caller.
> > >
> > > > But that's the complete opposite of the problem stable pages are supposed to
> > > > solve: stable pages are for when the _lower_ layer (be it filesystem, bcache,
> > > > md, lvm) needs the memory being either read to or written from (both, it's not
> > > > just writes) to not be diddled over while the IO is in flight.
> > > >
> > > > Now, a point that I think has been missed is that stable pages are _not_ a
> > > > complete solution, at least for consumers in the block layer.
> > > >
> > > > The situation today is that if I'm in the block layer, and I get a handed a read
> > > > or write bio, I _don't know_ if it's from something that's going to diddle over
> > > > those pages or not. So if I require stable pages - be it for data checksumming
> > > > or for other things - I've just got to bounce the bio myself.
> > > >
> > > > And then the really annoying thing is that if you've got stacked things that all
> > > > need stable pages (maybe btrfs on top of bcache on top of md) - they _all_ have
> > > > to assume the pages aren't going to be stable, so if they need them they _all_
> > > > have to bounce - even though once the first layer bounced the bio that made it
> > > > stable for everything underneath it.
> > >
> > > The current design is that if you need stable pages for your device, set
> > > bdi capability BDI_CAP_STABLE_WRITES, fs then takes care of not scribbling
> > > over your page while it is under writeback or uses BIO_SNAP_STABLE if it
> > > cannot.
> >
> > But if I need stable pages, I still have to bounce because that _does not_
> > guarantee stable pages, it only gives me stable pages for some of the IOs and in
> > the lower layers you can't tell which is which.
> >
> > Do you see the problem? What good is BDI_CAP_STABLE_WRITES if it's not a
> > guarantee and I can't tell if I need to bounce or not?
>   So fix the upper layers to make it a guarantee? You mentioned direct IO
> needs fixing. Anything else?

Back when I was writing the stable pages patches, I observed that some of the
filesystems didn't hold the pages containing their own metadata stable during
writeback on a stable-writes device.  The journalling filesystems were fine
because they had various means to take care of that.

ISTR ext2 and vfat were the biggest culprits, but both maintainers rejected
the patches to fix that behavior.  This might no longer be the case; those
patches were so long ago I can't find them in Google.

--D

 
>    Honza
> -- 
> Jan Kara <j...@suse.cz>
> SUSE Labs, CR
>
> --
> dm-devel mailing list
> dm-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

LPC2015: File and Storage Systems uconf

2015-04-03 Thread Darrick J. Wong

Hi everyone,

Linux Plumbers is coming up in just four months!  I would like for there to be
a file & storage miniconf at this year's LPC, so I've started assembling a plan
for what we might discuss.  As a starting point, I've filled the planning page
with the topics that didn't achieve any sort of resolution at LSF/MM:

http://wiki.linuxplumbersconf.org/2015:file_and_storage_systems

There are undoubtedly things that I missed in my initial list, and it would be
very helpful to figure out who's going.

If you'd like to visit Seattle in mid-August (I promise it probably won't be
raining!) and/or have a topic that you'd like to talk about that I missed,
I'd appreciate it if you wrote it into the wiki page.

Thanks,

--Darrick

[PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2015-01-21 Thread Darrick J. Wong

Create a new ioctl to expose the block layer's newfound ability to
issue either a zeroing discard, a WRITE SAME with a zero page, or a
regular write with the zero page.  This BLKZEROOUT2 ioctl takes
{start, length, flags} as parameters.  So far, the only flag available
is to enable the zeroing discard part -- without it, the call invokes
the old BLKZEROOUT behavior.  start and length have the same meaning
as in BLKZEROOUT.

Furthermore, because BLKZEROOUT2 issues commands directly to the
storage device, we must invalidate the page cache (as a regular
O_DIRECT write would do) to avoid returning stale cache contents at a
later time.

Depends on block: Add discard flag to blkdev_issue_zeroout() function.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 block/ioctl.c   |   45 ++---
 include/uapi/linux/fs.h |7 +++
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7d8befd..ff623d5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -186,19 +186,39 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 }
 
 static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
-			     uint64_t len)
+			     uint64_t len, uint32_t flags)
 {
+	int ret;
+	struct address_space *mapping;
+	uint64_t end = start + len - 1;
+
+	if (flags & ~BLKZEROOUT2_DISCARD_OK)
+		return -EINVAL;
 	if (start & 511)
 		return -EINVAL;
 	if (len & 511)
 		return -EINVAL;
-	start >>= 9;
-	len >>= 9;
-
-	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+	if (end >= i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
-	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+	/* Invalidate the page cache, including dirty pages */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+				   flags & BLKZEROOUT2_DISCARD_OK);
+	if (ret)
+		goto out;
+
+	/*
+	 * Invalidate again; if someone wandered in and dirtied a page,
+	 * the caller will be given -EBUSY.
+	 */
+	ret = invalidate_inode_pages2_range(mapping,
+					    start >> PAGE_CACHE_SHIFT,
+					    end >> PAGE_CACHE_SHIFT);
+out:
+	return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -326,7 +346,18 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 		if (copy_from_user(range, (void __user *)arg, sizeof(range)))
 			return -EFAULT;
 
-		return blk_ioctl_zeroout(bdev, range[0], range[1]);
+		return blk_ioctl_zeroout(bdev, range[0], range[1], 0);
+	}
+	case BLKZEROOUT2: {
+		struct blkzeroout2 p;
+
+		if (!(mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(&p, (void __user *)arg, sizeof(p)))
+			return -EFAULT;
+
+		return blk_ioctl_zeroout(bdev, p.start, p.length, p.flags);
 	}
 
 	case HDIO_GETGEO: {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..54d24ea 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -150,6 +150,13 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+struct blkzeroout2 {
+	__u64 start;
+	__u64 length;
+	__u32 flags;
+};
+#define BLKZEROOUT2_DISCARD_OK 1
+#define BLKZEROOUT2 _IOR(0x12, 127, struct blkzeroout2)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-11 Thread Darrick J. Wong

On Wed, Dec 10, 2014 at 05:41:54PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2014 at 02:29:29AM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 10, 2014 at 02:15:14AM -0800, Darrick J. Wong wrote:
> > > On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
> > > > On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
> > > > > Hi,
> > > > >
> > > > > On 09-12-14 20:31, Darrick J. Wong wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
> > > > > > However, the kernel crashes when I plug it in[1].
> > > > >
> > > > > Yes there are some known issues with uas error handling which are fixed
> > > > > in 3.18, can you try with a 3.18 kernel please ?
> > > >
> > > > The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a fuller
> > > > dmesg output.  Looking at the code, it looks like we end up in
> > > > queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we fall off
> > > > the end.
>
> Well, there are (at least) two issues going on here.  The first is that the
> SCSI layer passes us zero-length READ10 commands, which is causing this crash.
> Zero length means the sglist is empty, so the usb host has nothing to map, and
> hence urb->num_mapped_sgs == 0 and the loop goes boom.  I don't know what it
> means to send a bulk URB with no buffers, so...
>
> ...then I took a tour of how SCSI LLDDs deal with zero-length read/write
> commands.  mpt2sas attaches a junk sg and pushes the command out.  libata
> detects zero-length READ/WRITE SCSI commands and completes the scsi command
> without ever touching hardware.  I wasn't able to get any of my parallel SCSI
> disks to boot, so I could not try that.
>
> The other problem is when I plug in a different disk (same mfg/model), READ
> CAPACITY 16 intermittently returns the string USBSUSBSUSBS, which of course
> is garbage.  The kernel then tries to use these values; fortunately, it rejects
> a sector size of 1431519827 (USBS) and sets the size to zero.

It turns out that this dongle will return USBSUSBSUSB to just about
*any* command, such as READ10.  In fact, that's the root cause of the
crash.  The partition code issues a 4k read to the disk (looking for
partition tables).  The dongle returns USBSUSBSUSB (13 bytes) which
causes the bio to be advanced by 13 bytes because the URB's
actual_length is stuffed into the SCSI resid(ual length) field.  The
block layer code now wants to read 4083 bytes starting at byte 13,
which results in 3584 bytes being read ... to somewhere.  This leaves
499 bytes in the bio, which is rounded down to 0 sectors, and thus we
crash on a zero-length READ10 when we try to read the remaining piece
and there's no sg to land the data.  Worse yet, if you somehow patch
all *that* up, now the reader sees USBSUSBSUSB when the bio completes.
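
The byte counts in that chain are easier to follow when written out.
The sketch below models only the arithmetic, not the block layer; the
helper names are made up, and the only inputs are the numbers quoted in
this thread (a 4096-byte read, 13 bytes consumed, 512-byte sectors, and
the bogus 1431519827 sector size from READ CAPACITY 16).

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT	9	/* 512-byte sectors */

/* Portion of a byte count that can still be issued as whole sectors. */
static uint32_t whole_sector_bytes(uint32_t bytes)
{
	return (bytes >> SKETCH_SECTOR_SHIFT) << SKETCH_SECTOR_SHIFT;
}

/* Interpret four bytes as a big-endian 32-bit value. */
static uint32_t be32_from_bytes(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Starting from 4096 - 13 = 4083 bytes, whole_sector_bytes() gives 3584,
which leaves 499 bytes; 499 >> 9 is 0 sectors, hence the zero-length
READ10.  Feeding the four bytes "USBS" to be32_from_bytes() gives
1431519827, the sector size the kernel rejected.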

Let's disable UAS on this thing entirely.  (Well, you /could/ hack it
to detect USBSUSBSUSB and fail the SCSI command entirely, but... meh.)

Though we should shortcut a zero-length read to avoid crashing the
kernel, since sg_raw can issue such commands.

Patches soon,

--D

> So, I can code up a couple of patches -- one to teach UAS how to deal with zero
> length read and writes; and a second patch to set US_FL_IGNORE_UAS on Apricorn
> bridges.  I tried setting US_FL_NO_READ_CAPACITY_16, but for whatever reason
> sd.c was still trying RC16.
>
> --D
>
> > > >
> > > > (Alas it's now 1am here, so I'm going to bed. :/ )
> > >
> > > Eh, nuts to sleeping.  dmesg produces this:
> > >
> > > [  231.128074] usbcore: registered new interface driver usb-storage
> > > [  231.133822] usbcore: registered new interface driver uas
> > > [  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
> > > [  252.136927] scsi host6: uas
> > > [  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  0128 PQ: 0 ANSI: 6
> > > [  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
> > > [  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
> > > [  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
> > > [  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
> > > [  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
> > > [  252.145975] sd 6:0:0:0: [sdc] Assuming drive cache: write through
> >
> > Huh.  4096-byte physical blocks??  That drive is /not/ a 4k sector drive.
> > Here's what the kernel said when I plugged in the other (Plugable brand) UAS
> > bridge[1]:
> >
> > [   32.466870] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
> > [   32.498996] usbcore: registered new interface driver usb-storage
> > [   37.660963] scsi host6: uas
> > [   37.661193] usbcore: registered new interface driver uas
> > [   37.661292] queue_bulk_sg_tx: num=1 sg=880447764500 addr=45af41000 len=0 pagelink=ea00116bd042
> > [   37.661550] queue_bulk_sg_tx: num=1 sg=8804483fb600 addr=45af41000 len=0 pagelink=ea00116bd042
> > [   37.661744] scsi 6:0:0:0: Direct-Access Plugable USB3-SATA-UASP1  0  PQ: 0 ANSI: 6
> > [   37.661865] queue_bulk_sg_tx: num=1 sg

[PATCH] uas: disable UAS on Apricorn SATA dongles

2014-12-11 Thread Darrick J. Wong

The Apricorn SATA dongle will occasionally return USBSUSBSUSB in
response to SCSI commands when running in UAS mode.  Therefore,
disable UAS mode on this dongle.

Signed-off-by: Darrick J. Wong darrick.w...@oracle.com
---
 drivers/usb/storage/unusual_uas.h |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/usb/storage/unusual_uas.h 
b/drivers/usb/storage/unusual_uas.h
index 18a283d..3530cb0 100644
--- a/drivers/usb/storage/unusual_uas.h
+++ b/drivers/usb/storage/unusual_uas.h
@@ -40,6 +40,16 @@
  * and don't forget to CC: the USB development list linux-...@vger.kernel.org
  */
 
+/*
+ * Apricorn USB3 dongle sometimes returns USBSUSBSUSBS in response to SCSI
+ * commands in UAS mode.  Observed with the 1.28 firmware; are there others?
+ */
+UNUSUAL_DEV(0x0984, 0x0301, 0x0128, 0x0128,
+   "Apricorn",
+   "",
+   USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+   US_FL_IGNORE_UAS),
+
 /* https://bugzilla.kernel.org/show_bug.cgi?id=79511 */
 UNUSUAL_DEV(0x0bc2, 0x2312, 0x, 0x,
Seagate,
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong

On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
 Hi,
 
 On 09-12-14 20:31, Darrick J. Wong wrote:
 Hi,
 
 I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
 However, the kernel crashes when I plug it in[1].
 
 Yes there are some known issues with uas error handling which are fixed
 in 3.18, can you try with a 3.18 kernel please ?

The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a fuller
dmesg output.  Looking at the code, it looks like we end up in
queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we fall off
the end.

(Alas it's now 1am here, so I'm going to bed. :/ )

--D

 
 Note that the device will likely still not work, but it should no
 longer crash things. When running 3.18 please collect the output of
 dmesg after plugging in the drive and send that to me, then we'll see
 if we can get it to work from there.
 
 Regards,
 
 Hans

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong

On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
 On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
  Hi,
  
  On 09-12-14 20:31, Darrick J. Wong wrote:
  Hi,
  
  I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
  However, the kernel crashes when I plug it in[1].
  
  Yes there are some known issues with uas error handling which are fixed
  in 3.18, can you try with a 3.18 kernel please ?
 
 The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a fuller
 dmesg output.  Looking at the code, it looks like we end up in
 queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we fall off
 the end.
 
 (Alas it's now 1am here, so I'm going to bed. :/ )

Eh, nuts to sleeping.  dmesg produces this:

[  231.128074] usbcore: registered new interface driver usb-storage
[  231.133822] usbcore: registered new interface driver uas
[  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
[  252.136927] scsi host6: uas
[  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  0128 
PQ: 0 ANSI: 6
[  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
[  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 
GiB)
[  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
[  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
[  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
[  252.145975] sd 6:0:0:0: [sdc] Assuming drive cache: write through
[  252.171739] queue_bulk_sg_tx: num=4294967295 sg=8804584e0b00 addr=   
   (null) len=0 pagelink=116b8882
[  252.173706] queue_bulk_sg_tx: num=4294967295 sg=  (null), ABORT
KABOOM
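
The num=4294967295 in the log above is the value a 32-bit unsigned counter
holds after being decremented from zero.  A minimal userspace sketch of that
wraparound, assuming the counter is a 32-bit unsigned int;
decrement_sg_count() is a hypothetical helper for illustration, not a kernel
function:

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the "--num_sgs" step in the sg walk.
 * Unsigned arithmetic wraps modulo 2^32, so 0 becomes UINT32_MAX.
 */
static uint32_t decrement_sg_count(uint32_t num_sgs)
{
	--num_sgs;
	return num_sgs;
}
```

Because the wrapped value is nonzero, a following "if (num_sgs == 0) break;"
test does not fire, and the loop goes on to fetch the next sg entry.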

I wrote in a printk to spit out num_sgs and some of the sg data right before
the sg_next() call.  Looks like num_sgs is originally zero?  I then patched
the code to break early if num_sgs == 0:

		/* Calculate length for next transfer --
		 * Are we done queueing all the TRBs for this sg entry?
		 */
		this_sg_len -= trb_buff_len;
		printk(KERN_ERR "%s: num=%u sg=%p addr=%lx len=%u pagelink=%lx\n",
		       __func__, num_sgs, sg, addr, this_sg_len, sg->page_link);
		if (this_sg_len == 0) {
			if (num_sgs == 0) {
				printk(KERN_ERR "%s: breaking early, no sgs??\n", __func__);
				break;
			}
			--num_sgs;
			if (num_sgs == 0)
				break;
			sg = sg_next(sg);
			addr = (u64) sg_dma_address(sg);
			this_sg_len = sg_dma_len(sg);

This produced this log[1] which I've excerpted here:

[   96.944791] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
[   96.972881] usbcore: registered new interface driver usb-storage
[  128.315902] scsi host6: uas
[  128.318605] usbcore: registered new interface driver uas
[  128.318691] queue_bulk_sg_tx: num=1 sg=88044650ed00 addr=446958000 len=0 
pagelink=ea00111a5602
[  128.318960] queue_bulk_sg_tx: num=1 sg=880457a03300 addr=446958000 len=0 
pagelink=ea00111a5602
[  128.321144] scsi 6:0:0:0: Direct-Access Apricorn  0128 
PQ: 0 ANSI: 6
[  128.321165] queue_bulk_sg_tx: num=1 sg=880457a03300 addr=45cbb1000 len=0 
pagelink=ea001172ec42
[  128.323714] queue_bulk_sg_tx: num=1 sg=880457a02100 addr=447738000 len=0 
pagelink=ea00111dce02
[  128.326233] queue_bulk_sg_tx: num=1 sg=880457a02600 addr=45a4c8000 len=0 
pagelink=ea0011693202
[  128.329157] sd 6:0:0:0: Attached scsi generic sg2 type 0
[  128.331328] queue_bulk_sg_tx: num=1 sg=88045795ce00 addr=456ad7000 len=0 
pagelink=ea00115ab5c2
[  128.331428] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 
GiB)
[  128.331431] sd 6:0:0:0: [sdc] 4096-byte physical blocks
[  128.331448] queue_bulk_sg_tx: num=1 sg=880457a02100 addr=456ad7000 len=0 
pagelink=ea00115ab5c2
[  128.333772] queue_bulk_sg_tx: num=1 sg=880457a03300 addr=44649e000 len=0 
pagelink=ea0011192782
[  128.336191] queue_bulk_sg_tx: num=1 sg=880457a02700 addr=45683b000 len=0 
pagelink=ea00115a0ec2
[  128.338561] queue_bulk_sg_tx: num=1 sg=880457a02600 addr=37355000 len=0 
pagelink=eadcd542
[  128.340979] queue_bulk_sg_tx: num=1 sg=880457a02c00 addr=8a8e3000 len=0 
pagelink=ea00022a38c2
[  128.343246] sd 6:0:0:0: [sdc] Write Protect is off
[  128.343263] queue_bulk_sg_tx: num=1 sg=880457a02400 addr=8a8e2000 len=0 
pagelink=ea00022a3882
[  128.345461] sd 6:0:0:0: [sdc] No Caching mode page found
[  128.345463] sd 6:0:0:0: [sdc] Assuming drive cache: write through
[  128.345475] queue_bulk_sg_tx: num=1 sg=880457a02000 addr=45ba6ba00 len=0 
pagelink=ea00116e9ac2
[  128.347752] queue_bulk_sg_tx: num=1 sg=880457a02000 addr=8ab21000 len=0 
pagelink=ea00022ac842
[  128.352127] queue_bulk_sg_tx: num=1 sg=880457a02c00 addr=8637f000 len=0 
pagelink=ea000218dfc2
[  128.354225

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong

On Wed, Dec 10, 2014 at 02:15:14AM -0800, Darrick J. Wong wrote:
 On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
  On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
   Hi,
   
   On 09-12-14 20:31, Darrick J. Wong wrote:
   Hi,
   
   I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
   However, the kernel crashes when I plug it in[1].
   
   Yes there are some known issues with uas error handling which are fixed
   in 3.18, can you try with a 3.18 kernel please ?
  
  The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a 
  fuller
  dmesg output.  Looking at the code, it looks like we end up in
  queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we fall 
  off
  the end.
  
  (Alas it's now 1am here, so I'm going to bed. :/ )
 
 Eh, nuts to sleeping.  dmesg produces this:
 
 [  231.128074] usbcore: registered new interface driver usb-storage
 [  231.133822] usbcore: registered new interface driver uas
 [  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
 [  252.136927] scsi host6: uas
 [  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  0128 
 PQ: 0 ANSI: 6
 [  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
 [  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 
 GB/149 GiB)
 [  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
 [  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
 [  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
 [  252.145975] sd 6:0:0:0: [sdc] Assuming drive cache: write through

Huh.  4096-byte physical blocks??  That drive is /not/ a 4k sector drive.
Here's what the kernel said when I plugged in the other (Plugable brand) UAS
bridge[1]:

[   32.466870] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
[   32.498996] usbcore: registered new interface driver usb-storage
[   37.660963] scsi host6: uas
[   37.661193] usbcore: registered new interface driver uas
[   37.661292] queue_bulk_sg_tx: num=1 sg=880447764500 addr=45af41000 len=0 
pagelink=ea00116bd042
[   37.661550] queue_bulk_sg_tx: num=1 sg=8804483fb600 addr=45af41000 len=0 
pagelink=ea00116bd042
[   37.661744] scsi 6:0:0:0: Direct-Access Plugable USB3-SATA-UASP1  0
PQ: 0 ANSI: 6
[   37.661865] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 len=0 
pagelink=ea00116bd042
[   37.662053] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 len=0 
pagelink=ea00116bd042
[   37.662294] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45af41000 len=0 
pagelink=ea00116bd042
[   37.662488] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b6ab000 len=0 
pagelink=ea00116daac2
[   37.663041] sd 6:0:0:0: Attached scsi generic sg2 type 0
[   37.663138] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=44897c000 len=0 
pagelink=ea0011225f02
[   37.664420] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 
GiB)
[   37.664599] queue_bulk_sg_tx: num=1 sg=880447764400 addr=45b5c len=0 
pagelink=ea00116d7002
[   37.664833] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c len=0 
pagelink=ea00116d7002
[   37.665022] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c len=0 
pagelink=ea00116d7002
[   37.665255] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c len=0 
pagelink=ea00116d7002
[   37.665421] sd 6:0:0:0: [sdc] Write Protect is off
[   37.665532] queue_bulk_sg_tx: num=1 sg=88045b9e0a00 addr=45b5c len=0 
pagelink=ea00116d7002
[   37.665735] queue_bulk_sg_tx: num=1 sg=88045b9e0a00 addr=45b5c len=0 
pagelink=ea00116d7002
[   37.665877] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, 
doesn't support DPO or FUA
[   37.666003] queue_bulk_sg_tx: num=1 sg=88045b9e1700 addr=4587a8e00 len=0 
pagelink=ea001161ea02
[   37.666293] queue_bulk_sg_tx: num=1 sg=88045b9e1700 addr=45b5c len=0 
pagelink=ea00116d7002
[   37.670190] queue_bulk_sg_tx: num=1 sg=88045b9e1600 addr=44897c000 len=0 
pagelink=ea0011225f02
[   37.676364] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 
pagelink=ea00115da482
[   37.681800] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 
pagelink=ea00115da482
[   37.687125] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 
pagelink=ea00115da482
[   37.692335] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 
pagelink=ea00115da482
[   37.697451] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 
pagelink=ea00115da482
[   37.702429] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 
pagelink=ea00115da482
[   37.707312] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=457692000 len=0 
pagelink=ea00115da482
[   37.712109] queue_bulk_sg_tx: num=1 sg=88045b9e0e00 addr=448b56000 len=0 
pagelink=ea001122d582
[   38.077805] queue_bulk_sg_tx: num

Re: UAS crash with Apricorn USB3 SATA bridge

2014-12-10 Thread Darrick J. Wong

On Wed, Dec 10, 2014 at 02:29:29AM -0800, Darrick J. Wong wrote:
 On Wed, Dec 10, 2014 at 02:15:14AM -0800, Darrick J. Wong wrote:
  On Wed, Dec 10, 2014 at 01:04:58AM -0800, Darrick J. Wong wrote:
   On Wed, Dec 10, 2014 at 09:19:04AM +0100, Hans de Goede wrote:
Hi,

On 09-12-14 20:31, Darrick J. Wong wrote:
Hi,

I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
However, the kernel crashes when I plug it in[1].

Yes there are some known issues with uas error handling which are fixed
in 3.18, can you try with a 3.18 kernel please ?
   
   The crash pic was from 3.18.0, blk_mq disabled.  I'll work on getting a 
   fuller
   dmesg output.  Looking at the code, it looks like we end up in
   queue_bulk_sg_tx() with a sg list that is shorter than num_sgs, so we 
   fall off
   the end.

Well, there are (at least) two issues going on here.  The first is that the
SCSI layer passes us zero-length READ10 commands, which is causing this crash.
Zero length means the sglist is empty, so the usb host has nothing to map, and
hence urb->num_mapped_sgs == 0 and the loop goes boom.  I don't know what it
means to send a bulk URB with no buffers, so...

...then I took a tour of how SCSI LLDDs deal with zero-length read/write
commands.  mpt2sas attaches a junk sg and pushes the command out.  libata
detects zero-length READ/WRITE SCSI commands and completes the scsi command
without ever touching hardware.  I wasn't able to get any of my parallel SCSI
disks to boot, so I could not try that.
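
A minimal sketch of how a driver could spot such a command before touching
hardware, in the spirit of the libata behaviour described above.  It assumes
the standard READ(10) CDB layout (opcode 0x28, transfer length in bytes 7-8,
big-endian); the helper names are hypothetical and this is not the uas code:

```c
#include <stdbool.h>
#include <stdint.h>

#define READ_10_OPCODE 0x28	/* SCSI READ(10) operation code */

/* READ(10) carries its transfer length, in blocks, big-endian in bytes 7-8. */
static uint16_t read10_transfer_length(const uint8_t cdb[10])
{
	return (uint16_t)((cdb[7] << 8) | cdb[8]);
}

/* True when the command is a READ(10) that asks for no data at all. */
static bool is_zero_length_read10(const uint8_t cdb[10])
{
	return cdb[0] == READ_10_OPCODE && read10_transfer_length(cdb) == 0;
}
```

A caller could complete the command as successful when
is_zero_length_read10() returns true, instead of building a bulk URB with an
empty sg list.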

The other problem is when I plug in a different disk (same mfg/model), READ
CAPACITY 16 intermittently returns the string USBSUSBSUSBS, which of course
is garbage.  The kernel then tries to use these values; fortunately, it rejects
a sector size of 1431519827 (USBS) and sets the size to zero.
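
The arithmetic behind that number: bytes 8-11 of the READ CAPACITY 16 reply
hold the block length as a big-endian 32-bit value, and in the garbage reply
those bytes are "USBS".  A small sketch of the decode; be32_decode() is a
hypothetical helper written for this example:

```c
#include <stdint.h>

/* Decode four bytes as a big-endian 32-bit value, the byte order SCSI uses. */
static uint32_t be32_decode(const unsigned char b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

Feeding it 'U' (0x55), 'S' (0x53), 'B' (0x42), 'S' (0x53) gives 0x55534253,
which is 1431519827 in decimal.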

So, I can code up a couple of patches -- one to teach UAS how to deal with
zero-length reads and writes, and a second to set US_FL_IGNORE_UAS on Apricorn
bridges.  I tried setting US_FL_NO_READ_CAPACITY_16, but for whatever reason
sd.c was still trying RC16.

--D

   
   (Alas it's now 1am here, so I'm going to bed. :/ )
  
  Eh, nuts to sleeping.  dmesg produces this:
  
  [  231.128074] usbcore: registered new interface driver usb-storage
  [  231.133822] usbcore: registered new interface driver uas
  [  252.121353] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
  [  252.136927] scsi host6: uas
  [  252.141679] scsi 6:0:0:0: Direct-Access Apricorn  
  0128 PQ: 0 ANSI: 6
  [  252.145433] sd 6:0:0:0: Attached scsi generic sg2 type 0
  [  252.145525] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 
  GB/149 GiB)
  [  252.145527] sd 6:0:0:0: [sdc] 4096-byte physical blocks
  [  252.145891] sd 6:0:0:0: [sdc] Write Protect is off
  [  252.145973] sd 6:0:0:0: [sdc] No Caching mode page found
  [  252.145975] sd 6:0:0:0: [sdc] Assuming drive cache: write through
 
 Huh.  4096-byte physical blocks??  That drive is /not/ a 4k sector drive.
 Here's what the kernel said when I plugged in the other (Plugable brand) UAS
 bridge[1]:
 
 [   32.466870] usb 2-4: new SuperSpeed USB device number 2 using xhci_hcd
 [   32.498996] usbcore: registered new interface driver usb-storage
 [   37.660963] scsi host6: uas
 [   37.661193] usbcore: registered new interface driver uas
 [   37.661292] queue_bulk_sg_tx: num=1 sg=880447764500 addr=45af41000 
 len=0 pagelink=ea00116bd042
 [   37.661550] queue_bulk_sg_tx: num=1 sg=8804483fb600 addr=45af41000 
 len=0 pagelink=ea00116bd042
 [   37.661744] scsi 6:0:0:0: Direct-Access Plugable USB3-SATA-UASP1  0
 PQ: 0 ANSI: 6
 [   37.661865] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 
 len=0 pagelink=ea00116bd042
 [   37.662053] queue_bulk_sg_tx: num=1 sg=8804483fba00 addr=45af41000 
 len=0 pagelink=ea00116bd042
 [   37.662294] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45af41000 
 len=0 pagelink=ea00116bd042
 [   37.662488] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b6ab000 
 len=0 pagelink=ea00116daac2
 [   37.663041] sd 6:0:0:0: Attached scsi generic sg2 type 0
 [   37.663138] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=44897c000 
 len=0 pagelink=ea0011225f02
 [   37.664420] sd 6:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 
 GB/149 GiB)
 [   37.664599] queue_bulk_sg_tx: num=1 sg=880447764400 addr=45b5c 
 len=0 pagelink=ea00116d7002
 [   37.664833] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c 
 len=0 pagelink=ea00116d7002
 [   37.665022] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c 
 len=0 pagelink=ea00116d7002
 [   37.665255] queue_bulk_sg_tx: num=1 sg=88045b9e1200 addr=45b5c 
 len=0 pagelink=ea00116d7002
 [   37.665421] sd 6:0:0:0: [sdc] Write Protect is off
 [   37.665532] queue_bulk_sg_tx: num=1 sg=88045b9e0a00 addr=45b5c 
 len=0 pagelink=ea00116d7002
 [   37.665735

UAS crash with Apricorn USB3 SATA bridge

2014-12-09 Thread Darrick J. Wong

Hi,

I have an Apricorn USB 3 disk dongle thing that claims to support UAS.
However, the kernel crashes when I plug it in[1].

I'm not sure what this is caused by, but I also have an ASMedia 2105 SATA
bridge that works with UAS just fine.  Not sure if the Apricorn thing is simply
broken, or if this is a bug in UAS.

I've attached the lsusb -v output[2] if that'll help.  I can try to poke around
with the source code if there's time.

--D

[1] 
https://lh6.googleusercontent.com/-oiOwZmkROQk/VIdNGPTWFDI/C3w/bEw6fSmZpkc/s0-U-I/IMG_0167.JPG

[2] Bus 002 Device 004: ID 0984:0301 Apricorn 
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               3.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0         9
  idVendor           0x0984 Apricorn
  idProduct          0x0301 
  bcdDevice            1.28
  iManufacturer           1 Apricorn
  iProduct                2 
  iSerial                 3 303930363130464232323031
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength          121
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0 
    bmAttributes         0xc0
      Self Powered
    MaxPower                2mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk-Only
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x8b  EP 11 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x0a  EP 10 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       1
      bNumEndpoints           4
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     98 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x08  EP 8 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               0
        Command pipe (0x01)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x89  EP 9 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
        MaxStreams             32
        Status pipe (0x02)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x0a  EP 10 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
        MaxStreams             32
        Data-out pipe (0x04)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x8b  EP 11 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
        MaxStreams             32
        Data-in pipe (0x03)
Binary Object Store Descriptor:
  bLength                 5
  bDescriptorType        15
  wTotalLength           22
  bNumDeviceCaps          2
  USB 2.0 Extension Device Capability:
    bLength                 7
    bDescriptorType

Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function

2014-11-17 Thread Darrick J. Wong

On Fri, Nov 14, 2014 at 03:22:05PM -0500, Martin K. Petersen wrote:
  Martin == Martin K Petersen martin.peter...@oracle.com writes:
 
 Martin What would you prefer as the default for the ext4 use case? To
 Martin allocate or to discard?
 
 I didn't get a preference for whether sb_issue_zeroout() should discard
 or allocate.

In the discussions I've had on the ext4 list, we seem to be leaning towards
discard and falling back to allocate if necessary.
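
As a rough illustration of that preference, here is a sketch of the fallback
order the commit message quoted in this mail describes: discard first (only
when the device guarantees zeroes on read-back), then WRITE SAME, then a
regular write.  The struct, enum and function names are hypothetical and only
model the decision; they are not block-layer API:

```c
#include <stdbool.h>

/* Hypothetical device model: which zeroing methods work on this device. */
struct zero_caps {
	bool discard_zeroes_data;	/* discarded blocks read back as zeroes */
	bool discard_ok;		/* the discard request itself succeeds */
	bool write_same_ok;		/* WRITE SAME is supported and succeeds */
};

enum zero_method { ZERO_BY_DISCARD, ZERO_BY_WRITE_SAME, ZERO_BY_WRITE };

/* Pick the first method that works, in the order described above. */
static enum zero_method pick_zero_method(const struct zero_caps *caps,
					 bool discard_allowed)
{
	if (discard_allowed && caps->discard_zeroes_data && caps->discard_ok)
		return ZERO_BY_DISCARD;
	if (caps->write_same_ok)
		return ZERO_BY_WRITE_SAME;
	return ZERO_BY_WRITE;
}
```

With discard_allowed set to false the sketch never picks discard, which
corresponds to the "allocate" behaviour.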

--D

 
 But here's an updated patch 3...
 
 commit eb23c9e71e08b7f467cbc36990a1a01a94a7b959
 Author: Martin K. Petersen martin.peter...@oracle.com
 Date:   Thu Nov 6 14:36:05 2014 -0500
 
 block: Add discard flag to blkdev_issue_zeroout() function
 
 blkdev_issue_discard() will zero a given block range. This is done by
 way of explicit writing, thus provisioning or allocating the blocks on
 disk.
 
 There are use cases where the desired behavior is to zero the blocks but
 unprovision them if possible. The blocks must deterministically contain
 zeroes when they are subsequently read back.
 
 This patch adds a flag to blkdev_issue_zeroout() that provides this
 variant. If the discard flag is set and a block device guarantees
 discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
 the device does not support discard_zeroes_data or if the discard
 request fails we will fall back to first REQ_WRITE_SAME and then a
 regular REQ_WRITE.
 
 Also update the callers of blkdev_issue_zero() to reflect the new flag
 and make sb_issue_zeroout() prefer the discard approach.
 
 Signed-off-by: Martin K. Petersen martin.peter...@oracle.com
 
 diff --git a/block/blk-lib.c b/block/blk-lib.c
 index 8411be3c19d3..715e948f58a4 100644
 --- a/block/blk-lib.c
 +++ b/block/blk-lib.c
 @@ -283,23 +283,45 @@ static int __blkdev_issue_zeroout(struct block_device 
 *bdev, sector_t sector,
   * @sector:  start sector
   * @nr_sects:number of sectors to write
   * @gfp_mask:memory allocation flags (for bio_alloc)
 + * @discard: whether to discard the block range
   *
   * Description:
 - *  Generate and issue number of bios with zerofiled pages.
 +
 + *  Zero-fill a block range.  If the discard flag is set and the block
 + *  device guarantees that subsequent READ operations to the block range
 + *  in question will return zeroes, the blocks will be discarded. Should
 + *  the discard request fail, if the discard flag is not set, or if
 + *  discard_zeroes_data is not supported, this function will resort to
 + *  zeroing the blocks manually, thus provisioning (allocating,
 + *  anchoring) them. If the block device supports the WRITE SAME command
 + *  blkdev_issue_zeroout() will use it to optimize the process of
 + *  clearing the block range. Otherwise the zeroing will be performed
 + *  using regular WRITE calls.
   */
  
  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 -  sector_t nr_sects, gfp_t gfp_mask)
 +  sector_t nr_sects, gfp_t gfp_mask, bool discard)
  {
 + struct request_queue *q = bdev_get_queue(bdev);
 + unsigned char bdn[BDEVNAME_SIZE];
 +
 + if (discard && blk_queue_discard(q) && q->limits.discard_zeroes_data) {
 +
 + if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
 + return 0;
 +
 + bdevname(bdev, bdn);
 + pr_warn("%s: DISCARD failed. Manually zeroing.\n", bdn);
 + }
 +
   if (bdev_write_same(bdev)) {
 - unsigned char bdn[BDEVNAME_SIZE];
  
   if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
ZERO_PAGE(0)))
   return 0;
  
   bdevname(bdev, bdn);
 - pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
 + pr_warn("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
   }
  
   return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 diff --git a/block/ioctl.c b/block/ioctl.c
 index 6c7bf903742f..7d8befde2aca 100644
 --- a/block/ioctl.c
 +++ b/block/ioctl.c
 @@ -198,7 +198,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
 uint64_t start,
   if (start + len > (i_size_read(bdev->bd_inode) >> 9))
   return -EINVAL;
  
 - return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL);
 + return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
  }
  
  static int put_ushort(unsigned long arg, unsigned short val)
 diff --git a/drivers/block/drbd/drbd_receiver.c 
 b/drivers/block/drbd/drbd_receiver.c
 index 6960fb064731..ee5b9611c51c 100644
 --- a/drivers/block/drbd/drbd_receiver.c
 +++ b/drivers/block/drbd/drbd_receiver.c
 @@ -1388,7 +1388,7 @@ int drbd_submit_peer_request(struct drbd_device *device,
   list_add_tail(&peer_req->w.list, &device->active_ee);
   spin_unlock_irq(&device->resource->req_lock);

[PATCH] block: create ioctl to discard-or-zeroout a range of blocks

2014-11-17 Thread Darrick J. Wong

Create a new ioctl to expose the block layer's newfound ability to
issue either a zeroing discard, a WRITE SAME with a zero page, or a
regular write with the zero page.  This BLKZEROOUT2 ioctl takes
{start, length, flags} as parameters.  So far, the only flag available
is to enable the zeroing discard part -- without it, the call invokes
the old BLKZEROOUT behavior.  start and length have the same meaning
as in BLKZEROOUT.
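
For illustration, a userspace sketch of the argument checks that
blk_ioctl_zeroout() applies in this patch: unknown flag bits, 512-byte
alignment of start and length, and the range staying inside the device.  The
function name is hypothetical, and dev_size is an assumed stand-in for
i_size_read(bdev->bd_inode):

```c
#include <stdbool.h>
#include <stdint.h>

#define BLKZEROOUT2_DISCARD_OK 1	/* the only flag the patch defines */

/* Returns true when the patch would accept these arguments. */
static bool zeroout2_args_valid(uint64_t start, uint64_t len,
				uint32_t flags, uint64_t dev_size)
{
	uint64_t end = start + len - 1;	/* last byte of the range */

	if (flags & ~(uint32_t)BLKZEROOUT2_DISCARD_OK)
		return false;		/* unknown flag bits */
	if ((start & 511) || (len & 511))
		return false;		/* not 512-byte aligned */
	if (end >= dev_size)
		return false;		/* runs past the end of the device */
	return true;
}
```

Every rejected case here corresponds to an -EINVAL return in the patch.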

Furthermore, because BLKZEROOUT2 issues commands directly to the
storage device, we must invalidate the page cache (as a regular
O_DIRECT write would do) to avoid returning stale cache contents at a
later time.

This patch depends on mkp's earlier patch block: Introduce
blkdev_issue_zeroout_discard() function.

Signed-off-by: Darrick J. Wong darrick.w...@oracle.com
---
 block/ioctl.c   |   45 ++---
 include/uapi/linux/fs.h |7 +++
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7d8befd..ff623d5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -186,19 +186,39 @@ static int blk_ioctl_discard(struct block_device *bdev, 
uint64_t start,
 }
 
 static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
-uint64_t len)
+uint64_t len, uint32_t flags)
 {
+   int ret;
+   struct address_space *mapping;
+   uint64_t end = start + len - 1;
+
+   if (flags & ~BLKZEROOUT2_DISCARD_OK)
+   return -EINVAL;
if (start & 511)
return -EINVAL;
if (len & 511)
return -EINVAL;
-   start >>= 9;
-   len >>= 9;
-
-   if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+   if (end >= i_size_read(bdev->bd_inode))
return -EINVAL;
 
-   return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+   /* Invalidate the page cache, including dirty pages */
+   mapping = bdev->bd_inode->i_mapping;
+   truncate_inode_pages_range(mapping, start, end);
+
+   ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+  flags & BLKZEROOUT2_DISCARD_OK);
+   if (ret)
+   goto out;
+
+   /*
+* Invalidate again; if someone wandered in and dirtied a page,
+* the caller will be given -EBUSY.
+*/
+   ret = invalidate_inode_pages2_range(mapping,
+   start >> PAGE_CACHE_SHIFT,
+   end >> PAGE_CACHE_SHIFT);
+out:
+   return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -326,7 +346,18 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, 
unsigned cmd,
if (copy_from_user(range, (void __user *)arg, sizeof(range)))
return -EFAULT;
 
-   return blk_ioctl_zeroout(bdev, range[0], range[1]);
+   return blk_ioctl_zeroout(bdev, range[0], range[1], 0);
+   }
+   case BLKZEROOUT2: {
+   struct blkzeroout2 p;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   if (copy_from_user(&p, (void __user *)arg, sizeof(p)))
+   return -EFAULT;
+
+   return blk_ioctl_zeroout(bdev, p.start, p.length, p.flags);
}
 
case HDIO_GETGEO: {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..54d24ea 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -150,6 +150,13 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+struct blkzeroout2 {
+   __u64 start;
+   __u64 length;
+   __u32 flags;
+};
+#define BLKZEROOUT2_DISCARD_OK 1
+#define BLKZEROOUT2 _IOR(0x12, 127, struct blkzeroout2)
 
 #define BMAP_IOCTL 1   /* obsolete - kept for compatibility */
 #define FIBMAP _IO(0x00,1)  /* bmap access */
--

Re: [PATCH 3/3] block: Introduce blkdev_issue_zeroout_discard() function

2014-11-10 Thread Darrick J. Wong

On Fri, Nov 07, 2014 at 12:08:14AM -0500, Martin K. Petersen wrote:
 blkdev_issue_discard() will zero a given block range on disk. This is
 done by way of either WRITE SAME or regular WRITE. I.e. the blocks on
 disk will be written and thus provisioned.
 
 There are use cases where the desired behavior is to zero the blocks but
 unprovision them if possible. The blocks must deterministically contain
 zeroes when they are subsequently read back.
 
 This patch introduces a blkdev_issue_zeroout_discard() call that
 provides this functionality. If a block device guarantees
 discard_zeroes_data the new function will use discard to clear the block
 range. If the device does not support discard_zeroes_data or if the
 discard request fails we will fall back to blkdev_issue_zeroout() to
 ensure predictable results.

Can this be plumbed into a BLK* ioctl too?  I'll write a patch, if this is ok
with everyone:

struct blkzeroout_t {
__u64 start;
__u64 end;
__u32 flags;
};
#define BLKZEROOUT_DISCARD_OK   1

#define BLKZEROOUT_V2   _IOR(0x12, 127, sizeof(struct blkzeroout_t))

...and make it zap the page cache per earlier discussion.  This seems to be a
good fit with what we've been discussing for mke2fs.

--D

 
 Signed-off-by: Martin K. Petersen martin.peter...@oracle.com
 ---
  block/blk-lib.c| 44 ++--
  include/linux/blkdev.h |  2 ++
  2 files changed, 44 insertions(+), 2 deletions(-)
 
 diff --git a/block/blk-lib.c b/block/blk-lib.c
 index 8411be3c19d3..2ffec6a01c71 100644
 --- a/block/blk-lib.c
 +++ b/block/blk-lib.c
 @@ -278,14 +278,18 @@ static int __blkdev_issue_zeroout(struct block_device 
 *bdev, sector_t sector,
  }
  
  /**
 - * blkdev_issue_zeroout - zero-fill a block range
 + * blkdev_issue_zeroout - zero-fill and provision a block range
   * @bdev:blockdev to write
   * @sector:  start sector
   * @nr_sects:number of sectors to write
   * @gfp_mask:memory allocation flags (for bio_alloc)
   *
   * Description:
 - *  Generate and issue number of bios with zerofiled pages.
 + *  Zero-fill a block range. The blocks will be provisioned
 + *  (allocated/anchored) and are guaranteed to return zeroes when read
 + *  back. This function will attempt to use WRITE SAME to optimize the
 + *  process if the block device supports it. Otherwise it will fall back
 + *  to zeroing the blocks using regular WRITE calls.
   */
  
  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 @@ -305,3 +309,39 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
 sector_t sector,
   return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
  }
  EXPORT_SYMBOL(blkdev_issue_zeroout);
 +
 +/**
 + * blkdev_issue_zeroout_discard - zero-fill and attempt to discard block 
 range
 + * @bdev:blockdev to write
 + * @sector:  start sector
 + * @nr_sects:number of sectors to write
 + * @gfp_mask:memory allocation flags (for bio_alloc)
 + *
 + * Description:
 + *  Zero-fill a block range. In contrast to blkdev_issue_zeroout() this
 + *  function will attempt to deprovision (deallocate/discard) the blocks
 + *  in question. It will only do so if the underlying device guarantees
 + *  that subsequent READ operations to the block range in question will
 + *  return zeroes. If the device does not provide hard guarantees or if
 + *  the DISCARD attempt should fail the block range will be explicitly
 + *  zeroed using blkdev_issue_zeroout().
 + */
 +
 +int blkdev_issue_zeroout_discard(struct block_device *bdev, sector_t sector,
 +  sector_t nr_sects, gfp_t gfp_mask)
 +{
 + struct request_queue *q = bdev_get_queue(bdev);
 +
 + if (blk_queue_discard(q) && q->limits.discard_zeroes_data) {
 + unsigned char bdn[BDEVNAME_SIZE];
 +
 + if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, 0))
 + return 0;
 +
 + bdevname(bdev, bdn);
 + pr_err("%s: DISCARD failed. Manually zeroing.\n", bdn);
 + }
 +
 + return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 +}
 +EXPORT_SYMBOL(blkdev_issue_zeroout_discard);
 diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
 index aac0f9ea952a..078b6e5f488a 100644
 --- a/include/linux/blkdev.h
 +++ b/include/linux/blkdev.h
 @@ -1164,6 +1164,8 @@ extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
   sector_t nr_sects, gfp_t gfp_mask, struct page *page);
  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
   sector_t nr_sects, gfp_t gfp_mask);
 +extern int blkdev_issue_zeroout_discard(struct block_device *bdev,
 + sector_t sector, sector_t nr_sects, gfp_t gfp_mask);
  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
   sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
  {
 -- 
 1.9.3
 

Re: [PATCH 1/6] fs/bio-integrity: remove duplicate code

2014-04-02 Thread Darrick J. Wong

On Wed, Apr 02, 2014 at 12:17:58PM -0700, Zach Brown wrote:
  +static int bio_integrity_generate_verify(struct bio *bio, int operate)
   {
 
  +   if (operate)
  +   sector = bio->bi_iter.bi_sector;
  +   else
  +   sector = bio->bi_integrity->bip_iter.bi_sector;
 
  +   if (operate) {
  +   bi->generate_fn(bix);
  +   } else {
  +   ret = bi->verify_fn(bix);
  +   if (ret) {
  +   kunmap_atomic(kaddr);
  +   return ret;
  +   }
  +   }
 
 I was glad to see this replaced with explicit sector and func arguments
 in later refactoring in the 6/ patch.
 
 But I don't think the function pointer casts in that 6/ patch are wise
 (Or even safe all the time, given crazy function pointer trampolines?
 Is that still a thing?).  I'd have made a single walk_fn type that
 returns and have the non-returning iterators just return 0.

Noted.  I cleaned all that crap out just yesterday, so now there's only one
walk function and some context data that gets passed to the iterator function.
Much less horrifying.
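
As an illustration only (this is not code from the patch set), the "one walk function plus context data" shape can be sketched in userspace C. Every name below (walk_fn, walk_sectors, struct verify_ctx and the two callbacks) is made up for the example; the point is that every iterator returns int, and the ones that cannot fail simply return 0, so no function pointer casts are needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a single callback type shared by every per-sector operation. */
typedef int (*walk_fn)(void *ctx, uint64_t sector, uint8_t *data, size_t len);

/* Walk a buffer one sector at a time; stop at the first non-zero return. */
static int walk_sectors(uint8_t *buf, size_t len, size_t sector_size,
			uint64_t first_sector, walk_fn fn, void *ctx)
{
	uint64_t sector = first_sector;
	size_t off;

	if (sector_size == 0 || len % sector_size != 0)
		return -1;

	for (off = 0; off < len; off += sector_size, sector++) {
		int ret = fn(ctx, sector, buf + off, sector_size);

		if (ret)
			return ret;
	}
	return 0;
}

/* A "generate"-style callback cannot fail, so it just returns 0. */
static int count_sectors(void *ctx, uint64_t sector, uint8_t *data, size_t len)
{
	(void)sector;
	(void)data;
	(void)len;
	(*(unsigned int *)ctx)++;
	return 0;
}

/* A "verify"-style callback reports the first bad sector through its context. */
struct verify_ctx {
	uint8_t expect;
	uint64_t bad_sector;
};

static int verify_fill(void *ctx, uint64_t sector, uint8_t *data, size_t len)
{
	struct verify_ctx *v = ctx;
	size_t i;

	for (i = 0; i < len; i++) {
		if (data[i] != v->expect) {
			v->bad_sector = sector;
			return -2;
		}
	}
	return 0;
}
```

Anything an operation needs beyond the sector and data pointer travels in the context struct, which is what keeps the callback signature uniform.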

(I really only included this patch so that I'd have less rebasing work when
3.15-rc1 comes out.)

--D
 
 - z
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] aio/dio: enable PI passthrough

2014-04-02 Thread Darrick J. Wong

On Wed, Apr 02, 2014 at 01:01:33PM -0700, Zach Brown wrote:
  +static int setup_pi_ext(struct kiocb *req, int is_write)
  +{
  +   struct file *file = req->ki_filp;
  +   struct io_extension *ext = &req->ki_ioext->ke_kern;
  +   void *p;
  +   unsigned long start, end;
  +   int retval;
  +
  +   if (!(file->f_flags & O_DIRECT)) {
  +   pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
  +   return -EINVAL;
  +   }
  +
  +   BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
  +
  +   end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
  +   PAGE_SIZE - 1) >> PAGE_SHIFT;
  +   start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
  +   req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
  +   req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
  +   req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
  +   p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
  +   sizeof(struct page *),
  +   GFP_NOIO);
 
 Can userspace give us bad data and get us to generate insane allocation
 attempt warnings?

Easily.  One of the bits I have to work on for the PI part is figuring out how
to check with the PI provider that the arguments (the iovec and the pi buffer)
actually make any sense, in terms of length and alignment requirements (PI
tuples can't cross pages).  I think it's as simple as adding a bio_integrity
ops call, and then calling down to it from the kiocb level.
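
For what it's worth, the cheapest part of such a check (bounding the page-array allocation before kzalloc ever sees it) can be sketched in userspace C. This is only a sketch under stated assumptions: 4 KiB pages, and a cap of 512 pages chosen arbitrarily for the example. pi_span_pages, SK_PAGE_SHIFT and MAX_PI_PAGES are invented names, not anything from the patches.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define MAX_PI_PAGES	512	/* assumption: arbitrary cap for the sketch */

/*
 * Return how many pages the user PI buffer [addr, addr + len) touches,
 * or -1 if the arguments are unusable (empty, wrapping, or too large).
 */
static long pi_span_pages(uint64_t addr, uint32_t len)
{
	uint64_t start, end, nr;

	if (len == 0)
		return -1;
	if (addr + len < addr)		/* address range wraps around */
		return -1;

	start = addr >> SK_PAGE_SHIFT;
	/* index of the page holding the last byte, plus one */
	end = ((addr + len - 1) >> SK_PAGE_SHIFT) + 1;
	nr = end - start;

	if (nr > MAX_PI_PAGES)
		return -1;
	return (long)nr;
}
```

When nothing overflows this yields the same count as the start/end computation in the quoted hunk; the difference is that a bogus length is rejected up front instead of turning into an oversized allocation attempt.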

One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
not # of struct iovecs) that I can throw at the kernel?

  +   if (p == NULL) {
  +   pr_err("%s: no room for page array?\n", __func__);
  +   return -ENOMEM;
  +   }
  +   req->ki_ioext->ke_pi_iter.pi_userpages = p;
  +
  +   retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
  +req->ki_ioext->ke_pi_iter.pi_nrpages,
  +is_write,
 
 Isn't this is_write backwards?  If it's a write syscall then the PI
 pages is going to be read from.

Yes, I think so.  Good catch!

--D
 
 - z

Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Darrick J. Wong

On Wed, Apr 02, 2014 at 03:22:20PM -0400, Jeff Moyer wrote:
 Darrick J. Wong darrick.w...@oracle.com writes:
 
  Define a generic interface to allow userspace to attach metadata to an
  IO operation.  This interface will be used initially to implement
  protection information (PI) pass through, though it ought to be usable
  by anyone else desiring to extend the IO interface.  It should not be
  difficult to modify the non-AIO calls to use this mechanism.
 
 My main issue with this patch is determining what exactly gets returned
 to userspace when there is an issue in the teardown_extensions path.
 It looks like you'll get the first error propagated from
 io_teardown_extensions, others are ignored.  Then, in aio_complete, if
 there was no error with the I/O, then you'll get the teardown error
 reported in event->res, otherwise you'll get it in event->res2.  So,
 what are the valid errors returned by the teardown routine for
 extensions?  How is the userspace app supposed to determine where the
 error came from, the I/O or a failure in the extension teardown?

There's also the question of which extension spat out the error.  One solution
would be to augment struct io_extension with all the error fields that we want
(an extension can declare its own if needed) as we do now, and if errors happen
during setup, we can just copy_to_user them back.  If nothing else fails with
the IO setup, the setup routine can return -EINVAL, and userspace can look for
updated error fields in the struct.

Unfortunately for the teardown error case you'd have to pin the whole page in
memory for the duration of the IO just to have it around.  For now this isn't a
problem because teardown can't fail anyway.

 I think it may make sense to only use res2 for reporting io extension
 teardown failures.  Any new code that will use extensions can certainly
 be written to check both res and res2, and this method would prevent the
 ambiguity I mentioned.

Hmm, doesn't look like anyone actually uses res2 except for USB gadgets.

It's tempting just to shove the first ioextension error code that comes along
into res2 and abort the whole thing, and let userspace guess where the res2
code came from.  I think there's an additional problem with stuffing return
codes: in the case of synchronous IO syscalls, we'd have to deal with how to
cram error codes from (potentially) multiple sources into the single return
value, while not giving userspace any help as to where the code came from.

Now that I've written all that out, I don't like this idea so I'll drop it. :)

 Finally, I know this is an RFC, but please add some man-page changes to
 your patch set, and CC linux-man.  Michael Kerrisk typically has
 valuable advice on new APIs.

I'll do that the next time I rev the patches.  Thank you for the suggestion.

--D
 
 Cheers,
 Jeff
 
 
  Signed-off-by: Darrick J. Wong darrick.w...@oracle.com
  ---
   fs/aio.c |  180 
  +-
   include/linux/aio.h  |7 ++
   include/uapi/linux/aio_abi.h |   15 +++-
   3 files changed, 197 insertions(+), 5 deletions(-)
 
 
  diff --git a/fs/aio.c b/fs/aio.c
  index 062a5f6..0c40bdc 100644
  --- a/fs/aio.c
  +++ b/fs/aio.c
  @@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
   static const struct file_operations aio_ring_fops;
   static const struct address_space_operations aio_ctx_aops;
   
  +static int io_teardown_extensions(struct kiocb *req);
  +static int io_setup_extensions(struct kiocb *req, int is_write,
  +  struct io_extension __user *ioext);
  +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
  +
   static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
   {
  struct qstr this = QSTR_INIT("[aio]", 5);
  @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
  struct io_event *ev_page, *event;
  unsigned long   flags;
  unsigned tail, pos;
  +   int ret;
  +
  +   ret = io_teardown_extensions(iocb);
  +   if (ret) {
  +   if (!res)
  +   res = ret;
  +   else if (!res2)
  +   res2 = ret;
  +   else
  +   pr_err("error %d tearing down aio extensions\n", ret);
  +   }
   
  /*
   * Special case handling for sync iocbs:
  @@ -1350,15 +1366,167 @@ rw_common:
  return 0;
   }
   
  +/* IO extension code */
  +#define REQUIRED_STRUCTURE_SIZE(type, member)  \
  +   (offsetof(type, member) + sizeof(((type *)NULL)->member))
  +#define IO_EXT_SIZE(member) \
  +   REQUIRED_STRUCTURE_SIZE(struct io_extension, member)
  +
  +struct io_extension_type {
  +   unsigned int type;
  +   unsigned int extension_struct_size;
  +   int (*setup_fn)(struct kiocb *, int is_write);
  +   int (*destroy_fn)(struct kiocb *);
  +};
  +
  +static struct io_extension_type extensions[] = {
  +   {IO_EXT_INVALID, 0, NULL, NULL},
  +};
  +
  +static int is_write_iocb(struct iocb *iocb

Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Darrick J. Wong

On Wed, Apr 02, 2014 at 12:49:47PM -0700, Zach Brown wrote:
  @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
  struct io_event *ev_page, *event;
  unsigned long   flags;
  unsigned tail, pos;
  +   int ret;
  +
  +   ret = io_teardown_extensions(iocb);
  +   if (ret) {
  +   if (!res)
  +   res = ret;
  +   else if (!res2)
  +   res2 = ret;
  +   else
  +   pr_err("error %d tearing down aio extensions\n", ret);
  +   }
 
 This ends up trying to copy the kernel's io_extension copy back to
 userspace from interrupts, which obviously won't fly.
 
 And to what end?  So that maybe someone can later add an 'extension'
 that can fill in some field that's then copied to userspace?  But by
 copying the entire argument struct back?
 
 Let's not get ahead of ourselves.  If they're going to try and give
 userspace some feedback after IO completion they're going to have to try
 a lot harder because they don't have access to the submitting task
 context anymore.  They'd have to pin some reference to a feedback
 mechanism in the in-flight io.  I think we'd want that explicit in the
 iocb, not hiding off on the other side of this extension interface.

I think we'd want to find an extension that really needs this.  PI doesn't.
We can skate by without supporting the teardown errors case for now.

 I'd just remove this generic teardown callback path entirely.  If
 there's PI state hanging off the iocb tear it down during iocb teardown.

Hmm, I thought aio_complete /was/ iocb teardown time.

  +struct io_extension_type {
  +   unsigned int type;
  +   unsigned int extension_struct_size;
  +   int (*setup_fn)(struct kiocb *, int is_write);
  +   int (*destroy_fn)(struct kiocb *);
  +};
 
 I'd also get rid of all of this.  More below.
 
  +static int io_setup_extensions(struct kiocb *req, int is_write,
  +  struct io_extension __user *ioext)
  +{
  +   struct io_extension_type *iet;
  +   __u64 sz, has;
  +   int ret;
  +
  +   /* Check size of buffer */
  +   if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
  +   return -EFAULT;
  +   if (sz > PAGE_SIZE ||
  +   sz > sizeof(struct io_extension) ||
  +   sz < IO_EXT_SIZE(ie_has))
  +   return -EINVAL;
  +
  +   /* Check that the buffer's big enough */
  +   if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
  +   return -EFAULT;
  +   ret = io_check_bufsize(has, sz);
  +   if (ret)
  +   return ret;
  +
  +   /* Copy from userland */
  +   req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
  +   if (!req->ki_ioext)
  +   return -ENOMEM;
  +
  +   req->ki_ioext->ke_user = ioext;
  +   if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) {
  +   ret = -EFAULT;
  +   goto out;
  +   }
 
 (Isn't there some allocate-and-copy-from-userspace helper now? But..)

<shrug> Is there?  I didn't find one when I looked, but it wasn't an exhaustive
search.

 I don't like the redundancy of the implicit size requirement by a
 field's flag being set being duplicated by the explicit size argument.
 What does that give us, exactly?

Either another sanity check or another way to screw up, depending on how you
look at it.  I'd been considering shortening the size field to u32 and adding a
magic number field, but I wonder if that's really necessary.  Seems like it
shouldn't be -- if userland screws up, it's not hard to kill the process.
(Or segv it, or...)

 Our notion of the total size seems to matter only if we're copying
 the entire struct from userspace, and I don't think we need to do that.
 
 For each argument, we're translating it into some kernel equivalent,
 right?

Yes.

 Fields in the iocb  As each of these are initialized I'd just
 test the presence bits and __get_user() the userspace arguments
 directly, or copy_from_user() something slightly more complicated on to
 the stack.

 That gets rid of us having to care about the size at all.  It stops us
 from allocating a kernel copy and pinning it for the duration of the IO.
 We'd just be sampling the present userspace arguments as we initialize
 the iocb during submission.

I like this idea.  For the PI extension, nothing particularly error-prone
happens in teardown, which allows the flexibility to copy_from_user any
arguments required, and to copy_to_user any setup errors that happen.  I can
get rid of a lot of allocate-and-copy nonsense, as you point out.

Ok, I'll migrate my patches towards this strategy, and let's see how much code
goes away. :)
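
To make the "sample the arguments at submission" idea concrete, here is a rough userspace model of it, not the eventual kernel code. The struct layouts and the names sk_user_ioext, sk_pi_args, sk_fetch and sk_sample_pi are invented for this sketch, and sk_fetch is only a stand-in for __get_user()/copy_from_user().

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SK_IO_EXT_PI	1ULL	/* presence bit, mirrors IO_EXT_PI */

/* Hypothetical user-visible layout, trimmed down for the sketch. */
struct sk_user_ioext {
	uint64_t ie_has;
	uint64_t ie_pi_buf;
	uint32_t ie_pi_buflen;
	uint32_t ie_pi_flags;
};

/* Kernel-side equivalent that lives on the stack during submission. */
struct sk_pi_args {
	int present;
	uint64_t buf;
	uint32_t buflen;
	uint32_t flags;
};

/* Stand-in for __get_user()/copy_from_user(); the real ones can fault. */
static int sk_fetch(void *dst, const void *usrc, size_t n)
{
	if (!usrc)
		return -EFAULT;
	memcpy(dst, usrc, n);
	return 0;
}

/* Read ie_has first, then fetch only the fields whose bit is set. */
static int sk_sample_pi(const struct sk_user_ioext *uext,
			struct sk_pi_args *out)
{
	uint64_t has;

	memset(out, 0, sizeof(*out));
	if (!uext)
		return -EFAULT;
	if (sk_fetch(&has, &uext->ie_has, sizeof(has)))
		return -EFAULT;
	if (has & ~SK_IO_EXT_PI)	/* unknown extension requested */
		return -EINVAL;
	if (!(has & SK_IO_EXT_PI))	/* nothing to sample */
		return 0;

	if (sk_fetch(&out->buf, &uext->ie_pi_buf, sizeof(out->buf)) ||
	    sk_fetch(&out->buflen, &uext->ie_pi_buflen, sizeof(out->buflen)) ||
	    sk_fetch(&out->flags, &uext->ie_pi_flags, sizeof(out->flags)))
		return -EFAULT;

	out->present = 1;
	return 0;
}
```

No total size is consulted and nothing is allocated: the presence bits alone decide which fields get read.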

I've also noticed a bug where if you make one of these PI-extended calls on a
file living on a filesystem, it'll extend the io request's range to be
filesystem block-aligned, which causes all kinds of havoc with the user
provided PI buffers, since they now need to be extended to fit the added
blocks.  Alternately, one could require PI IOs to be fs-block

Re: [PATCH 3/6] aio/dio: enable PI passthrough

2014-04-02 Thread Darrick J. Wong

On Wed, Apr 02, 2014 at 03:33:11PM -0700, Zach Brown wrote:
  One thing I'm not sure about: What's the largest IO (in terms of # of blocks,
  not # of struct iovecs) that I can throw at the kernel?
 
 Yeah, dunno.  I'd guess big :).  I'd hope that the PI code already has a
 way to clamp the size of bios if there's a limit to the size of PI data
 that can be managed downstream?

I guess if we restricted the size of the PI buffer to a page's worth of
pointers to struct page, that limits us to 128M on x64 with DIF and 512b
sectors.  That's not really a whole lot; I suppose one could (ab)use vmalloc.
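
The arithmetic behind the 128M figure, spelled out as a small C helper. The function name and parameters are made up for the example; the inputs assumed are 4096-byte pages, 8-byte pointers, 8-byte DIF tuples and 512-byte sectors.

```c
#include <stdint.h>

/*
 * Largest IO (in bytes) whose PI buffer can be described by a single
 * page full of struct page pointers.
 */
static uint64_t max_io_bytes(uint64_t page_size, uint64_t ptr_size,
			     uint64_t tuple_size, uint64_t sector_size)
{
	uint64_t pi_pages = page_size / ptr_size;	/* 4096 / 8 = 512 pages */
	uint64_t pi_bytes = pi_pages * page_size;	/* 512 * 4096 = 2 MiB of PI */
	uint64_t sectors = pi_bytes / tuple_size;	/* 2 MiB / 8 = 262144 sectors */

	return sectors * sector_size;			/* 262144 * 512 = 128 MiB */
}
```

With a 4096-byte sector size and the same 8-byte tuple, the same page of pointers would cover 1 GiB.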

Yes, blk-integrity clamps the size of the bio to fit the downstream device's
maximum integrity sg size.  See max_integrity_segments for details, or the
mostly-undocumented sg_prot_tablesize sysfs attribute that reveals it.

I don't know what a practical limit is; scsi_debug sets it to 65536.

--D
 
 - z

Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Darrick J. Wong

On Wed, Apr 02, 2014 at 03:53:33PM -0700, Zach Brown wrote:
   I'd just remove this generic teardown callback path entirely.  If
   there's PI state hanging off the iocb tear it down during iocb teardown.
  
  Hmm, I thought aio_complete /was/ iocb teardown time.
 
 Well, usually :).  If you build up before aio_run_iocb() then you need
 to teardown in kiocb_free(), which is also called by aio_complete().

Oh, yeah.  I handle that by tearing down the extensions if stuff fails, though
I don't remember if that was in this version of the patchset.

   (Isn't there some allocate-and-copy-from-userspace helper now? But..)
  
  <shrug> Is there?  I didn't find one when I looked, but it wasn't an exhaustive
  search.
 
 I could have sworn that I saw something.. ah, right, memdup_user().

Noted. :)

   I don't like the redundancy of the implicit size requirement by a
   field's flag being set being duplicated by the explicit size argument.
   What does that give us, exactly?
  
  Either another sanity check or another way to screw up, depending on how you
  look at it.  I'd been considering shortening the size field to u32 and adding a
  magic number field, but I wonder if that's really necessary.  Seems like it
  shouldn't be -- if userland screws up, it's not hard to kill the process.
  (Or segv it, or...)
 
 I don't think I'd bother.  The bits should be enough and are already
 necessary to have explicit indicators of fields being set.

<nod>

   Fields in the iocb  As each of these are initialized I'd just
   test the presence bits and __get_user() the userspace arguments
   directly, or copy_from_user() something slightly more complicated on to
   the stack.
  
   That gets rid of us having to care about the size at all.  It stops us
   from allocating a kernel copy and pinning it for the duration of the IO.
   We'd just be sampling the present userspace arguments as we initialize
   the iocb during submission.
  
  I like this idea.  For the PI extension, nothing particularly error-prone
  happens in teardown, which allows the flexibility to copy_from_user any
  arguments required, and to copy_to_user any setup errors that happen.  I can
  get rid of a lot of allocate-and-copy nonsense, as you point out.
  
  Ok, I'll migrate my patches towards this strategy, and let's see how much code
  goes away. :)
 
 Cool :).
 
  I've also noticed a bug where if you make one of these PI-extended calls on a
  file living on a filesystem, it'll extend the io request's range to be
  filesystem block-aligned, which causes all kinds of havoc with the user
  provided PI buffers, since they now need to be extended to fit the added
  blocks.  Alternately, one could require PI IOs to be fs-block aligned when
  dealing with regular files. 
 
 I think, like O_DIRECT, it just has to be aligned or fail :(.

Heh.  O_DIRECT is a hilarious maze of twisty unobvious requirements.  Yuck.

#define O_IMNAIVEENOUGHTOTHINKIKNOWWHATTHISDOES O_DIRECT
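
The "aligned or fail" rule for PI IOs on regular files could look roughly like the check below. It is a sketch only: check_pi_io_aligned is an invented name, and it assumes the filesystem block size is a power of two.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Reject a PI-carrying IO unless both its file offset and its length are
 * multiples of the filesystem block size, so the request never has to be
 * widened behind the caller's back.
 */
static int check_pi_io_aligned(uint64_t offset, uint64_t len,
			       uint32_t fs_block_size)
{
	uint64_t mask;

	/* the sketch only handles power-of-two block sizes */
	if (fs_block_size == 0 || (fs_block_size & (fs_block_size - 1)))
		return -EINVAL;
	if (len == 0)
		return -EINVAL;

	mask = (uint64_t)fs_block_size - 1;
	if ((offset | len) & mask)
		return -EINVAL;
	return 0;
}
```

Failing early this way means the user-supplied PI buffer always matches the blocks actually written, at the cost of pushing the alignment burden onto the caller.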

--D
 
 - z

[RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO

2014-03-24 Thread Darrick J. Wong

This RFC provides a rough implementation of a mechanism to allow
userspace to attach protection information (e.g. T10 DIF) data to a
disk write and to receive the information alongside a disk read.
There's a new IO extension interface wherein we define a structure
(per zab's comments on the v2 series) io_extension that points to the
PI data buffer.  These patches are against 3.14-rc7.

NOTE: As far as I know this works, but this is just a refresh of last
week's patchset to start the discussion at LSF, which was moved up to
today.  I've not done rigorous testing, hence the 'donotmerge'.

The first patch is a little bit of code refactoring, as sent in by Gu
Zheng.  It seems to be queued up for 3.15, so I figured I might as well
start from there.

Patch #2 implements a generic IO extension interface so that we can
receive a struct io_extension from userspace containing the structure
size, a flag telling us which extensions we'd like to use (ie_has),
and (eventually) extension data.  There's a small framework for
mapping ie_has bits to actual extensions.
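
As a rough userspace model of that framework (a sketch, not the patch itself): each table entry pairs an ie_has bit with the minimum structure size needed to carry that extension's fields, and the check rejects unknown bits and undersized buffers. The sk_ prefixed names are invented; the struct layout follows the sample program's struct io_extension.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_EXT_INVALID	0
#define SK_EXT_PI	1

struct sk_io_extension {
	uint64_t ie_size;
	uint64_t ie_has;
	uint64_t ie_pi_buf;
	uint32_t ie_pi_buflen;
	uint32_t ie_pi_ret;
	uint32_t ie_pi_flags;
};

/* Bytes needed for the struct to include everything up to 'member'. */
#define SK_EXT_SIZE(member) \
	(offsetof(struct sk_io_extension, member) + \
	 sizeof(((struct sk_io_extension *)0)->member))

struct sk_ext_type {
	uint64_t type;		/* bit in ie_has */
	uint64_t min_size;	/* smallest ie_size that can carry it */
};

/* Table terminated by SK_EXT_INVALID, as in the patch's extensions[]. */
static const struct sk_ext_type sk_extensions[] = {
	{ SK_EXT_PI, SK_EXT_SIZE(ie_pi_flags) },
	{ SK_EXT_INVALID, 0 },
};

static int sk_check_bufsize(uint64_t has, uint64_t size)
{
	const struct sk_ext_type *iet;
	uint64_t all_flags = 0;

	for (iet = sk_extensions; iet->type != SK_EXT_INVALID; iet++) {
		all_flags |= iet->type;
		if (!(has & iet->type))
			continue;
		if (iet->min_size > size)	/* buffer too small */
			return -EINVAL;
	}

	if (has & ~all_flags)			/* unknown extension bit */
		return -EINVAL;
	return 0;
}
```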

Patch #3 provides the plumbing to get the user's buffer all the way to
the block integrity code.  Due to the way that the code deals with the
array of struct page*s that represent the PI buffer, there's an
unfortunate requirement that no PI tuple may cross a page boundary.
Given that so far DIF is only 8 or 16 bytes this isn't a problem yet.
There's also no explicit fallback for the case where the user pages
are not within a device's DMA range.  This patch hooks into the IO
extension interface.
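
The page-boundary requirement can be stated as a small check. The sketch below is illustrative only; pi_tuples_fit_pages is an invented name and 4 KiB pages are assumed.

```c
#include <stdint.h>

#define SK_PI_PAGE_SIZE	4096ULL	/* assumption: 4 KiB pages */

/*
 * Return 1 if every tuple in the user PI buffer lies entirely within one
 * page, 0 otherwise (including when buflen is not a whole number of tuples).
 */
static int pi_tuples_fit_pages(uint64_t buf_addr, uint32_t buflen,
			       uint32_t tuple_size)
{
	uint64_t off;

	if (tuple_size == 0 || tuple_size > SK_PI_PAGE_SIZE)
		return 0;
	if (buflen % tuple_size)
		return 0;

	for (off = 0; off < buflen; off += tuple_size) {
		uint64_t in_page = (buf_addr + off) & (SK_PI_PAGE_SIZE - 1);

		if (in_page + tuple_size > SK_PI_PAGE_SIZE)
			return 0;	/* this tuple straddles two pages */
	}
	return 1;
}
```

For 8- or 16-byte tuples the loop is equivalent to requiring the buffer's start address to be a multiple of the tuple size, since both sizes divide the page size evenly.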

Patch #4 builds on the previous patch to allow userspace to send some
flags along with the PI buffer.  The integrity provider now has a
mod_user_buf_fn hook that enables the provider to read the userspace
flags and modify the PI buffer before submit_bio.  For now, this means
that T10/DIF provider can be told to patch any of the reference, app,
or guard tags.  This is useful for sending PI data with an IO request
for a file on a filesystem, since the kernel can patch in the device's
LBA later.  Also it means that if you only care about, say, app tags,
you can provide those and let the kernel take care of the crc and the
LBA.  I don't know if that's anyone's requirement, but there we are.
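
A userspace approximation of what the flags select, loosely modelled on the sd_dif_type1_generate change in patch #4. It is a sketch under assumptions: the sk_ names are invented, the guard is computed with a plain bitwise CRC using the T10 DIF polynomial 0x8bb7 rather than the kernel's helper, and only the Type 1 style (reference tag = low 32 bits of the sector) is shown.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_GENERATE_GUARD	1
#define SK_GENERATE_REF		2
#define SK_GENERATE_APP		4
#define SK_GENERATE_ALL		7

/* Tuple layout: 16-bit guard, 16-bit app, 32-bit ref, all big-endian. */
struct sk_dif_tuple {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;
};

/* Bitwise CRC16, polynomial 0x8bb7, initial value 0, MSB first. */
static uint16_t sk_crc_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8bb7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/* Fill in only the tags that the caller asked to have generated. */
static int sk_dif_fill(const uint8_t *data, size_t data_size,
		       size_t sector_size, uint64_t sector,
		       struct sk_dif_tuple *sdt, int flags)
{
	size_t off;

	if (flags & ~SK_GENERATE_ALL)
		return -EINVAL;
	if (!flags)
		return -ENOTTY;
	if (sector_size == 0 || data_size % sector_size)
		return -EINVAL;

	for (off = 0; off < data_size; off += sector_size, sdt++, sector++) {
		if (flags & SK_GENERATE_GUARD)
			sdt->guard_tag = htons(sk_crc_t10dif(data + off,
							     sector_size));
		if (flags & SK_GENERATE_REF)
			sdt->ref_tag = htonl((uint32_t)(sector & 0xffffffff));
		if (flags & SK_GENERATE_APP)
			sdt->app_tag = 0;
	}
	return 0;
}
```

A caller that only cares about app tags would fill app_tag itself and pass SK_GENERATE_GUARD | SK_GENERATE_REF, leaving the checksum and the reference tag to be patched in.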

Patch #5 provides a mechanism for integrity providers to advertise
both the per-logical-block PI buffer size and the flags that can be
passed to the mod_user_buf_fn hook.  The advertisements can be found
in sysfs, since that's where we present all the other PI details about
a device.

Patch #6 removes redundant code and modifies the tag get/set functions
to follow the other new functions and kmap/unmap the PI buffer page(s)
before messing with the PI buffers, instead of relying on pi_buf being
a valid pointer.

Eventually there will be a patch #7 that makes it so that IO
extensions can be piped through the synchronous IO calls, but it was
nowhere near ready when I sent this patchset. :(

Comments and questions are, as always, welcome.  There will be a
session about this on the second day of LSF/MM, if I'm not mistaken.
A sample program follows this message.

$ cc -o prog prog.c
$ ./prog -rw -pr -s 2048 /path/to/pi/device

--D

/*
 * Userspace DIX API test program
 * Licensed under GPLv2. Copyright 2014 Oracle.
 *
 * XXX: We don't query the kernel for this information like we should!
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <libaio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdlib.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <sys/syscall.h>

#define GENERATE_GUARD  (1)
#define GENERATE_REF    (2)
#define GENERATE_APP    (4)
#define GENERATE_ALL    (7)

#define NR_IOS          (1)
#define NR_IOVS         (2)
#define NR_IOCB_EXTS    (1)

/* Stuff that should go in libaio.h */
#define IO_EXT_INVALID  (0)
#define IO_EXT_PI       (1) /* protection info attached */

#define IOCB_FLAG_EXTENSIONS    (1 << 1)

#define __FIOEXT        04000

struct io_extension {
	__u64 ie_size;
	__u64 ie_has;

	/* PI stuff */
	__u64 ie_pi_buf;
	__u32 ie_pi_buflen;
	__u32 ie_pi_ret;
	__u32 ie_pi_flags;
};

static void io_prep_extensions(struct iocb *iocb, struct io_extension *ext,
			       unsigned int nr)
{
	iocb->u.c.flags |= IOCB_FLAG_EXTENSIONS;
	iocb->u.c.__pad3 = (long long)ext;
}

static void io_prep_extension(struct io_extension *ext)
{
	memset(ext, 0, sizeof(struct io_extension));
	ext->ie_size = sizeof(*ext);
}

static void io_prep_extension_pi(struct io_extension *ext, void *buf,
				 unsigned int buflen, unsigned int flags)
{
	ext->ie_has |= IO_EXT_PI;
	ext->ie_pi_buf = (__u64)buf;
	ext->ie_pi_buflen = buflen;
	ext->ie_pi_flags = flags;
}
/* End stuff for libaio.h */

static void dump_buffer(char *buf, size_t len)

[PATCH 2/6] io: define an interface for IO extensions

2014-03-24 Thread Darrick J. Wong

Define a generic interface to allow userspace to attach metadata to an
IO operation.  This interface will be used initially to implement
protection information (PI) pass through, though it ought to be usable
by anyone else desiring to extend the IO interface.  It should not be
difficult to modify the non-AIO calls to use this mechanism.

Signed-off-by: Darrick J. Wong darrick.w...@oracle.com
---
 fs/aio.c |  180 +-
 include/linux/aio.h  |7 ++
 include/uapi/linux/aio_abi.h |   15 +++-
 3 files changed, 197 insertions(+), 5 deletions(-)


diff --git a/fs/aio.c b/fs/aio.c
index 062a5f6..0c40bdc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static int io_teardown_extensions(struct kiocb *req);
+static int io_setup_extensions(struct kiocb *req, int is_write,
+  struct io_extension __user *ioext);
+static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
struct qstr this = QSTR_INIT("[aio]", 5);
@@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
struct io_event *ev_page, *event;
unsigned long   flags;
unsigned tail, pos;
+   int ret;
+
+   ret = io_teardown_extensions(iocb);
+   if (ret) {
+   if (!res)
+   res = ret;
+   else if (!res2)
+   res2 = ret;
+   else
+   pr_err("error %d tearing down aio extensions\n", ret);
+   }
 
/*
 * Special case handling for sync iocbs:
@@ -1350,15 +1366,167 @@ rw_common:
return 0;
 }
 
+/* IO extension code */
+#define REQUIRED_STRUCTURE_SIZE(type, member)  \
+   (offsetof(type, member) + sizeof(((type *)NULL)->member))
+#define IO_EXT_SIZE(member) \
+   REQUIRED_STRUCTURE_SIZE(struct io_extension, member)
+
+struct io_extension_type {
+   unsigned int type;
+   unsigned int extension_struct_size;
+   int (*setup_fn)(struct kiocb *, int is_write);
+   int (*destroy_fn)(struct kiocb *);
+};
+
+static struct io_extension_type extensions[] = {
+   {IO_EXT_INVALID, 0, NULL, NULL},
+};
+
+static int is_write_iocb(struct iocb *iocb)
+{
+   switch (iocb->aio_lio_opcode) {
+   case IOCB_CMD_PWRITE:
+   case IOCB_CMD_PWRITEV:
+   return 1;
+   default:
+   return 0;
+   }
+}
+
+static int io_teardown_extensions(struct kiocb *req)
+{
+   struct io_extension_type *iet;
+   int ret, ret2;
+
+   if (req->ki_ioext == NULL)
+   return 0;
+
+   /* Shut down all the extensions */
+   ret = 0;
+   for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
+   if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
+   continue;
+   ret2 = iet->destroy_fn(req);
+   if (ret2 && !ret)
+   ret = ret2;
+   }
+
+   /* Copy out return values */
+   if (unlikely(copy_to_user(req->ki_ioext->ke_user,
+ &req->ki_ioext->ke_kern,
+ sizeof(struct io_extension)))) {
+   if (!ret)
+   ret = -EFAULT;
+   }
+
+   kfree(req->ki_ioext);
+   req->ki_ioext = NULL;
+   return ret;
+}
+
+static int io_check_bufsize(__u64 has, __u64 size)
+{
+   struct io_extension_type *iet;
+   __u64 all_flags = 0;
+
+   for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
+   all_flags |= iet->type;
+   if (!(has & iet->type))
+   continue;
+   if (iet->extension_struct_size > size)
+   return -EINVAL;
+   }
+
+   if (has & ~all_flags)
+   return -EINVAL;
+
+   return 0;
+}
+
+static int io_setup_extensions(struct kiocb *req, int is_write,
+  struct io_extension __user *ioext)
+{
+   struct io_extension_type *iet;
+   __u64 sz, has;
+   int ret;
+
+   /* Check size of buffer */
+   if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz))))
+   return -EFAULT;
+   if (sz > PAGE_SIZE ||
+   sz > sizeof(struct io_extension) ||
+   sz < IO_EXT_SIZE(ie_has))
+   return -EINVAL;
+
+   /* Check that the buffer's big enough */
+   if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has))))
+   return -EFAULT;
+   ret = io_check_bufsize(has, sz);
+   if (ret)
+   return ret;
+
+   /* Copy from userland */
+   req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO);
+   if (!req->ki_ioext)
+   return -ENOMEM;
+
+   req->ki_ioext->ke_user = ioext

[PATCH 4/6] PI IO extension: allow user to ask kernel to fill in parts of the protection info

2014-03-24 Thread Darrick J. Wong

Since userspace can now pass PI buffers through to the block integrity
provider, provide a means for userspace to specify a flags argument
with the PI buffer.  The initial user for this will be sd_dif, which
will enable user programs to ask the kernel to fill in whichever
fields they don't want to provide.  This is intended, for example, to
satisfy programs that really only care to provide an app tag.

Signed-off-by: Darrick J. Wong darrick.w...@oracle.com
---
 Documentation/block/data-integrity.txt |   11 
 block/blk-integrity.c  |1 
 drivers/scsi/sd_dif.c  |   76 ++
 fs/aio.c   |3 +
 fs/bio-integrity.c |   80 
 fs/direct-io.c |1 
 include/linux/bio.h|3 +
 include/linux/blkdev.h |2 +
 include/uapi/linux/aio_abi.h   |1 
 9 files changed, 162 insertions(+), 16 deletions(-)


diff --git a/Documentation/block/data-integrity.txt 
b/Documentation/block/data-integrity.txt
index 1d1f070..b72a54f 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -292,7 +292,10 @@ will require extra work due to the application tag.
 
   The bio_integrity_prep_iter should contain the page offset and buffer
   length of the PI buffer, the number of pages, and the actual array of
-  pages, as returned by get_user_pages.
+  pages, as returned by get_user_pages.  The user_flags argument should
+  contain whatever flag values were passed in by userspace; the values
+  of the flags are specific to the block integrity provider, and are
+  passed to the mod_user_buf_fn handler.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
@@ -332,6 +335,12 @@ will require extra work due to the application tag.
   are available per hardware sector.  For DIF this is either 2 or
   0 depending on the value of the Control Mode Page ATO bit.
 
+  'mod_user_buf_fn' updates the appropriate integrity metadata for
+  a WRITE operation.  This function is called when userspace passes
+  in a PI buffer along with file data; the flags argument (which is
+  specific to the blk_integrity provider) arrange for pre-processing
+  of the user buffer prior to issuing the IO.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
 --
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 7fbab84..1cb1eb2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -421,6 +421,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
bi->set_tag_fn = template->set_tag_fn;
bi->get_tag_fn = template->get_tag_fn;
bi->tag_size = template->tag_size;
+   bi->mod_user_buf_fn = template->mod_user_buf_fn;
} else
bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index a7a691d..74182c9 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -53,31 +53,58 @@ static __u16 sd_dif_ip_fn(void *data, unsigned int len)
  * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
  * 16 bit app tag, 32 bit reference tag.
  */
-static void sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+#define GENERATE_GUARD (1)
+#define GENERATE_REF   (2)
+#define GENERATE_APP   (4)
+#define GENERATE_ALL   (7)
+static int sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+int flags)
 {
 	void *buf = bix->data_buf;
 	struct sd_dif_tuple *sdt = bix->prot_buf;
 	sector_t sector = bix->sector;
 	unsigned int i;
 
+	if (flags & ~GENERATE_ALL)
+		return -EINVAL;
+	if (!flags)
+		return -ENOTTY;
+
 	for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-		sdt->guard_tag = fn(buf, bix->sector_size);
-		sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
-		sdt->app_tag = 0;
+		if (flags & GENERATE_GUARD)
+			sdt->guard_tag = fn(buf, bix->sector_size);
+		if (flags & GENERATE_REF)
+			sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
+		if (flags & GENERATE_APP)
+			sdt->app_tag = 0;
 
 		buf += bix->sector_size;
 		sector++;
 	}
+
+   return 0;
 }
 
 static void sd_dif_type1_generate_crc(struct blk_integrity_exchg *bix)
 {
-   sd_dif_type1_generate(bix, sd_dif_crc_fn);
+   sd_dif_type1_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static void sd_dif_type1_generate_ip(struct blk_integrity_exchg *bix)
 {
-   sd_dif_type1_generate(bix, sd_dif_ip_fn);
+   sd_dif_type1_generate

[PATCH 1/6] fs/bio-integrity: remove duplicate code

2014-03-24 Thread Darrick J. Wong

From: Gu Zheng <guz.f...@cn.fujitsu.com>

Most of the code in bio_integrity_verify() and bio_integrity_generate()
is the same, so introduce a helper function,
bio_integrity_generate_verify(), to remove the duplication.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/bio-integrity.c |   83 +++-
 1 file changed, 37 insertions(+), 46 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 4f70f38..413312f 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,25 +301,26 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:   bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:   bio to generate/verify integrity metadata for
+ * @operate:   operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 	struct blk_integrity_exchg bix;
 	struct bio_vec bv;
 	struct bvec_iter iter;
-	sector_t sector = bio->bi_iter.bi_sector;
-	unsigned int sectors, total;
+	sector_t sector;
+	unsigned int sectors, total, ret;
 	void *prot_buf = bio->bi_integrity->bip_buf;
 
-	total = 0;
+	if (operate)
+		sector = bio->bi_iter.bi_sector;
+	else
+		sector = bio->bi_integrity->bip_iter.bi_sector;
+
+	total = ret = 0;
 	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
 	bix.sector_size = bi->sector_size;
 
@@ -330,7 +331,15 @@ static void bio_integrity_generate(struct bio *bio)
 		bix.prot_buf = prot_buf;
 		bix.sector = sector;
 
-		bi->generate_fn(&bix);
+		if (operate) {
+			bi->generate_fn(&bix);
+		} else {
+			ret = bi->verify_fn(&bix);
+			if (ret) {
+				kunmap_atomic(kaddr);
+				return ret;
+			}
+		}
 
 		sectors = bv.bv_len / bi->sector_size;
 		sector += sectors;
@@ -340,6 +349,21 @@ static void bio_integrity_generate(struct bio *bio)
 
 		kunmap_atomic(kaddr);
 	}
+	return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:   bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+   bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -454,40 +478,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	struct blk_integrity_exchg bix;
-	struct bio_vec *bv;
-	sector_t sector = bio->bi_integrity->bip_iter.bi_sector;
-	unsigned int sectors, ret = 0;
-	void *prot_buf = bio->bi_integrity->bip_buf;
-	int i;
-
-	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-	bix.sector_size = bi->sector_size;
-
-	bio_for_each_segment_all(bv, bio, i) {
-		void *kaddr = kmap_atomic(bv->bv_page);
-
-		bix.data_buf = kaddr + bv->bv_offset;
-		bix.data_size = bv->bv_len;
-		bix.prot_buf = prot_buf;
-		bix.sector = sector;
-
-		ret = bi->verify_fn(&bix);
-
-		if (ret) {
-			kunmap_atomic(kaddr);
-			return ret;
-		}
-
-		sectors = bv->bv_len / bi->sector_size;
-		sector += sectors;
-		prot_buf += sectors * bi->tuple_size;
-
-		kunmap_atomic(kaddr);
-	}
-
-	return ret;
+	return bio_integrity_generate_verify(bio, 0);
 }
 
 /**

--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/6] aio/dio: enable PI passthrough

2014-03-24 Thread Darrick J. Wong

Provide an IO extension handler that attaches PI data from the io
extension structure to a kiocb, then teach directio how to attach the
pages representing the PI buffer directly to a bio.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 Documentation/block/data-integrity.txt |   11 
 fs/aio.c   |   62 +
 fs/bio-integrity.c |   94 +++-
 fs/direct-io.c |   70 +++-
 include/linux/aio.h|   10 +++
 include/linux/bio.h|   15 +
 include/uapi/linux/aio_abi.h   |6 ++
 mm/filemap.c   |6 ++
 8 files changed, 259 insertions(+), 15 deletions(-)


diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index 2d735b0a..1d1f070 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -282,6 +282,17 @@ will require extra work due to the application tag.
   It is up to the receiver to process them and verify data
   integrity upon completion.
 
+int bio_integrity_prep_buffer(struct bio *bio, int rw,
+ struct bio_integrity_prep_iter *pi);
+
+  This function should be called before submit_bio; its purpose is to
+  attach an arbitrary array of struct page * containing integrity data
+  to an existing bio.  Primarily this is intended for AIO/DIO to be
+  able to attach a userspace buffer to a bio.
+
+  The bio_integrity_prep_iter should contain the page offset and buffer
+  length of the PI buffer, the number of pages, and the actual array of
+  pages, as returned by get_user_pages.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
diff --git a/fs/aio.c b/fs/aio.c
index 0c40bdc..3f932c3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1379,7 +1379,69 @@ struct io_extension_type {
int (*destroy_fn)(struct kiocb *);
 };
 
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static int destroy_pi_ext(struct kiocb *req)
+{
+	unsigned int i;
+
+	if (req->ki_ioext->ke_pi_iter.pi_userpages == NULL)
+		return 0;
+
+	for (i = 0; i < req->ki_ioext->ke_pi_iter.pi_nrpages; i++)
+		page_cache_release(req->ki_ioext->ke_pi_iter.pi_userpages[i]);
+	kfree(req->ki_ioext->ke_pi_iter.pi_userpages);
+	req->ki_ioext->ke_pi_iter.pi_userpages = NULL;
+
+	return 0;
+}
+
+static int setup_pi_ext(struct kiocb *req, int is_write)
+{
+	struct file *file = req->ki_filp;
+	struct io_extension *ext = req->ki_ioext->ke_kern;
+	void *p;
+	unsigned long start, end;
+	int retval;
+
+	if (!(file->f_flags & O_DIRECT)) {
+		pr_debug("EINVAL: can't use PI without O_DIRECT.\n");
+		return -EINVAL;
+	}
+
+	BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages);
+
+	end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen +
+		PAGE_SIZE - 1) >> PAGE_SHIFT;
+	start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT;
+	req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf);
+	req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen;
+	req->ki_ioext->ke_pi_iter.pi_nrpages = end - start;
+	p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages *
+		    sizeof(struct page *),
+		    GFP_NOIO);
+	if (p == NULL) {
+		pr_err("%s: no room for page array?\n", __func__);
+		return -ENOMEM;
+	}
+	req->ki_ioext->ke_pi_iter.pi_userpages = p;
+
+	retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf,
+				     req->ki_ioext->ke_pi_iter.pi_nrpages,
+				     is_write,
+				     req->ki_ioext->ke_pi_iter.pi_userpages);
+	if (retval != req->ki_ioext->ke_pi_iter.pi_nrpages) {
+		pr_err("%s: couldn't map pages?\n", __func__);
+		req->ki_ioext->ke_pi_iter.pi_nrpages = retval;
+		return -ENOMEM;
+	}
+	req->ki_flags |= KIOCB_DIO_ONLY;
+
+	return 0;
+}
+#endif
+
 static struct io_extension_type extensions[] = {
+   {IO_EXT_PI, IO_EXT_SIZE(ie_pi_ret), setup_pi_ext, destroy_pi_ext},
{IO_EXT_INVALID, 0, NULL, NULL},
 };
 
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 413312f..3df9aeb 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -138,7 +138,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 	struct bio_vec *iv;
 
 	if (bip->bip_vcnt >= bip_integrity_vecs(bip)) {
-		printk(KERN_ERR "%s: bip_vec full\n", __func__);
+		pr_err("%s: bip_vec full\n", __func__);
 		return 0;
 	}
 
@@ -250,7 +250,7 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 		DIV_ROUND_UP(len, bi->tag_size

[PATCH 5/6] PI IO extension: advertise possible userspace flags

2014-03-24 Thread Darrick J. Wong

Expose possible userland flags to the new PI IO extension so that
userspace can discover what flags exist.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 Documentation/ABI/testing/sysfs-block  |   14 ++
 Documentation/block/data-integrity.txt |   22 +
 block/blk-integrity.c  |   33 
 drivers/scsi/sd_dif.c  |   11 +++
 include/linux/blkdev.h |7 +++
 5 files changed, 87 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 279da08..989cb80 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -53,6 +53,20 @@ Description:
512 bytes of data.
 
 
+What:		/sys/block/<disk>/integrity/tuple_size
+Date:		March 2014
+Contact:	Darrick J. Wong <darrick.w...@oracle.com>
+Description:
+		Size in bytes of the integrity data buffer for each logical
+		block.
+
+What:		/sys/block/<disk>/integrity/write_user_flags
+Date:		March 2014
+Contact:	Darrick J. Wong <darrick.w...@oracle.com>
+Description:
+		Provides a list of flags that userspace can pass to the kernel
+		when supplying integrity data for a write IO.
+
 What:		/sys/block/<disk>/integrity/write_generate
 Date:		June 2008
 Contact:	Martin K. Petersen <martin.peter...@oracle.com>
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index b72a54f..e33d4a7 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -341,7 +341,29 @@ will require extra work due to the application tag.
   specific to the blk_integrity provider) arrange for pre-processing
   of the user buffer prior to issuing the IO.
 
+  'user_write_flags' points to an array of struct blk_integrity_flag,
+  which maps mod_user_buf_fn flags to a description of what they do.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
+5.5 PASSING INTEGRITY DATA FROM USERSPACE
+
+The IO extension interface has been expanded to provide
+userspace programs with the ability to provide PI data with a WRITE,
+or to receive PI data with a READ.  The fields ie_pi_buf,
+ie_pi_buflen, and ie_pi_flags should contain a pointer to the PI
+buffer, the length of the PI buffer, and any flags that should be
+passed to the PI provider.
+
+This buffer must contain PI tuples.  Tuples must NOT split a page
+boundary.  Valid flag values can be found in
+/sys/block/*/integrity/write_user_flags.  The tuple size can be found
+in /sys/block/*/integrity/tuple_size.
+
+In general, the flags allow the user program to ask the in-kernel
+integrity provider to fill in some parts of the tuples.  For example,
+the T10 DIF provider can fill in the reference tag (sector number) so
+that userspace can choose not to care about the reference tag.
+
 --
 2007-12-24 Martin K. Petersen martin.peter...@oracle.com
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 1cb1eb2..557d28e 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -307,6 +307,26 @@ static ssize_t integrity_write_show(struct blk_integrity *bi, char *page)
 	return sprintf(page, "%d\n", (bi->flags & INTEGRITY_FLAG_WRITE) != 0);
 }
 
+static ssize_t integrity_write_flags_show(struct blk_integrity *bi, char *page)
+{
+	struct blk_integrity_flag *flag = bi->user_write_flags;
+	char *p = page;
+	ssize_t ret = 0;
+
+	while (flag->value) {
+		ret += snprintf(p, PAGE_SIZE - ret, "0x%x: %s\n",
+				flag->value, flag->descr);
+		p = page + ret;
+		flag++;
+	}
+	return ret;
+}
+
+static ssize_t integrity_tuple_size_show(struct blk_integrity *bi, char *page)
+{
+	return sprintf(page, "%d\n", bi->tuple_size);
+}
+
 static struct integrity_sysfs_entry integrity_format_entry = {
 	.attr = { .name = "format", .mode = S_IRUGO },
 	.show = integrity_format_show,
@@ -329,11 +349,23 @@ static struct integrity_sysfs_entry integrity_write_entry = {
 	.store = integrity_write_store,
 };
 
+static struct integrity_sysfs_entry integrity_write_flags_entry = {
+	.attr = { .name = "write_user_flags", .mode = S_IRUGO },
+	.show = integrity_write_flags_show,
+};
+
+static struct integrity_sysfs_entry integrity_tuple_size_entry = {
+	.attr = { .name = "tuple_size", .mode = S_IRUGO },
+	.show = integrity_tuple_size_show,
+};
+
 static struct attribute *integrity_attrs[] = {
 	&integrity_format_entry.attr,
 	&integrity_tag_size_entry.attr,
 	&integrity_read_entry.attr,
 	&integrity_write_entry.attr,
+	&integrity_write_flags_entry.attr

[PATCH 6/6] blk-integrity: refactor various routines

2014-03-24 Thread Darrick J. Wong

Refactor blk-integrity.c to avoid duplicating similar functions, and
remove all users of pi_buf, since it's really only there to handle the
(common) case where the kernel auto-generates all the PI data.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 fs/bio-integrity.c  |  120 +--
 include/linux/bio.h |2 -
 2 files changed, 49 insertions(+), 73 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 381ee38..3ff1572 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -97,8 +97,7 @@ void bio_integrity_free(struct bio *bio)
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 	struct bio_set *bs = bio->bi_pool;
 
-	if (bip->bip_owns_buf)
-		kfree(bip->bip_buf);
+	kfree(bip->bip_buf);
 
 	if (bs) {
 		if (bip->bip_slab != BIO_POOL_NONE)
@@ -239,9 +238,11 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 {
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-	unsigned int nr_sectors;
-
-	BUG_ON(bip->bip_buf == NULL);
+	unsigned int nr_sectors, tag_offset, sectors;
+	void *prot_buf;
+	unsigned int prot_offset, prot_len;
+	struct bio_vec *iv;
+	void (*tag_fn)(void *buf, void *tag_buf, unsigned int);
 
 	if (bi->tag_size == 0)
 		return -1;
@@ -255,10 +256,30 @@ static int bio_integrity_tag(struct bio *bio, void *tag_buf, unsigned int len,
 		return -1;
 	}
 
-	if (set)
-		bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
-	else
-		bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+	iv = bip->bip_vec;
+	prot_offset = iv->bv_offset;
+	prot_len = iv->bv_len;
+	prot_buf = kmap_atomic(iv->bv_page);
+	tag_fn = set ? bi->set_tag_fn : bi->get_tag_fn;
+	tag_offset = 0;
+
+	while (nr_sectors) {
+		if (prot_len < bi->tuple_size) {
+			kunmap_atomic(prot_buf);
+			iv++;
+			BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+			prot_offset = iv->bv_offset;
+			prot_len = iv->bv_len;
+			prot_buf = kmap_atomic(iv->bv_page);
+		}
+		sectors = min(prot_len / bi->tuple_size, nr_sectors);
+		tag_fn(prot_buf + prot_offset, tag_buf + tag_offset, sectors);
+		nr_sectors -= sectors;
+		tag_offset += sectors * bi->tuple_size;
+		prot_offset += sectors * bi->tuple_size;
+		prot_len -= sectors * bi->tuple_size;
+	}
+	kunmap_atomic(prot_buf);
 
return 0;
 }
@@ -300,28 +321,24 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 }
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
-/**
- * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
- * @bio:	bio to generate/verify integrity metadata for
- */
-int bio_integrity_update_user_buffer(struct bio *bio)
+typedef int (walk_buf_fn)(struct blk_integrity_exchg *bi, int flags);
+
+static int bio_integrity_walk_bufs(struct bio *bio, sector_t sector,
+				   walk_buf_fn *mod_fn)
 {
 	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
 	struct blk_integrity_exchg bix;
 	struct bio_vec bv;
 	struct bvec_iter iter;
-	sector_t sector;
 	unsigned int sectors, total, ret;
 	void *prot_buf;
 	unsigned int prot_offset, prot_len, bv_offset, bv_len;
 	struct bio_vec *iv;
 	struct bio_integrity_payload *bip = bio->bi_integrity;
 
-	if (!bi->mod_user_buf_fn)
+	if (!mod_fn)
 		return 0;
 
-	sector = bio->bi_iter.bi_sector;
-
 	total = ret = 0;
 	bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
 	bix.sector_size = bi->sector_size;
@@ -351,7 +368,7 @@ int bio_integrity_update_user_buffer(struct bio *bio)
 		bix.prot_buf = prot_buf + prot_offset;
 		bix.sector = sector;
 
-		ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+		ret = mod_fn(&bix, bip->bip_user_flags);
 		if (ret) {
 			if (ret == -ENOTTY)
 				ret = 0;
@@ -374,59 +391,19 @@ int bio_integrity_update_user_buffer(struct bio *bio)
kunmap_atomic(prot_buf);
return ret;
 }
-EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
- * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
  * @bio:   bio to generate/verify integrity metadata for
- * @operate:   operate number, 1 for generate, 0 for verify
+ * @sector:stratin
  */
-static int

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-23 Thread Darrick J. Wong

On Sun, Mar 23, 2014 at 03:02:44PM +0100, Jan Kara wrote:
> On Sat 22-03-14 02:43:20, Darrick J. Wong wrote:
> > On Fri, Mar 21, 2014 at 07:32:16PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> > > > I'll admit, though, that I don't really like having to fetch the 'has'
> > > > bits first to find out how large the rest of the struct is.  Maybe
> > > > that's not worth worrying about.
> > >
> > > I'm not worrying about having to pluck 'has' out of the structure, but needing
> > > a function to tell me how big of a buffer I need for a given pile of flags
> > > seems ... icky.  But maybe the ease of modifying strace and security auditors
> > > would make it worth it?
> >
> > How about explicitly specifying the structure size in struct some_more_args,
> > and checking that against whatever we find in .has?  Hm.  I still think that's
> > too clever for my brain to keep together for long.
> >
> > I'm also nervous that we could be creating this monster of a structure wherein
> > some user wants to tack the first and last hints ever created onto an IO, so
> > now we have to lug this huge structure around that has space for hints that
> > we're not going to use, and most of which is zeroes.
>   Well, why does it matter that the structure would be big? Are do you
> think the memory consumption would matter?

I doubt the memory consumption will be a big deal (compared to the size of the
IOs), but I'm a little concerned about the overhead of copying a mostly-zeroes
user buffer into the kernel.  I guess it's not a big deal to copy the whole
thing now and if people complain about the overhead, switch it to let the IO
attribute controllers selectively copy_from_user later.

--D
 
> 								Honza
> --
> Jan Kara <j...@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-22 Thread Darrick J. Wong

On Fri, Mar 21, 2014 at 07:32:16PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
> > On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
> > > On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> > >
> > > > I'm inclined to agree with Zach on this item.  Ultimately, we need an
> > > > extensible data structure that can be grown without completely revising
> > > > the ABI as new parameters are added.  We need something that is either
> > > > TLV based, or an extensible array.
> > >
> > > Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
> > > that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
> > > becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
> > > becomes a u16 that tells us the array length.  The libaio.h equivalents are
> > > iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.
> > >
> > > Next, let's define a conceptual structure for aio extensions:
> > >
> > > struct iocb_extension {
> > > 	void *ie_buf;
> > > 	unsigned int ie_buflen;
> > > 	unsigned int ie_type;
> > > 	unsigned int ie_flags;
> > > };
> > >
> > > The actual definitions can be defined in a similar fashion to the other aio
> > > structures so that the structures are padded to the same layout regardless of
> > > bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.
> >
> > I'm firmly in the camp that doesn't want to go down this abstract road.
> > We had this conversation with Kent when he wanted to do something very
> > similar.
>
> Could you point me to this discussion?  I'd like to read it.

Is it "[RFC, PATCH] Extensible AIO interface"?
http://lkml.iu.edu//hypermail/linux/kernel/1210.0/00651.html

Regrettably that discussion happened right during that period where I was
pleasantly AWOL from work for a few months. :)

Will read ... tomorrow.

> > What happens if there are duplicate ie_types?  Is that universally
> > prohibited, validity left up to the types that are duplicated?
>
> Yes.
>
> > What if the len is not the right size?  Who checks that?
>
> The extension driver, presumably.
>
> > What if the extension (they're arguments, but one thing at a time) is
> > writable and the buf pointers overlap or is unaligned?  Is that cool, who
> > checks it?
>
> Each extension driver has to check the alignment.  I don't know what to do
> about buffer pointer overlap; if you want to shoot yourself in the foot that's
> fine with me.
>
> > Who defines the acceptable set?

(This was an "I don't know", for anyone who cares.)

 
> > Can drivers make up their own weird types?
>
> How do you mean?  As far as whatever's in the ie_buf, I think that depends on
> the extension.
>
> > How does strace print all this?  How does the security module universe
> > declare policies that can forbid or allow these things?
>
> I don't know.
>
> > Personally, I think this level of dynamism is not worth the complexity.
> >
> > Can we instead just have a nice easy struct with fixed members that only
> > grows?
> >
> > struct some_more_args {
> > 	u64 has; /* = HAS_PI_VEC; */
> > 	u64 pi_vec_ptr;
> > 	u64 pi_vec_nr_segs;
> > };
> >
> > struct some_more_args {
> > 	u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
> > 	u64 pi_vec_ptr;
> > 	u64 pi_vec_nr_segs;
> > 	u64 magic_thing;
> > };
> >
> > If it only grows and has bits indicating presence then I think we're
> > good.   You only fetch the space for the bits that are indicated.  You
> > can return errors for bits you don't recognize.  You could perhaps offer
> > some way to announce the bits you recognize.
>
> <shrug> I was gonna just -EINVAL for types we don't recognize, or which don't
> apply in this scenario.
>
> > I'll admit, though, that I don't really like having to fetch the 'has'
> > bits first to find out how large the rest of the struct is.  Maybe
> > that's not worth worrying about.
>
> I'm not worrying about having to pluck 'has' out of the structure, but needing
> a function to tell me how big of a buffer I need for a given pile of flags
> seems ... icky.  But maybe the ease of modifying strace and security auditors
> would make it worth it?

How about explicitly specifying the structure size in struct some_more_args,
and checking that against whatever we find in .has?  Hm.  I still think that's
too clever for my brain to keep together for long.
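
[Editor's sketch, not part of the original mail.]  The size-versus-.has
check floated above could look roughly like the following.  This is only
a sketch under stated assumptions: the "size" member, the HAS_* bit
values, and both helper names are hypothetical and appear in none of the
posted patches.  It relies on the struct only ever growing, so the
highest presence bit decides how many bytes the caller must supply.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Presence bits named after the quoted proposal; the values are assumed. */
#define HAS_PI_VEC	(1ULL << 0)
#define HAS_MAGIC_THING	(1ULL << 1)
#define HAS_KNOWN	(HAS_PI_VEC | HAS_MAGIC_THING)

struct some_more_args {
	uint64_t has;		/* which optional members are present */
	uint64_t size;		/* hypothetical: bytes supplied by the caller */
	uint64_t pi_vec_ptr;
	uint64_t pi_vec_nr_segs;
	uint64_t magic_thing;
};

/*
 * The struct only grows, so the highest presence bit determines how many
 * bytes are needed.  Returns 0 if 'has' contains unrecognized bits.
 */
static size_t some_more_args_size(uint64_t has)
{
	if (has & ~HAS_KNOWN)
		return 0;
	if (has & HAS_MAGIC_THING)
		return sizeof(struct some_more_args);
	if (has & HAS_PI_VEC)
		return offsetof(struct some_more_args, magic_thing);
	return offsetof(struct some_more_args, pi_vec_ptr);
}

/* Returns 0 if the caller's size matches what 'has' implies, else -EINVAL. */
static int some_more_args_check(uint64_t has, uint64_t size)
{
	size_t want = some_more_args_size(has);

	if (want == 0 || size != want)
		return -EINVAL;
	return 0;
}
```

The kernel would still have to read "has" and "size" before copying the
rest, so this does not remove the two-step fetch; it only makes a
mismatch between the two fields an explicit error.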

I'm also nervous that we could be creating this monster of a structure wherein
some user wants to tack the first and last hints ever created onto an IO, so
now we have to lug this huge structure around that has space for hints that
we're not going to use, and most of which is zeroes.

I think it would be easy to add one of these interfaces to the regular
{read,write}{,v} calls too.

--D
 
> > Thoughts?  Am I out to lunch here?
>
> I don't have a problem adopting your design, aside from the complications of
> figuring out how big struct some_more_args really is.
>
> > > Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
> > > to the extension designer

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong

On Fri, Mar 21, 2014 at 10:57:31AM -0400, Jeff Moyer wrote:
> Darrick J. Wong <darrick.w...@oracle.com> writes:
>
> > This RFC provides a rough implementation of a mechanism to allow
> > userspace to attach protection information (e.g. T10 DIF) data to a
> > disk write and to receive the information alongside a disk read.  The
> > interface is an extension to the AIO interface: two new commands
> > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
>
> Sorry for the shallow question, but what does that M stand for?

Hmmm... I really don't remember why I picked 'M'.  Probably because it implied
that the IO has extra 'M'etadata associated with it.

But now I see, 'VM' connotes something entirely wrong.

--D
 
> Cheers,
> Jeff

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong

On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > This RFC provides a rough implementation of a mechanism to allow
> > userspace to attach protection information (e.g. T10 DIF) data to a
> > disk write and to receive the information alongside a disk read.  The
> > interface is an extension to the AIO interface: two new commands
> > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > arg list is interpreted to point to a buffer containing a header,
> > followed by the PI data.
>
> Instead of adding commands that indicate that the final element is a
> magical pi buffer, why not expand the iocb?
>
> In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> is a pointer to an extension of the iocb.  In that extension could be a
> full iov *, nr_segs for PI data.
>
> You'd then translate that into a bigger kernel kiocb with a specific
> pointer to PI data rather than having to bubble the tests for this magic
> final iovec down through the kernel.
>
> > +	if (iocb->ki_flags & KIOCB_USE_PI) {
> > +		nr_segs--;
> > +		pi_iov = (struct iovec *)(iov + nr_segs);
> > +	}
>
> I suggest this because there's already pressure to extend the iocb.
> Folks want io priority inputs, completion time outputs, etc.

I'm curious about the reqprio field -- it seems like it was put there to
request some kind of IO priority change, but the kernel doesn't use it.

If aio_reserved2 becomes a (flag-guarded) pointer to an array of aio
extensions, I'd be tempted to reuse the reqprio to signal the length of the
extension array, and if anyone wants to start using reqprio, they could add it
as an extension.

(More about this in my response to Ben LaHaise.)
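
[Editor's sketch, not part of the original mail.]  If aio_reqprio did
carry the array length as suggested above, a first pass over the
extension array might look like this.  It is a sketch only: the numeric
type values, IOCB_EXT_MAX, and validate_extensions() are hypothetical,
and rejecting unknown and duplicate types with -EINVAL is an assumption
taken from answers given elsewhere in this thread rather than from any
posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Extension types; the numeric values are assumed for illustration. */
enum {
	IOCB_EXT_INVALID = 0,
	IOCB_EXT_PI = 1,
	IOCB_EXT_MAX		/* must stay <= 32 for the bitmap below */
};

/* The conceptual descriptor from the thread, unchanged. */
struct iocb_extension {
	void *ie_buf;
	unsigned int ie_buflen;
	unsigned int ie_type;
	unsigned int ie_flags;
};

/*
 * Walk 'nr' descriptors (nr would come from aio_reqprio) and reject
 * unknown or repeated types.  Per-type length and alignment checks are
 * left to each extension's own handler and are not modeled here.
 */
static int validate_extensions(const struct iocb_extension *ext, uint16_t nr)
{
	uint32_t seen = 0;
	uint16_t i;

	for (i = 0; i < nr; i++) {
		unsigned int type = ext[i].ie_type;

		if (type == IOCB_EXT_INVALID || type >= IOCB_EXT_MAX)
			return -EINVAL;
		if (seen & (1u << type))
			return -EINVAL;	/* duplicate type */
		seen |= 1u << type;
	}
	return 0;
}
```

A real implementation would also have to copy the array in from
userspace before looking at it; that step is omitted here.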

> It's a much cleaner way to extend the interface without an explosion of
> command enums that are really combinations of per-io arguments that are
> present or not.

Agreed.

> And heck, on the sync rw syscall side, add variants that have a pointer
> to this same extension struct.  There's nothing inherently aio specific
> about having lots more per-io inputs and outputs.

I'm curious -- what kinds of extensions do you envision for sync()?

--D
 
> - z

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong

On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
> Hi folks,
>
> On Fri, Mar 21, 2014 at 11:23:32AM -0700, Zach Brown wrote:
> > On Thu, Mar 20, 2014 at 09:30:41PM -0700, Darrick J. Wong wrote:
> > > This RFC provides a rough implementation of a mechanism to allow
> > > userspace to attach protection information (e.g. T10 DIF) data to a
> > > disk write and to receive the information alongside a disk read.  The
> > > interface is an extension to the AIO interface: two new commands
> > > (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
> > > arg list is interpreted to point to a buffer containing a header,
> > > followed by the PI data.
> >
> > Instead of adding commands that indicate that the final element is a
> > magical pi buffer, why not expand the iocb?
> >
> > In the user iocb, a bit in aio_flags could indicate that aio_reserved2
> > is a pointer to an extension of the iocb.  In that extension could be a
> > full iov *, nr_segs for PI data.
>
> I'm inclined to agree with Zach on this item.  Ultimately, we need an
> extensible data structure that can be grown without completely revising
> the ABI as new parameters are added.  We need something that is either
> TLV based, or an extensible array.

Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
that this struct iocb has extensions attached to it.  Then, iocb.aio_reserved2
becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
becomes a u16 that tells us the array length.  The libaio.h equivalents are
iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.

Next, let's define a conceptual structure for aio extensions:

struct iocb_extension {
void *ie_buf;
unsigned int ie_buflen;
unsigned int ie_type;
unsigned int ie_flags;
};

The actual definitions can be defined in a similar fashion to the other aio
structures so that the structures are padded to the same layout regardless of
bitness.  As mentioned above, iocb.aio_reserved2 points to an array of these.
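
[Editor's sketch, not part of the original mail.]  One way to get "the
same layout regardless of bitness" is to carry the user pointer as a
64-bit integer and pad explicitly, which is how struct iocb in
aio_abi.h handles aio_buf.  The struct name, the ie_pad member, and the
two helpers below are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-layout form of the conceptual struct iocb_extension. */
struct iocb_extension_fixed {
	uint64_t ie_buf;	/* user pointer carried as a 64-bit integer */
	uint32_t ie_buflen;
	uint32_t ie_type;
	uint32_t ie_flags;
	uint32_t ie_pad;	/* explicit padding, assumed to be zero */
};

/* These hold on both 32-bit and 64-bit builds because no member is a pointer. */
_Static_assert(sizeof(struct iocb_extension_fixed) == 24, "unexpected size");
_Static_assert(offsetof(struct iocb_extension_fixed, ie_buflen) == 8,
	       "unexpected offset");

/* Store a native pointer and its length into the fixed-width fields. */
static void ie_set_buf(struct iocb_extension_fixed *ie, void *buf, uint32_t len)
{
	ie->ie_buf = (uint64_t)(uintptr_t)buf;
	ie->ie_buflen = len;
}

/* Recover the native pointer from the 64-bit field. */
static void *ie_get_buf(const struct iocb_extension_fixed *ie)
{
	return (void *)(uintptr_t)ie->ie_buf;
}
```

With this shape a 32-bit process on a 64-bit kernel would hand over the
same bytes as a native 64-bit process, so no compat translation of the
descriptor itself should be needed.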

Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave that
to the extension designer to decide if they want to support either a S-G list,
one big (vaddr) buffer, or toggle flags?

For the PI passthrough, I'll define IOCB_EXT_PI as the first ie_type, and move
the flags argument out of the PI buffer and into ie_flags.

I could also make an IOCB_EXT_REQPRIO where ie_flags = reqprio, but since the
kernel ignores it right now, I don't see much point.

  You'd then translate that into a bigger kernel kiocb with a specific
  pointer to PI data rather than having to bubble the tests for this magic
  final iovec down through the kernel.
  
  +   if (iocb->ki_flags & KIOCB_USE_PI) {
  +   nr_segs--;
  +   pi_iov = (struct iovec *)(iov + nr_segs);
  +   }
  
  I suggest this because there's already pressure to extend the iocb.
  Folks want io priority inputs, completion time outputs, etc.
 
 There are already folks at other companies looking at similar extensions.  
 I think there are folks at Google who have similar requirements.

To everyone else interested in AIO extensions: I'd love to hear your ideas.

 Do you have time to put in some effort into defining these extensions?

I think so.  Let's see how much we can get done.

--D
 
   -ben
 
  It's a much cleaner way to extend the interface without an explosion of
  command enums that are really combinations of per-io arguments that are
  present or not.
  
  And heck, on the sync rw syscall side, add variants that have a pointer
  to this same extension struct.  There's nothing inherently aio specific
  about having lots more per-io inputs and outputs.
  
  - z
 
 -- 
 Thought is the essence of where you are now.
 --
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Darrick J. Wong

On Fri, Mar 21, 2014 at 05:29:09PM -0700, Zach Brown wrote:
 On Fri, Mar 21, 2014 at 03:54:37PM -0700, Darrick J. Wong wrote:
  On Fri, Mar 21, 2014 at 05:44:10PM -0400, Benjamin LaHaise wrote:
 
   I'm inclined to agree with Zach on this item.  Ultimately, we need an 
   extensible data structure that can be grown without completely revising 
   the ABI as new parameters are added.  We need something that is either 
   TLV based, or an extensible array.
  
  Ok.  Let's define IOCB_FLAG_EXTENSIONS as an iocb.aio_flags flag to indicate
  that this struct iocb has extensions attached to it.  Then, 
  iocb.aio_reserved2
  becomes a pointer to an array of extension descriptors, and iocb.aio_reqprio
  becomes a u16 that tells us the array length.  The libaio.h equivalents are
  iocb.u.c.flags, iocb.u.c.__pad3, and iocb.aio_reqprio, respectively.
  
  Next, let's define a conceptual structure for aio extensions:
  
  struct iocb_extension {
  void *ie_buf;
  unsigned int ie_buflen;
  unsigned int ie_type;
  unsigned int ie_flags;
  };
  
  The actual definitions can be defined in a similar fashion to the other aio
  structures so that the structures are padded to the same layout regardless 
  of
  bitness.  As mentioned above, iocb.aio_reserved2 points to an array of 
  these.
 
 I'm firmly in the camp that doesn't want to go down this abstract road.
 We had this conversation with Kent when he wanted to do something very
 similar.

Could you point me to this discussion?  I'd like to read it.

 What happens if there are duplicate ie_types?  Is that universally
 prohibited, validity left up to the types that are duplicated?

Yes.

 What if the len is not the right size?  Who checks that?

The extension driver, presumably.

  What if the extension (they're arguments, but one thing at a time) is
  writable and the buf pointers overlap or is unaligned?  Is that cool, who
  checks it?

Each extension driver has to check the alignment.  I don't know what to do
about buffer pointer overlap; if you want to shoot yourself in the foot that's
fine with me.

 Who defines the acceptable set?


  Can drivers make up their own weird types?

How do you mean?  As far as whatever's in the ie_buf, I think that depends on
the extension.

  How does strace print all this?  How does the security module universe
  declare policies that can forbid or allow these things?

I don't know.

 Personally, I think this level of dynamism is not worth the complexity.
 
 Can we instead just have a nice easy struct with fixed members that only
 grows?
 
 struct some_more_args {
   u64 has; /* = HAS_PI_VEC; */
   u64 pi_vec_ptr;
   u64 pi_vec_nr_segs;
 };
 
 struct some_more_args {
   u64 has; /* = HAS_PI_VEC | HAS_MAGIC_THING */
   u64 pi_vec_ptr;
   u64 pi_vec_nr_segs;
   u64 magic_thing;
 };
 
 If it only grows and has bits indicating presence then I think we're
 good.   You only fetch the space for the bits that are indicated.  You
 can return errors for bits you don't recognize.  You could perhaps offer
 some way to announce the bits you recognize.

*shrug* I was gonna just -EINVAL for types we don't recognize, or which don't
apply in this scenario.

 I'll admit, though, that I don't really like having to fetch the 'has'
 bits first to find out how large the rest of the struct is.  Maybe
 that's not worth worrying about.

I'm not worrying about having to pluck 'has' out of the structure, but needing
a function to tell me how big of a buffer I need for a given pile of flags
seems ... icky.  But maybe the ease of modifying strace and security auditors
would make it worth it?

 Thoughts?  Am I out to lunch here?

I don't have a problem adopting your design, aside from the complications of
figuring out how big struct some_more_args really is.
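
For illustration only, the sizing function might not be too bad if the struct
really does only grow, since the highest recognized bit determines how much to
fetch.  A minimal sketch, assuming bit values for HAS_PI_VEC and
HAS_MAGIC_THING that nobody has actually assigned, and a helper name
(some_more_args_size) made up here; it returns 0 where the kernel would
presumably return -EINVAL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HAS_PI_VEC	(1ULL << 0)	/* assumed bit assignment */
#define HAS_MAGIC_THING	(1ULL << 1)	/* assumed bit assignment */
#define HAS_KNOWN	(HAS_PI_VEC | HAS_MAGIC_THING)

struct some_more_args {
	uint64_t has;
	uint64_t pi_vec_ptr;
	uint64_t pi_vec_nr_segs;
	uint64_t magic_thing;
};

static_assert(sizeof(struct some_more_args) == 32,
	      "all members are u64, so there is no hidden padding");

/*
 * How many bytes of the struct must be fetched for a given 'has' word.
 * Newer fields sit after older ones, so only the newest requested field
 * matters.  Unknown bits yield 0 so the caller can reject the request.
 */
static size_t some_more_args_size(uint64_t has)
{
	if (has & ~HAS_KNOWN)
		return 0;
	if (has & HAS_MAGIC_THING)
		return offsetof(struct some_more_args, magic_thing) +
		       sizeof(uint64_t);
	if (has & HAS_PI_VEC)
		return offsetof(struct some_more_args, magic_thing);
	return sizeof(uint64_t);	/* just the 'has' word itself */
}
```

The cost is still two fetches from userspace (first 'has', then the rest), but
the size logic stays a short chain of tests that strace or an auditor could
mirror.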

  Question: Do we want to allow ie_buf to be struct iovec[]?  Can we leave 
  that
  to the extension designer to decide if they want to support either a S-G 
  list,
  one big (vaddr) buffer, or toggle flags?
 
 No idea.  Either seems doable.  I'd aim for simpler to reduce the number
 of weird cases to handle or forbid (iovecs with a byte per page!) unless
 Martin thinks people want to vector the PI goo.

For now I'll leave it as a simple buffer until I hear otherwise.

  I think so.  Let's see how much we can get done.
 
 FWIW, I'm happy to chat about this in person at LSF next week.  I'll be
 around.

Me too!

--D
 
 - z

[PATCH 5/5] blk-integrity: refactor various routines

2014-03-20 Thread Darrick J. Wong

Refactor blk-integrity.c to avoid duplicating similar functions, and
remove all users of pi_buf, since it's really only there to handle the
(common) case where the kernel auto-generates all the PI data.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 fs/bio-integrity.c  |  120 +--
 include/linux/bio.h |2 -
 2 files changed, 49 insertions(+), 73 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 381ee38..3ff1572 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -97,8 +97,7 @@ void bio_integrity_free(struct bio *bio)
struct bio_integrity_payload *bip = bio->bi_integrity;
struct bio_set *bs = bio->bi_pool;
 
-   if (bip->bip_owns_buf)
-   kfree(bip->bip_buf);
+   kfree(bip->bip_buf);
 
if (bs) {
if (bip->bip_slab != BIO_POOL_NONE)
@@ -239,9 +238,11 @@ static int bio_integrity_tag(struct bio *bio, void 
*tag_buf, unsigned int len,
 {
struct bio_integrity_payload *bip = bio->bi_integrity;
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   unsigned int nr_sectors;
-
-   BUG_ON(bip->bip_buf == NULL);
+   unsigned int nr_sectors, tag_offset, sectors;
+   void *prot_buf;
+   unsigned int prot_offset, prot_len;
+   struct bio_vec *iv;
+   void (*tag_fn)(void *buf, void *tag_buf, unsigned int);
 
if (bi->tag_size == 0)
return -1;
@@ -255,10 +256,30 @@ static int bio_integrity_tag(struct bio *bio, void 
*tag_buf, unsigned int len,
return -1;
}
 
-   if (set)
-   bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
-   else
-   bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+   iv = bip->bip_vec;
+   prot_offset = iv->bv_offset;
+   prot_len = iv->bv_len;
+   prot_buf = kmap_atomic(iv->bv_page);
+   tag_fn = set ? bi->set_tag_fn : bi->get_tag_fn;
+   tag_offset = 0;
+
+   while (nr_sectors) {
+   if (prot_len < bi->tuple_size) {
+   kunmap_atomic(prot_buf);
+   iv++;
+   BUG_ON(iv >= bip->bip_vec + bip->bip_vcnt);
+   prot_offset = iv->bv_offset;
+   prot_len = iv->bv_len;
+   prot_buf = kmap_atomic(iv->bv_page);
+   }
+   sectors = min(prot_len / bi->tuple_size, nr_sectors);
+   tag_fn(prot_buf + prot_offset, tag_buf + tag_offset, sectors);
+   nr_sectors -= sectors;
+   tag_offset += sectors * bi->tuple_size;
+   prot_offset += sectors * bi->tuple_size;
+   prot_len -= sectors * bi->tuple_size;
+   }
+   kunmap_atomic(prot_buf);
 
return 0;
 }
@@ -300,28 +321,24 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, 
unsigned int len)
 }
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
-/**
- * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
- * @bio:   bio to generate/verify integrity metadata for
- */
-int bio_integrity_update_user_buffer(struct bio *bio)
+typedef int (walk_buf_fn)(struct blk_integrity_exchg *bi, int flags);
+
+static int bio_integrity_walk_bufs(struct bio *bio, sector_t sector,
+  walk_buf_fn *mod_fn)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec bv;
struct bvec_iter iter;
-   sector_t sector;
unsigned int sectors, total, ret;
void *prot_buf;
unsigned int prot_offset, prot_len, bv_offset, bv_len;
struct bio_vec *iv;
struct bio_integrity_payload *bip = bio->bi_integrity;
 
-   if (!bi->mod_user_buf_fn)
+   if (!mod_fn)
return 0;
 
-   sector = bio->bi_iter.bi_sector;
-
total = ret = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
@@ -351,7 +368,7 @@ int bio_integrity_update_user_buffer(struct bio *bio)
bix.prot_buf = prot_buf + prot_offset;
bix.sector = sector;
 
-   ret = bi->mod_user_buf_fn(&bix, bip->bip_user_flags);
+   ret = mod_fn(&bix, bip->bip_user_flags);
if (ret) {
if (ret == -ENOTTY)
ret = 0;
@@ -374,59 +391,19 @@ int bio_integrity_update_user_buffer(struct bio *bio)
kunmap_atomic(prot_buf);
return ret;
 }
-EXPORT_SYMBOL_GPL(bio_integrity_update_user_buffer);
 
 /**
- * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * bio_integrity_update_user_buffer - Update user-provided PI buffers for a bio
  * @bio:   bio to generate/verify integrity metadata for
- * @operate:   operate number, 1 for generate, 0 for verify
+ * @sector:stratin
  */
-static int

[RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-20 Thread Darrick J. Wong

This RFC provides a rough implementation of a mechanism to allow
userspace to attach protection information (e.g. T10 DIF) data to a
disk write and to receive the information alongside a disk read.  The
interface is an extension to the AIO interface: two new commands
(IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the
arg list is interpreted to point to a buffer containing a header,
followed by the PI data.  These patches are against 3.14-rc7.

The first patch is a little bit of code refactoring, as sent in by Gu
Zheng.  It seems to be queued up for 3.15, so I figured I might as well
start from there.

Patch #2 provides the plumbing to get the user's buffer all the way to
the block integrity code.  I'm not quite sure if the mechanism I took
(passing the results of get_user_pages around) actually works in all
cases (such as the user's buffer being swapped out), but it survives
a simple test.  Due to the way that the code deals with the array of
struct page*s that represent the PI buffer, there's an unfortunate
requirement that no PI tuple may cross a page boundary.  Given that
so far DIF is only 8 or 16 bytes this isn't a problem... yet.  There's
also no explicit fallback for the case where the user pages are not
within a device's DMA range.

Patch #3 builds on the previous patch to allow userspace to send some
flags along with the PI buffer.  The integrity provider now has a
mod_user_buf_fn hook that enables the provider to read the userspace
flags and modify the PI buffer before submit_bio.  For now, this means
that the T10/DIF provider can be told to patch any of the reference, app,
or guard tags.  This is useful for sending PI data with an IO request
for a file on a filesystem, since the kernel can patch in the device's
LBA later.  Also it means that if you only care about, say, app tags,
you can provide those and let the kernel take care of the crc and the
LBA.  I don't know if that's anyone's requirement, but there we are.
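
To make the flag semantics concrete, here is a userspace sketch of patching a
single 8-byte DIF tuple.  The GENERATE_* values and the -EINVAL/-ENOTTY returns
follow patch #3 and the sample program; the names dif_tuple, crc_t10dif and
dif_fill are illustrative rather than the kernel's, and the guard CRC is
computed bit by bit from the 0x8bb7 polynomial instead of using the lookup
table.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATE_GUARD	(1)
#define GENERATE_REF	(2)
#define GENERATE_APP	(4)
#define GENERATE_ALL	(7)

/* Type 1/2 tuple: 16-bit guard, 16-bit app tag, 32-bit reference tag. */
struct dif_tuple {
	uint16_t guard_tag;	/* big-endian CRC of the sector's data */
	uint16_t app_tag;
	uint32_t ref_tag;	/* big-endian low 32 bits of the LBA */
};

static_assert(sizeof(struct dif_tuple) == 8, "DIF tuples are 8 bytes");

/* Bitwise CRC16, polynomial 0x8bb7, initial value 0, no reflection. */
static uint16_t crc_t10dif(const unsigned char *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ?
				(uint16_t)((crc << 1) ^ 0x8bb7) :
				(uint16_t)(crc << 1);
	}
	return crc;
}

/* Overwrite only the fields the caller asked to have generated. */
static int dif_fill(struct dif_tuple *t, const unsigned char *data,
		    size_t sector_size, uint64_t sector, int flags)
{
	if (flags & ~GENERATE_ALL)
		return -EINVAL;
	if (!flags)
		return -ENOTTY;	/* nothing to do; caller supplied it all */

	if (flags & GENERATE_GUARD)
		t->guard_tag = htons(crc_t10dif(data, sector_size));
	if (flags & GENERATE_REF)
		t->ref_tag = htonl((uint32_t)(sector & 0xffffffff));
	if (flags & GENERATE_APP)
		t->app_tag = 0;
	return 0;
}
```

A program that only cares about app tags would fill in app_tag itself and pass
GENERATE_GUARD | GENERATE_REF, leaving the crc and the LBA to the provider.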

Patch #4 provides a mechanism for integrity providers to advertise
both the per-logical-block PI buffer size and the flags that can be
passed to the mod_user_buf_fn hook.  The advertisements can be found
in sysfs, since that's where we present all the other PI details about
a device.

Patch #5 removes redundant code and modifies the tag get/set functions
to follow the other new functions and kmap/unmap the PI buffer page(s)
before messing with the PI buffers, instead of relying on pi_buf being
a valid pointer.

Comments and questions are, as always, welcome.  There will be a
session about this on the second day of LSF/MM, if I'm not mistaken.
A sample program follows this message.

$ cc -o prog prog.c
$ ./prog -rw -p r -s 2048 /path/to/pi/device

--D

/*
 * Userspace DIX API test program
 * Licensed under GPLv2. Copyright 2014 Oracle.
 *
 * XXX: We don't query the kernel for this information like we should!
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <libaio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdlib.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define IOCB_CMD_PREADVM	(9)
#define IOCB_CMD_PWRITEVM	(10)
#define GENERATE_GUARD		(1)
#define GENERATE_REF		(2)
#define GENERATE_APP		(4)
#define GENERATE_ALL		(7)

#define NR_IOS	(1)

static void dump_buffer(char *buf, size_t len)
{
	size_t off;
	char *p;

	for (p = buf; p < buf + len; p++) {
		off = p - buf;
		if (off % 32 == 0) {
			if (p != buf)
				printf("\n");
			printf("%05zu:", off);
		}
		printf(" %02x", *p & 0xFF);
	}
	printf("\n");
}

/* Table generated using the following polynomium:
 * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
 * gt: 0x8bb7
 */
static const uint16_t t10_dif_crc_table[256] = {
0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,

[PATCH 1/5] fs/bio-integrity: remove duplicate code

2014-03-20 Thread Darrick J. Wong

From: Gu Zheng <guz.f...@cn.fujitsu.com>

Most of the code in bio_integrity_verify and bio_integrity_generate
is the same, so introduce a helper function, bio_integrity_generate_verify(),
to remove the duplicate code.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/bio-integrity.c |   83 +++-
 1 file changed, 37 insertions(+), 46 deletions(-)


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 4f70f38..413312f 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -301,25 +301,26 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, 
unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:   bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:   bio to generate/verify integrity metadata for
+ * @operate:   operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec bv;
struct bvec_iter iter;
-   sector_t sector = bio->bi_iter.bi_sector;
-   unsigned int sectors, total;
+   sector_t sector;
+   unsigned int sectors, total, ret;
void *prot_buf = bio->bi_integrity->bip_buf;
 
-   total = 0;
+   if (operate)
+   sector = bio->bi_iter.bi_sector;
+   else
+   sector = bio->bi_integrity->bip_iter.bi_sector;
+
+   total = ret = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
 
@@ -330,7 +331,15 @@ static void bio_integrity_generate(struct bio *bio)
bix.prot_buf = prot_buf;
bix.sector = sector;
 
-   bi->generate_fn(&bix);
+   if (operate) {
+   bi->generate_fn(&bix);
+   } else {
+   ret = bi->verify_fn(&bix);
+   if (ret) {
+   kunmap_atomic(kaddr);
+   return ret;
+   }
+   }
 
sectors = bv.bv_len / bi->sector_size;
sector += sectors;
@@ -340,6 +349,21 @@ static void bio_integrity_generate(struct bio *bio)
 
kunmap_atomic(kaddr);
}
+   return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:   bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+   bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -454,40 +478,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-   struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   struct blk_integrity_exchg bix;
-   struct bio_vec *bv;
-   sector_t sector = bio->bi_integrity->bip_iter.bi_sector;
-   unsigned int sectors, ret = 0;
-   void *prot_buf = bio->bi_integrity->bip_buf;
-   int i;
-
-   bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-   bix.sector_size = bi->sector_size;
-
-   bio_for_each_segment_all(bv, bio, i) {
-   void *kaddr = kmap_atomic(bv->bv_page);
-
-   bix.data_buf = kaddr + bv->bv_offset;
-   bix.data_size = bv->bv_len;
-   bix.prot_buf = prot_buf;
-   bix.sector = sector;
-
-   ret = bi->verify_fn(&bix);
-
-   if (ret) {
-   kunmap_atomic(kaddr);
-   return ret;
-   }
-
-   sectors = bv->bv_len / bi->sector_size;
-   sector += sectors;
-   prot_buf += sectors * bi->tuple_size;
-
-   kunmap_atomic(kaddr);
-   }
-
-   return ret;
+   return bio_integrity_generate_verify(bio, 0);
 }
 
 /**


[PATCH 2/5] aio/dio: enable DIX passthrough

2014-03-20 Thread Darrick J. Wong

Provide a set of new AIO commands (IOCB_CMD_P{READ,WRITE}VM) that
utilize the last iovec of the iovec array to convey protection
information to and from userspace.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 Documentation/block/data-integrity.txt |   11 ++
 fs/aio.c   |   22 
 fs/bio-integrity.c |   93 +++
 fs/direct-io.c |  157 +---
 include/linux/aio.h|3 +
 include/linux/bio.h|   15 +++
 include/uapi/linux/aio_abi.h   |2 
 mm/filemap.c   |7 +
 8 files changed, 294 insertions(+), 16 deletions(-)


diff --git a/Documentation/block/data-integrity.txt 
b/Documentation/block/data-integrity.txt
index 2d735b0a..1d1f070 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -282,6 +282,17 @@ will require extra work due to the application tag.
   It is up to the receiver to process them and verify data
   integrity upon completion.
 
+int bio_integrity_prep_buffer(struct bio *bio, int rw,
+ struct bio_integrity_prep_iter *pi);
+
+  This function should be called before submit_bio; its purpose is to
+  attach an arbitrary array of struct page * containing integrity data
+  to an existing bio.  Primarily this is intended for AIO/DIO to be
+  able to attach a userspace buffer to a bio.
+
+  The bio_integrity_prep_iter should contain the page offset and buffer
+  length of the PI buffer, the number of pages, and the actual array of
+  pages, as returned by get_user_pages.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
diff --git a/fs/aio.c b/fs/aio.c
index 062a5f6..5d425d8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1259,6 +1259,11 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned 
opcode,
struct iovec inline_vec, *iovec = &inline_vec;
 
switch (opcode) {
+   case IOCB_CMD_PREADVM:
+   if (!(file->f_flags & O_DIRECT))
+   return -EINVAL;
+   req->ki_flags |= KIOCB_USE_PI;
+
case IOCB_CMD_PREAD:
case IOCB_CMD_PREADV:
mode= FMODE_READ;
@@ -1266,6 +1271,11 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned 
opcode,
rw_op   = file->f_op->aio_read;
goto rw_common;
 
+   case IOCB_CMD_PWRITEVM:
+   if (!(file->f_flags & O_DIRECT))
+   return -EINVAL;
+   req->ki_flags |= KIOCB_USE_PI;
+
case IOCB_CMD_PWRITE:
case IOCB_CMD_PWRITEV:
mode= FMODE_WRITE;
@@ -1280,7 +1290,9 @@ rw_common:
return -EINVAL;
 
ret = (opcode == IOCB_CMD_PREADV ||
-  opcode == IOCB_CMD_PWRITEV)
+  opcode == IOCB_CMD_PWRITEV ||
+  opcode == IOCB_CMD_PREADVM ||
+  opcode == IOCB_CMD_PWRITEVM)
? aio_setup_vectored_rw(req, rw, buf, &nr_segs,
&iovec, compat)
: aio_setup_single_vector(req, rw, buf, &nr_segs,
@@ -1288,6 +1300,13 @@ rw_common:
if (ret)
return ret;
 
+   if ((req->ki_flags & KIOCB_USE_PI) && nr_segs < 2) {
+   pr_err("%s: not enough iovecs for PI!\n", __func__);
+   if (iovec != &inline_vec)
+   kfree(iovec);
+   return -EINVAL;
+   }
+
ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);
if (ret < 0) {
if (iovec != &inline_vec)
@@ -1407,6 +1426,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
req->ki_user_data = iocb->aio_data;
req->ki_pos = iocb->aio_offset;
req->ki_nbytes = iocb->aio_nbytes;
+   req->ki_flags = 0;
 
ret = aio_run_iocb(req, iocb->aio_lio_opcode,
   (char __user *)(unsigned long)iocb->aio_buf,
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 413312f..af398f0 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -138,7 +138,7 @@ int bio_integrity_add_page(struct bio *bio, struct page 
*page,
struct bio_vec *iv;
 
if (bip->bip_vcnt >= bip_integrity_vecs(bip)) {
-   printk(KERN_ERR "%s: bip_vec full\n", __func__);
+   pr_err("%s: bip_vec full\n", __func__);
return 0;
}
 
@@ -250,7 +250,7 @@ static int bio_integrity_tag(struct bio *bio, void 
*tag_buf, unsigned int len,
DIV_ROUND_UP(len, bi->tag_size));
 
if (nr_sectors * bi->tuple_size > bip->bip_iter.bi_size) {
-   printk(KERN_ERR "%s: tag too big for bio: %u > %u\n", __func__

[PATCH 3/5] aio/dio: allow user to ask kernel to fill in parts of the protection info

2014-03-20 Thread Darrick J. Wong

Since userspace can now pass PI buffers through to the block integrity
provider, provide a means for userspace to specify a flags argument
with the PI buffer.  The initial user for this will be sd_dif, which
will enable user programs to ask the kernel to fill in whichever
fields they don't want to provide.  This is intended, for example, to
satisfy programs that really only care to provide an app tag.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 Documentation/block/data-integrity.txt |   11 
 block/blk-integrity.c  |1 
 drivers/scsi/sd_dif.c  |   76 
 fs/bio-integrity.c |   87 +++-
 fs/direct-io.c |   15 ++
 include/linux/bio.h|3 +
 include/linux/blkdev.h |2 +
 7 files changed, 178 insertions(+), 17 deletions(-)


diff --git a/Documentation/block/data-integrity.txt 
b/Documentation/block/data-integrity.txt
index 1d1f070..b72a54f 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -292,7 +292,10 @@ will require extra work due to the application tag.
 
   The bio_integrity_prep_iter should contain the page offset and buffer
   length of the PI buffer, the number of pages, and the actual array of
-  pages, as returned by get_user_pages.
+  pages, as returned by get_user_pages.  The user_flags argument should
+  contain whatever flag values were passed in by userspace; the values
+  of the flags are specific to the block integrity provider, and are
+  passed to the mod_user_buf_fn handler.
 
 5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
 METADATA
@@ -332,6 +335,12 @@ will require extra work due to the application tag.
   are available per hardware sector.  For DIF this is either 2 or
   0 depending on the value of the Control Mode Page ATO bit.
 
+  'mod_user_buf_fn' updates the appropriate integrity metadata for
+  a WRITE operation.  This function is called when userspace passes
+  in a PI buffer along with file data; the flags argument (which is
+  specific to the blk_integrity provider) arrange for pre-processing
+  of the user buffer prior to issuing the IO.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
 --
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 7fbab84..1cb1eb2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -421,6 +421,7 @@ int blk_integrity_register(struct gendisk *disk, struct 
blk_integrity *template)
bi->set_tag_fn = template->set_tag_fn;
bi->get_tag_fn = template->get_tag_fn;
bi->tag_size = template->tag_size;
+   bi->mod_user_buf_fn = template->mod_user_buf_fn;
} else
bi->name = bi_unsupported_name;
 
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index a7a691d..74182c9 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -53,31 +53,58 @@ static __u16 sd_dif_ip_fn(void *data, unsigned int len)
  * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
  * 16 bit app tag, 32 bit reference tag.
  */
-static void sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+#define GENERATE_GUARD (1)
+#define GENERATE_REF   (2)
+#define GENERATE_APP   (4)
+#define GENERATE_ALL   (7)
+static int sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn,
+int flags)
 {
void *buf = bix->data_buf;
struct sd_dif_tuple *sdt = bix->prot_buf;
sector_t sector = bix->sector;
unsigned int i;
 
+   if (flags & ~GENERATE_ALL)
+   return -EINVAL;
+   if (!flags)
+   return -ENOTTY;
+
for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
-   sdt->guard_tag = fn(buf, bix->sector_size);
-   sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
-   sdt->app_tag = 0;
+   if (flags & GENERATE_GUARD)
+   sdt->guard_tag = fn(buf, bix->sector_size);
+   if (flags & GENERATE_REF)
+   sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
+   if (flags & GENERATE_APP)
+   sdt->app_tag = 0;
 
buf += bix->sector_size;
sector++;
}
+
+   return 0;
 }
 
 static void sd_dif_type1_generate_crc(struct blk_integrity_exchg *bix)
 {
-   sd_dif_type1_generate(bix, sd_dif_crc_fn);
+   sd_dif_type1_generate(bix, sd_dif_crc_fn, GENERATE_ALL);
 }
 
 static void sd_dif_type1_generate_ip(struct blk_integrity_exchg *bix)
 {
-   sd_dif_type1_generate(bix, sd_dif_ip_fn);
+   sd_dif_type1_generate(bix, sd_dif_ip_fn, GENERATE_ALL);
+}
+
+static int sd_dif_type1_mod_crc(struct

[PATCH 4/5] aio/dio: advertise possible userspace flags

2014-03-20 Thread Darrick J. Wong

Expose possible userland flags to the new AIO/DIO PI interface so that
userspace can discover what flags exist.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 Documentation/ABI/testing/sysfs-block  |   14 ++
 Documentation/block/data-integrity.txt |   26 +
 block/blk-integrity.c  |   33 
 drivers/scsi/sd_dif.c  |   11 +++
 include/linux/blkdev.h |7 +++
 5 files changed, 91 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 279da08..989cb80 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -53,6 +53,20 @@ Description:
512 bytes of data.
 
 
+What:  /sys/block/<disk>/integrity/tuple_size
+Date:  March 2014
+Contact:   Darrick J. Wong <darrick.w...@oracle.com>
+Description:
+   Size in bytes of the integrity data buffer for each logical
+   block.
+
+What:  /sys/block/<disk>/integrity/write_user_flags
+Date:  March 2014
+Contact:   Darrick J. Wong <darrick.w...@oracle.com>
+Description:
+   Provides a list of flags that userspace can pass to the kernel
+   when supplying integrity data for a write IO.
+
 What:  /sys/block/<disk>/integrity/write_generate
 Date:  June 2008
 Contact:   Martin K. Petersen <martin.peter...@oracle.com>
diff --git a/Documentation/block/data-integrity.txt 
b/Documentation/block/data-integrity.txt
index b72a54f..38a83a7 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -341,7 +341,33 @@ will require extra work due to the application tag.
   specific to the blk_integrity provider) arrange for pre-processing
   of the user buffer prior to issuing the IO.
 
+  'user_write_flags' points to an array of struct blk_integrity_flag,
+  which maps mod_user_buf_fn flags to a description of what they do.
+
   See 6.2 for a description of get_tag_fn and set_tag_fn.
 
+5.5 PASSING INTEGRITY DATA FROM USERSPACE
+
+The AIO/DIO interface has been extended with a new API to provide
+userspace programs the ability to provide PI data with a WRITE, or
+to receive PI data with a READ.  There are two new AIO commands,
+IOCB_CMD_PREADVM and IOCB_CMD_PWRITEVM.  They have the same general
+struct iocb format as IOCB_CMD_PREADV and IOCB_CMD_PWRITEV, respectively.
+The final struct iovec should point to the buffer that contains the
+PI data.
+
+This buffer must be aligned to a page boundary, and it must have the
+following format: Flags are stored in a 32-bit integer.  There must
+then be padding out to the next multiple of the tuple size.  After
+that comes the tuple data.  Valid flag values can be found in
+/sys/block/*/integrity/user_write_flags.  The tuple size can be found
+in /sys/block/*/integrity/tuple_size.  Tuples must not split a page
+boundary.
+
+In general, the flags allow the user program to ask the in-kernel
+integrity provider to fill in some parts of the tuples.  For example,
+the T10 DIF provider can fill in the reference tag (sector number) so
+that userspace can choose not to care about the reference tag.
+
 --
 2007-12-24 Martin K. Petersen <martin.peter...@oracle.com>
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 1cb1eb2..557d28e 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -307,6 +307,26 @@ static ssize_t integrity_write_show(struct blk_integrity 
*bi, char *page)
return sprintf(page, "%d\n", (bi->flags & INTEGRITY_FLAG_WRITE) != 0);
 }
 
+static ssize_t integrity_write_flags_show(struct blk_integrity *bi, char *page)
+{
+   struct blk_integrity_flag *flag = bi->user_write_flags;
+   char *p = page;
+   ssize_t ret = 0;
+
+   while (flag->value) {
+   ret += snprintf(p, PAGE_SIZE - ret, "0x%x: %s\n",
+   flag->value, flag->descr);
+   p = page + ret;
+   flag++;
+   }
+   return ret;
+}
+
+static ssize_t integrity_tuple_size_show(struct blk_integrity *bi, char *page)
+{
+   return sprintf(page, "%d\n", bi->tuple_size);
+}
+
 static struct integrity_sysfs_entry integrity_format_entry = {
.attr = { .name = format, .mode = S_IRUGO },
.show = integrity_format_show,
@@ -329,11 +349,23 @@ static struct integrity_sysfs_entry integrity_write_entry 
= {
.store = integrity_write_store,
 };
 
+static struct integrity_sysfs_entry integrity_write_flags_entry = {
+   .attr = { .name = write_user_flags, .mode = S_IRUGO },
+   .show = integrity_write_flags_show,
+};
+
+static struct integrity_sysfs_entry integrity_tuple_size_entry = {
+   .attr = { .name = tuple_size, .mode
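
(The rest of this message is cut off in the archive.)

To illustrate the buffer layout that section 5.5 of the documentation hunk describes (a 32-bit flags word, padding out to the next multiple of the tuple size, then the tuples), here is a minimal userspace sketch.  The helper names pi_tuple_offset() and pi_buffer_len() are hypothetical and are not part of the patch; the tuple size is assumed to have been read from /sys/block/*/integrity/tuple_size beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): offset of the first tuple in
 * the PI buffer.  The 32-bit flags word comes first and is padded out
 * to the next multiple of the tuple size.
 */
static size_t pi_tuple_offset(size_t tuple_size)
{
	assert(tuple_size > 0);
	return ((sizeof(uint32_t) + tuple_size - 1) / tuple_size) * tuple_size;
}

/* Hypothetical helper: bytes needed for the flags word plus nr_tuples tuples. */
static size_t pi_buffer_len(size_t tuple_size, size_t nr_tuples)
{
	return pi_tuple_offset(tuple_size) + nr_tuples * tuple_size;
}
```

With an 8-byte tuple the tuples would start at offset 8.  Since the buffer must be page aligned, a tuple size that divides the page size also keeps tuples from straddling a page boundary.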

Re: status of block-integrity

2014-01-06 Thread Darrick J. Wong

On Fri, Jan 03, 2014 at 03:03:42PM -0500, Martin K. Petersen wrote:
> >>>>> "Hannes" == Hannes Reinecke <h...@suse.de> writes:
>
> Hannes> Personally, I doubt it's a good idea to kill it off, but a
> Hannes> proper (userland) API for it has been a long time missing.
>
> Before we throw the baby out with the bath water, maybe Darrick can fill
> us in on the progress of the aio passthrough interface?

I haven't made much progress on it -- I haven't seen any earnest demand for it.

Last year Chuck Lever said that some NFS working group was looking at defining an
interface for it... has there been any progress?  It doesn't sound like there has
been.

--D
 
> --
> Martin K. Petersen	Oracle Linux Engineering
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-07 Thread Darrick J. Wong

On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote:
> On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote:
> >
> > On Feb 6, 2013, at 3:24 PM, Darrick J. Wong <darrick.w...@oracle.com> wrote:
> >
> > > On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
> > > > Hi,
> > > >
> > > > I'm interested in discussing how to pass protection information to and from
> > > > userspace.  Maybe Martin could be enlisted for the discussion.
> > > >
> > > > I read that some work has already been done in this area but have not been able
> > > > to locate it.  It looks like the bio-integrity code already makes it possible
> > > > to generate the t10-dif crc in the filesystem.  It would be good to be able to
> > > > get the guard and application tags back out to backup applications such as
> > > > xfsdump.  Enabling other applications to generate their own tags in userspace
> > > > is also interesting.
> > >
> > > This one's been on my list for a couple of years (and companies) too.  A few
> > > years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
> > > gone anywhere), and more recently I've theorized that we could add a magic
> > > fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT
> > > *{read,write}v call as the PI buffer, which I think is similar to how DIX gets
> > > PI data to a disk.  But it's not like I have any code to show for it.
> > >
> > > I /think/ it's fairly straightforward to change the directio submit code to
> > > find the userspace PI buffer and amend the block integrity code to attach our
> > > own PI buffer.  You'd still have to let the block layer set the sector # field,
> > > but afaik that won't affect the crc or the app tag.
> > >
> > > I hear that the NFS guys want to propose some sort of protocol for transmitting
> > > PI data (across NFS), but I haven't seen anything concrete yet.
> >
> > I'm writing a requirements document for the NFS protocol which I can
> > discuss at LSF.  The use cases for NFS for now would be virtual disk
> > devices (hypervisors) or direct NFS access to storage from user space.
> >
> > Like everyone else we are waiting for a magical VFS and user space API to
> > appear that can pass PI to and from storage.
>
> I'm happy to chat about it.  Unfortunately, like Darrick says, sys_dio()
> coding hasn't happened.  I do think we're better off with some kind of
> explicit API than some magic state on the file.  I mean, even something
> like:
>
> 	ssize_t write_with_pi(int fd, const void *buf, size_t count,
> 			      const void *pi, size_t pi_count);
>
> It's not as nice as a non-historical API (eg sys_dio), but it also
> probably plays nicer with buffered I/O.

I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio
and all the other plumbing necessary to make that happen...

void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
   int iovcnt, long long offset, const void *pi,
   size_t pi_count);

--D
 
> Joel
>
> >
> > > Well, I hope I'll scrape together the time to hack together a PoC before LSF...
> > > on the other hand, I ran the discussion about PI userland interfaces at LPC2011
> > > and (shamefully) haven't done anything yet.
> > >
> > > end rambling
> > >
> > > --D
> > > >
> > > > Regards,
> > > > 	Ben
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > the body of a message to majord...@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> > --
> > Chuck Lever
> > chuck[dot]lever[at]oracle[dot]com
>
> --
>
> "I think it would be a good idea."
> 	- Mahatma Ghandi, when asked what he thought of Western
> 	  civilization
>
> 			http://www.jlbec.org/
> 			jl...@evilplan.org

Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-06 Thread Darrick J. Wong

On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
> Hi,
>
> I'm interested in discussing how to pass protection information to and from
> userspace.  Maybe Martin could be enlisted for the discussion.
>
> I read that some work has already been done in this area but have not been able
> to locate it.  It looks like the bio-integrity code already makes it possible
> to generate the t10-dif crc in the filesystem.  It would be good to be able to
> get the guard and application tags back out to backup applications such as
> xfsdump.  Enabling other applications to generate their own tags in userspace
> is also interesting.

This one's been on my list for a couple of years (and companies) too.  A few
years ago Joel Becker had support for it in his sys_dio proposal (that hasn't
gone anywhere), and more recently I've theorized that we could add a magic
fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT
*{read,write}v call as the PI buffer, which I think is similar to how DIX gets
PI data to a disk.  But it's not like I have any code to show for it.

I /think/ it's fairly straightforward to change the directio submit code to
find the userspace PI buffer and amend the block integrity code to attach our
own PI buffer.  You'd still have to let the block layer set the sector # field,
but afaik that won't affect the crc or the app tag.

I hear that the NFS guys want to propose some sort of protocol for transmitting
PI data (across NFS), but I haven't seen anything concrete yet.

Well, I hope I'll scrape together the time to hack together a PoC before LSF...
on the other hand, I ran the discussion about PI userland interfaces at LPC2011
and (shamefully) haven't done anything yet.

end rambling

--D
 
> Regards,
> 	Ben
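
The idea floated in this message, of treating the first iovec of an O_DIRECT readv/writev call as the PI buffer, might look roughly like this from the caller's side.  This is only a sketch: build_pi_iov() is a hypothetical helper, and nothing in this thread shows a kernel interface that actually interprets the first iovec this way.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: lay out an iovec array whose first element is
 * the PI buffer and whose second element is the data buffer.  The
 * caller would then hand iov[0..1] to readv()/writev() on an O_DIRECT
 * file descriptor, assuming the kernel had been told (by the "magic
 * fcntl/ioctl") to treat iov[0] as protection information.
 */
static int build_pi_iov(struct iovec iov[2], void *pi, size_t pi_len,
			void *data, size_t data_len)
{
	assert(iov != NULL);
	if (!pi || !data || pi_len == 0 || data_len == 0)
		return -1;

	iov[0].iov_base = pi;
	iov[0].iov_len = pi_len;
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
	return 2;	/* number of iovecs filled in */
}
```

Keeping the PI buffer in a fixed slot means existing vectored I/O calls could be reused without a new syscall, at the cost of the hidden per-file state that Joel argues against in his reply.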

[PATCH] sd: Don't incorrectly promote DIF type0 into DIF type1 disks.

2012-11-27 Thread Darrick J. Wong

If I run the following command:
# modprobe scsi_debug dev_size_mb=64 ato=1 dix=1 dif=0

then I see the following in the dmesg log:

[   25.859145] scsi_debug: host protection DIX0

Ok, DIX0, which means no integrity extensions at all, and no DIF support at
all.  I'm not sure why you'd advertise DIX0 at all, but so far so good.

(DIX is the mechanism by which the OS sends integrity data to the disk in
whatever format DIF specifies.  You need DIF for DIX to do anything.)

[   25.860214] scsi0 : scsi_debug, version 1.82 [20100324], dev_size_mb=64, opts=0x0
[   25.863418] scsi 0:0:0:0: Direct-Access     Linux    scsi_debug       0004 PQ: 0 ANSI: 5
[   25.880079] sd 0:0:0:0: [sda] 131072 512-byte logical blocks: (67.1 MB/64.0 MiB)
[   25.884133] sd 0:0:0:0: [sda] Write Protect is off
[   25.892205] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
[   25.920344]  sda: unknown partition table
[   25.926704] sd_dif_config_host: type=0 dif=0 dix=8
[   25.931651] sd 0:0:0:0: [sda] Enabling DIX T10-DIF-TYPE1-CRC protection

Huh??  Here we are turning on DIX support as if the disk supports DIF type1.
This seems strange to me because we didn't advertise any DIF support at all.

[   25.952208] sd 0:0:0:0: [sda] Attached SCSI disk
[   25.977977] BUG: unable to handle kernel paging request at 000ffc02
[   25.980262] IP: [<a01ebaa5>] resp_read.part.38+0x145/0x420 [scsi_debug]

Uhoh, that shouldn't happen.  The SCSI layer sent along what looks like a type1
read request even though the disk wasn't really prepared to handle it, and
kaboom.  I don't think this particular combination is terribly common, but we
could at least not misprogram the disk when we see it.

If the disk advertises DIF type 0, we can skip the rest of the DIF setup.
Right now, the SCSI layer promotes a DIF type 0 disk into a DIF type 1 disk,
which seems incorrect.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 drivers/scsi/sd_dif.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 04998f3..ede5b7b 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -313,6 +313,10 @@ void sd_dif_config_host(struct scsi_disk *sdkp)
 	u8 type = sdkp->protection_type;
 	int dif, dix;
 
+	/* Don't promote DIF type0 into type1 support. */
+	if (type == SD_DIF_TYPE0_PROTECTION)
+		return;
+
 	dif = scsi_host_dif_capable(sdp->host, type);
 	dix = scsi_host_dix_capable(sdp->host, type);
 
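
The effect of the early return in this patch can be modelled outside the kernel as below.  This is a simplified sketch, not the real sd_dif_config_host(): the enum and the function name are made up for illustration, and the single host_capable flag stands in for whatever the rest of the function derives from the host adapter.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the disk's protection type; names are illustrative. */
enum dif_type { DIF_TYPE0 = 0, DIF_TYPE1 = 1, DIF_TYPE2 = 2, DIF_TYPE3 = 3 };

/*
 * Sketch of the decision the patch adds: a disk that reports type 0
 * (no protection) never gets an integrity profile, no matter what the
 * host adapter advertises.  For other types this model simply defers
 * to the host capability flag.
 */
static bool should_configure_integrity(enum dif_type disk_type,
				       bool host_capable)
{
	assert(disk_type >= DIF_TYPE0 && disk_type <= DIF_TYPE3);
	if (disk_type == DIF_TYPE0)
		return false;
	return host_capable;
}
```

In the scsi_debug case from the log (type=0, dix=8), this model returns false, so the "Enabling DIX T10-DIF-TYPE1-CRC protection" step would never be reached.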

[PATCH 1/2] libsas: Provide a transport-level facility to request SAS addrs

2008-02-19 Thread Darrick J. Wong

Provide a facility to use the request_firmware() interface to get a SAS
address from userspace.  This can be used by SAS LLDDs that cannot
obtain the address from the host adapter.

Resend of 8 Oct. 2007 patch, now based off 2.6.25-rc2 + scsi_misc.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_scsi_host.c |   41 +++
 include/scsi/libsas.h   |2 ++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..583d249 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -24,6 +24,8 @@
  */
 
 #include <linux/kthread.h>
+#include <linux/firmware.h>
+#include <linux/ctype.h>
 
 #include "sas_internal.h"
 
@@ -1050,6 +1052,45 @@ void sas_target_destroy(struct scsi_target *starget)
 	return;
 }
 
+static void sas_parse_addr(u8 *sas_addr, const char *p)
+{
+	int i;
+	for (i = 0; i < SAS_ADDR_SIZE; i++) {
+		u8 h, l;
+		if (!*p)
+			break;
+		h = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+		p++;
+		l = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+		p++;
+		sas_addr[i] = (h<<4) | l;
+	}
+}
+
+#define SAS_STRING_ADDR_SIZE	16
+
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr)
+{
+	int res;
+	const struct firmware *fw;
+
+	res = request_firmware(&fw, "sas_addr", &shost->shost_gendev);
+	if (res)
+		return res;
+
+	if (fw->size < SAS_STRING_ADDR_SIZE) {
+		res = -ENODEV;
+		goto out;
+	}
+
+	sas_parse_addr(addr, fw->data);
+
+out:
+	release_firmware(fw);
+	return res;
+}
+EXPORT_SYMBOL_GPL(sas_request_addr);
+
 EXPORT_SYMBOL_GPL(sas_queuecommand);
 EXPORT_SYMBOL_GPL(sas_target_alloc);
 EXPORT_SYMBOL_GPL(sas_slave_configure);
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 3ffd6b5..5f183de 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -676,4 +676,6 @@ extern int sas_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 extern void sas_ssp_task_response(struct device *dev, struct sas_task *task,
 				  struct ssp_response_iu *iu);
 
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr);
+
 #endif /* _SASLIB_H_ */
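
For anyone who wants to experiment with the string format outside the kernel, the parsing step in sas_parse_addr() can be sketched in userspace as follows.  The names hex_nibble() and parse_sas_addr() are made up for this example, and the input validation is an addition of the sketch; the function in the patch does not validate its input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define SAS_ADDR_SIZE 8	/* a SAS address is 8 bytes, i.e. 16 hex digits */

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_nibble(char c)
{
	unsigned char uc = (unsigned char)c;

	if (isdigit(uc))
		return uc - '0';
	if (isxdigit(uc))
		return toupper(uc) - 'A' + 10;
	return -1;
}

/*
 * Userspace sketch of the parsing step.  Unlike the patch's
 * sas_parse_addr(), this version rejects short or non-hex strings.
 * Returns 0 on success, -1 on error.
 */
static int parse_sas_addr(uint8_t out[SAS_ADDR_SIZE], const char *s)
{
	size_t i;

	assert(out != NULL && s != NULL);
	for (i = 0; i < SAS_ADDR_SIZE; i++) {
		int h = hex_nibble(s[2 * i]);
		int l;

		if (h < 0)
			return -1;	/* also catches the terminating NUL */
		l = hex_nibble(s[2 * i + 1]);
		if (l < 0)
			return -1;
		out[i] = (uint8_t)((h << 4) | l);
	}
	return 0;
}
```

Checking the high nibble before reading the low one means a short string stops the loop at its terminating NUL instead of reading past it.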

[PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load)

2008-02-19 Thread Darrick J. Wong

If we send an ABORT_TASK ascb that doesn't return within the timeout period,
we should not free that ascb because the sequencer is still holding onto it.
Hopefully it will fix what James Bottomley describes below:

On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote:

> Unfortunately, there's a bug in TMF timeout handling in the driver, it
> leaves the sequencer entry pending, but frees the ascb.  If the
> sequencer ever picks this up it will get very confused, as it does a
> while down in the trace:
>
> > aic94xx: BUG:sequencer:dl:no ascb?!
> > aic94xx: BUG:sequencer:dl:no ascb?!
>
> That's where the sequencer adds an ascb to the done list that we've
> already freed.  From this point on confusion reigns and the error
> handler eventually offlines the device.
>
> I'll see if I can come up with patches to fix this ... or at least
> mitigate the problems it causes.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/aic94xx/aic94xx_tmf.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c
index b52124f..4b24bd3 100644
--- a/drivers/scsi/aic94xx/aic94xx_tmf.c
+++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
@@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task)
 						   AIC94XX_SCB_TIMEOUT);
 		spin_lock_irqsave(&task->task_state_lock, flags);
 		if (leftover < 1)
-			res = TMF_RESP_FUNC_FAILED;
+			goto out_not_reported;
 		if (task->task_state_flags & SAS_TASK_STATE_DONE)
 			res = TMF_RESP_FUNC_COMPLETE;
 		spin_unlock_irqrestore(&task->task_state_lock, flags);
@@ -487,6 +487,11 @@ out:
 	asd_ascb_free(ascb);
 	ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res);
 	return res;
+
+out_not_reported:
+	spin_unlock_irqrestore(&task->task_state_lock, flags);
+	ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task);
+	return res;
 }
 
 /**

[PATCH 2/2] aic94xx: Use sas_request_addr() to provide SAS WWN if the adapter lacks one

2008-02-19 Thread Darrick J. Wong

If the aic94xx chip doesn't have a SAS address in the chip's flash memory,
make libsas get one for us.

Resend of 8 Oct 2007 patch, now based off 2.6.25-rc2 + scsi_misc.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/aic94xx/aic94xx.h  |   16 
 drivers/scsi/aic94xx/aic94xx_hwi.c  |   20 +---
 drivers/scsi/aic94xx/aic94xx_init.c |2 --
 3 files changed, 9 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx.h b/drivers/scsi/aic94xx/aic94xx.h
index 32f513b..aee235f 100644
--- a/drivers/scsi/aic94xx/aic94xx.h
+++ b/drivers/scsi/aic94xx/aic94xx.h
@@ -58,7 +58,6 @@
 
 extern struct kmem_cache *asd_dma_token_cache;
 extern struct kmem_cache *asd_ascb_cache;
-extern char sas_addr_str[2*SAS_ADDR_SIZE + 1];
 
 static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 {
@@ -68,21 +67,6 @@ static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 	*p = '\0';
 }
 
-static inline void asd_destringify_sas_addr(u8 *sas_addr, const char *p)
-{
-	int i;
-	for (i = 0; i < SAS_ADDR_SIZE; i++) {
-		u8 h, l;
-		if (!*p)
-			break;
-		h = isdigit(*p) ? *p-'0' : *p-'A'+10;
-		p++;
-		l = isdigit(*p) ? *p-'0' : *p-'A'+10;
-		p++;
-		sas_addr[i] = (h<<4) | l;
-	}
-}
-
 struct asd_ha_struct;
 struct asd_ascb;
 
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index 098b5f3..940a207 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -27,6 +27,7 @@
 #include <linux/pci.h>
 #include <linux/delay.h>
 #include <linux/module.h>
+#include <linux/firmware.h>
 
 #include "aic94xx.h"
 #include "aic94xx_reg.h"
@@ -38,16 +39,14 @@ u32 MBAR0_SWB_SIZE;
 
 /* -- Initialization -- */
 
-static void asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
+static int asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
 {
-	extern char sas_addr_str[];
-	/* If the user has specified a WWN it overrides other settings
-	 */
-	if (sas_addr_str[0] != '\0')
-		asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr,
-					 sas_addr_str);
-	else if (asd_ha->hw_prof.sas_addr[0] != 0)
-		asd_stringify_sas_addr(sas_addr_str, asd_ha->hw_prof.sas_addr);
+	/* adapter came with a sas address */
+	if (asd_ha->hw_prof.sas_addr[0])
+		return 0;
+
+	return sas_request_addr(asd_ha->sas_ha.core.shost,
+				asd_ha->hw_prof.sas_addr);
 }
 
 static void asd_propagate_sas_addr(struct asd_ha_struct *asd_ha)
@@ -657,8 +656,7 @@ int asd_init_hw(struct asd_ha_struct *asd_ha)
 
 	asd_init_ctxmem(asd_ha);
 
-	asd_get_user_sas_addr(asd_ha);
-	if (!asd_ha->hw_prof.sas_addr[0]) {
+	if (asd_get_user_sas_addr(asd_ha)) {
 		asd_printk("No SAS Address provided for %s\n",
 			   pci_name(asd_ha->pcidev));
 		err = -ENODEV;
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index 5d761eb..1824b0b 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -56,8 +56,6 @@ MODULE_PARM_DESC(collector, "\n"
 	"\tThe aic94xx SAS LLDD supports both modes.\n"
 	"\tDefault: 0 (Direct Mode).\n");
 
-char sas_addr_str[2*SAS_ADDR_SIZE + 1] = "";
-
 static struct scsi_transport_template *aic94xx_transport_template;
 static int asd_scan_finished(struct Scsi_Host *, unsigned long);
 static void asd_scan_start(struct Scsi_Host *);

Re: [patch 03/13] mptbase: reset ioc initiator during PCI resume

2008-02-07 Thread Darrick J. Wong

On Thu, Feb 07, 2008 at 06:41:25PM -0600, James Bottomley wrote:
> On Mon, 2008-02-04 at 23:53 -0800, [EMAIL PROTECTED] wrote:
> > From: Darrick J. Wong [EMAIL PROTECTED]
> >
> > It appears that the LSI SAS 1064E chip needs to be reset after a
> > suspend/resume cycle before the driver attempts further communications with
> > the chip.  Without this patch, resuming the chip results in this error
> > message being printed repeatedly and no more disk I/O.
> >
> > mptbase: ioc0: ERROR - Invalid IOC facts reply, msgLength=0 offsetof=6!
> >
> > So far it seems to fix suspend/resume on all the MPT Fusion cards I have
> > (SAS and U320 SCSI) but since I don't know the internals of that chip I
> > can't say for sure if this is a proper fix.
> >
> > Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
> > Signed-off-by: Andrew Morton [EMAIL PROTECTED]
>
> Ping on this, please Eric.

As far as I can tell, Eric isn't really involved with this patch
anymore, and handed it over to [EMAIL PROTECTED]  I received email
from him (her?  Apologies, I'm not sufficiently familiar with Indian
names) this morning saying that a modified version of it would go out to
linux-scsi in a day or two.

--D

Re: aic94xx: failing on high load (another data point)

2008-01-30 Thread Darrick J. Wong

On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
 
> V28.  My controller functions well with a single drive (low-medium load).
> Unfortunately, all attempts to get the mirrors in sync fail and usually hang
> the whole box.

Adaptec posted a V30 sequencer on their website; does that fix the
problems?

http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm

--D

Re: [PATCH] libsas: fix sense_buffer overrun

2008-01-14 Thread Darrick J. Wong

Looks sane to me;
Acked-by: Darrick J. Wong [EMAIL PROTECTED]

--D

On Sun, Jan 13, 2008 at 02:20:18AM +0900, FUJITA Tomonori wrote:
 
> Signed-off-by: FUJITA Tomonori [EMAIL PROTECTED]
> ---
>  drivers/scsi/libsas/sas_scsi_host.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
> index b784089..828fed1 100644
> --- a/drivers/scsi/libsas/sas_scsi_host.c
> +++ b/drivers/scsi/libsas/sas_scsi_host.c
> @@ -108,7 +108,7 @@ static void sas_scsi_task_done(struct sas_task *task)
>  			break;
>  		case SAM_CHECK_COND:
>  			memcpy(sc->sense_buffer, ts->buf,
> -			       max(SCSI_SENSE_BUFFERSIZE, ts->buf_valid_size));
> +			       min(SCSI_SENSE_BUFFERSIZE, ts->buf_valid_size));
>  			stat = SAM_CHECK_COND;
>  			break;
>  		default:
> --
> 1.5.3.4
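
The reason min() is the right bound here can be shown with a small standalone model.  SENSE_BUFFERSIZE and copy_sense() are stand-ins chosen for this sketch; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SENSE_BUFFERSIZE 96	/* illustrative size, chosen for this sketch */

/*
 * Copy at most SENSE_BUFFERSIZE bytes of sense data.  Taking the
 * minimum of the destination size and the valid length keeps the copy
 * inside the destination; taking the maximum (the bug being fixed)
 * would overrun it whenever valid_len is larger.
 */
static size_t copy_sense(unsigned char dst[SENSE_BUFFERSIZE],
			 const unsigned char *src, size_t valid_len)
{
	size_t n = valid_len < SENSE_BUFFERSIZE ? valid_len : SENSE_BUFFERSIZE;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, n);
	return n;
}
```

With max(), a short response would also make memcpy() read past the end of the valid source data, so the old code was wrong in both directions.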

Re: aic94xx: failing on high load

2008-01-14 Thread Darrick J. Wong

On Mon, Jan 14, 2008 at 03:49:16PM +0100, Jan Sembera wrote:
> Hi,
>
>   we have array of 16 SAS disks connected to Adaptec controllers
> ...
> this elsewhere and I was recommended to send it to linux-scsi.

Hmm... I think Peter Bogdanovic was hitting this error recently (cc'd).
There are a lot of PRIMITIVE_RECVD messages in the log, which make me
wonder if the expander is being flaky or something?  The commands that
start timing out under heavy load followed by the repeated broadcasts
might be indicative of that, since the sequencer firmware and the kernel
driver are up to date.  Unfortunately, I don't have any LSI expanders...

--D

Re: [PATCH] libsas: don't use made up error codes

2007-12-31 Thread Darrick J. Wong

On Sun, Dec 30, 2007 at 12:37:31PM -0600, James Bottomley wrote:
> This is bad for two reasons:
>
>      1. If they're returned to outside applications, no-one knows what
>         they mean.
>      2. Eventually they'll clash with the ever expanding standard error
>         codes.
>
> The problem error code in question is ETASK.  I've replaced this by
> ECOMM (communications error on send) a network error code that seems to
> most closely relay what ETASK meant.

Yay, cleanups :)

Acked-by: Darrick J. Wong [EMAIL PROTECTED]

Re: [PATCH] drivers/scsi/: Spelling fixes

2007-12-17 Thread Darrick J. Wong

On Mon, Dec 17, 2007 at 11:40:14AM -0800, Joe Perches wrote:

>  drivers/scsi/scsi_transport_sas.c |    2 +-

SAS bits are
Acked-by: Darrick J. Wong [EMAIL PROTECTED]

[PATCH] libsas: Don't issue commands to devices that have been hot-removed.

2007-12-04 Thread Darrick J. Wong

Hrm... does this patch help?  You'll get a bunch of ATA/SAS disk errors
printed to the screen if you yank the disk, but at least libsas won't
get stuck waiting for the cache-flush commands to time out.
---
sd will get hung up issuing commands to flush write cache if a SAS device
is unplugged without warning.  Change libsas to reject commands to domain
devices that have already gone away.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_ata.c       |    4 ++++
 drivers/scsi/libsas/sas_expander.c  |    3 +++
 drivers/scsi/libsas/sas_port.c      |    2 ++
 drivers/scsi/libsas/sas_scsi_host.c |    7 +++++++
 include/scsi/libsas.h               |    1 +
 5 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 0829b55..f5e5213 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -161,6 +161,10 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc)
 	unsigned int num = 0;
 	unsigned int xfer = 0;
 
+	/* If the device fell off, no sense in issuing commands */
+	if (dev->gone)
+		return AC_ERR_SYSTEM;
+
 	task = sas_alloc_task(GFP_ATOMIC);
 	if (!task)
 		return AC_ERR_SYSTEM;
diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index 27674fe..4ba4d2a 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -1680,6 +1680,7 @@ static void sas_unregister_ex_tree(struct domain_device *dev)
 	struct domain_device *child, *n;
 
 	list_for_each_entry_safe(child, n, &ex->children, siblings) {
+		child->gone = 1;
 		if (child->dev_type == EDGE_DEV ||
 		    child->dev_type == FANOUT_DEV)
 			sas_unregister_ex_tree(child);
@@ -1699,6 +1700,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
 		list_for_each_entry_safe(child, n, &ex_dev->children, siblings) {
 			if (SAS_ADDR(child->sas_addr) ==
 			    SAS_ADDR(phy->attached_sas_addr)) {
+				child->gone = 1;
 				if (child->dev_type == EDGE_DEV ||
 				    child->dev_type == FANOUT_DEV)
 					sas_unregister_ex_tree(child);
@@ -1707,6 +1709,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
 				break;
 			}
 		}
+		parent->gone = 1;
 		sas_disable_routing(parent, phy->attached_sas_addr);
 		memset(phy->attached_sas_addr, 0, SAS_ADDR_SIZE);
 		sas_port_delete_phy(phy->port, phy->phy);
diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c
index b6f0243..2e82097 100644
--- a/drivers/scsi/libsas/sas_port.c
+++ b/drivers/scsi/libsas/sas_port.c
@@ -144,6 +144,8 @@ void sas_deform_port(struct asd_sas_phy *phy)
 		port->port_dev->pathways--;
 
 	if (port->num_phys == 1) {
+		if (port->port_dev)
+			port->port_dev->gone = 1;
 		sas_unregister_domain_devices(port);
 		sas_port_delete(port->port);
 		port->port = NULL;
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index c29ba47..61d2679 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -228,6 +228,13 @@ int sas_queuecommand(struct scsi_cmnd *cmd,
 			goto out;
 		}
 
+		/* If the device fell off, no sense in issuing commands */
+		if (dev->gone) {
+			cmd->result = DID_BAD_TARGET << 16;
+			scsi_done(cmd);
+			goto out;
+		}
+
 		res = -ENOMEM;
 		task = sas_create_task(cmd, dev, GFP_ATOMIC);
 		if (!task)
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 8ad7465..73c5b15 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -207,6 +207,7 @@ struct domain_device {
 	};
 
 	void *lldd_dev;
+	int gone;
 };
 
 struct sas_discovery_event {
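
The behaviour this patch is after can be modelled in a few lines of standalone C.  The structures below are cut-down stand-ins with only the fields the sketch needs, and queue_command() is a made-up name; the real sas_queuecommand() does considerably more.

```c
#include <assert.h>
#include <stddef.h>

#define DID_OK		0x00
#define DID_BAD_TARGET	0x04	/* host byte value used by the Linux SCSI layer */

/* Minimal stand-ins for the kernel structures; only the fields used here. */
struct domain_device { int gone; };
struct scsi_cmnd { int result; int issued; };

/*
 * Model of the check added to sas_queuecommand(): once the device has
 * been marked gone, complete the command immediately with
 * DID_BAD_TARGET instead of sending it to the hardware.
 */
static void queue_command(struct domain_device *dev, struct scsi_cmnd *cmd)
{
	assert(dev != NULL && cmd != NULL);
	if (dev->gone) {
		cmd->result = DID_BAD_TARGET << 16;
		cmd->issued = 0;
		return;
	}
	cmd->result = DID_OK << 16;
	cmd->issued = 1;	/* stand-in for handing the task to the LLDD */
}
```

Failing the command up front is what keeps sd's cache-flush attempt from waiting on a timeout for a disk that is no longer there.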

Re: [PATCH] libsas: Don't issue commands to devices that have been hot-removed.

2007-12-04 Thread Darrick J. Wong

On Tue, Dec 04, 2007 at 05:48:33PM -0500, Jeff Garzik wrote:

> As an aside, issues like this really really imply a need to move libsas
> away from the old libata EH stuff (like brking did with ipr, in patches).

Hm... does the new libata EH handle the case of "device was
unplugged, don't bother trying to send any more commands"?

In general, I agree that sas-ata should adopt the new EH.
Unfortunately, I believe the old way of sas-ata configuring ATA ports is
somehow not compatible with the new EH stuff and causes a crash during
the device probe with my patch to move sas-ata to the new EH.  If I
apply the patch that migrates sas-ata to use brking's latest ata-sas
configuration mechanism (the one that creates real ata_hosts), I see
(a) lots and lots of ATA hosts getting created (one per ATA port;
possibly undesirable if you've a SAS topology with a lot of SATA disks)
and (b) NCQ disks don't seem to work if you unplug the disk and plug
it back in (unless NCQ is disabled entirely).  Jeff, by any chance have
you tried plugging SATA devices into your SAS controllers?

James Bottomley wondered if it would be easier to have sas-ata call only
into the parts of libata that convert SCSI commands to ATA taskfiles,
though I'm unsure how many wormy cans that would open.

--D

Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives

2007-12-03 Thread Darrick J. Wong

On Mon, Dec 03, 2007 at 05:09:54PM +0100, Krzysztof Błaszkowski wrote:
>
> I noticed also another failure when i removed a drive. The event was not
> notified by anything (ie the block device and corresponding sg were
> registered) so i run dd on this truly virtual drive.
>
> dd reached D state (as well as scsi_wq) . i think it shouldn't happen no
> matter it was AIC failure or LSI expander failure.

It's wireless! ;)

Seriously, though, it's a good idea to tell the kernel that you're
about to unplug a disk before actually doing it:

echo 1 > /sys/block/sdX/device/delete

This way, the kernel can tell the disk to flush its caches long before
power actually gets removed.  Otherwise, the device removal code can
get hung up just like you observed, and whatever's in the write cache
may or may not actually get written to the media.

--D
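
If the removal request has to come from a program rather than a shell, the same sysfs write can be sketched in C as below.  request_device_delete() is a made-up helper name; for a real disk the path would be /sys/block/sdX/device/delete and the caller would need root.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" to a sysfs-style attribute file, as "echo 1 > path" would.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int request_device_delete(const char *path)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fputs("1\n", f) >= 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Checking the result of fclose() matters here because buffered output may only reach the file at close time, so a failed write can go unnoticed until then.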

Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives

2007-12-03 Thread Darrick J. Wong

On Mon, Dec 03, 2007 at 02:43:09PM -0500, Jeff Garzik wrote:

> But what do you mean by "device removal code can get hung up"?  That sounds
> like a bug we should fix.

At the moment, libsas' sas_rphy_remove function doesn't distinguish between
removing a device before or after the disk has been disconnected.
Hence, sd_shutdown tries to tell the disk to flush the write cache, even
in the case that the disk is already gone.  Maybe the solution is to
modify aic94xx to remove the device's DDB registration prior to sending
the "device gone" event to libsas so that all subsequent commands bounce
with "no such device" instead of going out to lunch.
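
To make the idea concrete, here is a toy userspace model in C.  It is only
a sketch of the proposed behaviour, not libsas or aic94xx code: the
toy_disk structure and all toy_* functions are invented for illustration,
and the "present" flag stands in for the DDB registration.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a disk behind the HBA (hypothetical, not a libsas type). */
struct toy_disk {
	bool present;      /* cleared as soon as the unplug is detected */
	int flushes_sent;  /* cache flushes that actually reached the disk */
};

/* Submit path: bounce immediately instead of queueing to a vanished disk. */
int toy_submit_flush(struct toy_disk *d)
{
	if (!d->present)
		return -ENODEV;
	d->flushes_sent++;
	return 0;
}

/* Hot-unplug: drop the registration before telling the upper layers. */
void toy_mark_gone(struct toy_disk *d)
{
	d->present = false;
}

/* Shutdown: a bounced flush is not fatal, so removal can still finish. */
int toy_shutdown(struct toy_disk *d)
{
	int err = toy_submit_flush(d);

	return err == -ENODEV ? 0 : err;
}
```

With this ordering, a shutdown on a disk that is still attached sends one
flush, while a shutdown after toy_mark_gone() returns right away instead
of waiting on a command that can never complete.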

(I'll look into this later, as I myself am going out to lunch right now.)

--D

[PATCH 2/2] libsas: Use new ATA configuration mechanism

2007-11-12 Thread Darrick J. Wong

Update sas_ata to use the new ata_sas_rphy mechanisms as provided by
Brian King, and simplify ATA device discovery...

WARNING WARNING WARNING!  This patch is experimental, use at your own
risk.

Comments-requested-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_ata.c   |  206 +--
 drivers/scsi/libsas/sas_discover.c  |4 +
 drivers/scsi/libsas/sas_scsi_host.c |   37 +-
 include/scsi/libsas.h   |4 -
 4 files changed, 91 insertions(+), 160 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index a9925d5..c6b4213 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -35,6 +35,13 @@
 #include "../scsi_transport_api.h"
 #include <scsi/scsi_eh.h>
 
+struct sas_ata_descr {
+   struct ata_sas_rphy rphy;
+   struct scsi_host_template sht;
+};
+
+#define ata_rphy_to_descr(x) container_of((x), struct sas_ata_descr, rphy)
+
 static int sas_issue_ata_srst(struct domain_device *dev);
 
 static enum ata_completion_errors sas_to_ata_err(struct task_status_struct *ts)
@@ -323,55 +330,6 @@ static void sas_ata_tf_read(struct ata_port *ap, struct ata_taskfile *tf)
 	memcpy(tf, &dev->sata_dev.tf, sizeof (*tf));
 }
 
-static int sas_ata_scr_write(struct ata_port *ap, unsigned int sc_reg_in,
-			      u32 val)
-{
-	struct domain_device *dev = ap->private_data;
-
-	SAS_DPRINTK("STUB %s\n", __FUNCTION__);
-	switch (sc_reg_in) {
-		case SCR_STATUS:
-			dev->sata_dev.sstatus = val;
-			break;
-		case SCR_CONTROL:
-			dev->sata_dev.scontrol = val;
-			break;
-		case SCR_ERROR:
-			dev->sata_dev.serror = val;
-			break;
-		case SCR_ACTIVE:
-			dev->sata_dev.ap->link.sactive = val;
-			break;
-		default:
-			return -EINVAL;
-	}
-	return 0;
-}
-
-static int sas_ata_scr_read(struct ata_port *ap, unsigned int sc_reg_in,
-			    u32 *val)
-{
-	struct domain_device *dev = ap->private_data;
-
-	SAS_DPRINTK("STUB %s\n", __FUNCTION__);
-	switch (sc_reg_in) {
-		case SCR_STATUS:
-			*val = dev->sata_dev.sstatus;
-			return 0;
-		case SCR_CONTROL:
-			*val = dev->sata_dev.scontrol;
-			return 0;
-		case SCR_ERROR:
-			*val = dev->sata_dev.serror;
-			return 0;
-		case SCR_ACTIVE:
-			*val = dev->sata_dev.ap->link.sactive;
-			return 0;
-		default:
-			return -EINVAL;
-	}
-}
-
 static struct ata_port_operations sas_sata_ops = {
.check_status   = sas_ata_check_status,
.check_altstatus= sas_ata_check_status,
@@ -385,8 +343,6 @@ static struct ata_port_operations sas_sata_ops = {
.qc_issue   = sas_ata_qc_issue,
.port_start = ata_sas_port_start,
.port_stop  = ata_sas_port_stop,
-   .scr_read   = sas_ata_scr_read,
-   .scr_write  = sas_ata_scr_write
 };
 
 static struct ata_port_info sata_port_info = {
@@ -398,33 +354,6 @@ static struct ata_port_info sata_port_info = {
 	.port_ops = &sas_sata_ops
 };
 
-int sas_ata_init_host_and_port(struct domain_device *found_dev,
-			       struct scsi_target *starget)
-{
-	struct Scsi_Host *shost = dev_to_shost(&starget->dev);
-	struct sas_ha_struct *ha = SHOST_TO_SAS_HA(shost);
-	struct ata_port *ap;
-
-	ata_host_init(&found_dev->sata_dev.ata_host,
-		      ha->dev,
-		      sata_port_info.flags,
-		      &sas_sata_ops);
-	ap = ata_sas_port_alloc(&found_dev->sata_dev.ata_host,
-				&sata_port_info,
-				shost);
-	if (!ap) {
-		SAS_DPRINTK("ata_sas_port_alloc failed.\n");
-		return -ENODEV;
-	}
-
-	ap->private_data = found_dev;
-	ap->cbl = ATA_CBL_SATA;
-	ap->scsi_host = shost;
-	found_dev->sata_dev.ap = ap;
-
-	return 0;
-}
-
 void sas_ata_task_abort(struct sas_task *task)
 {
 	struct ata_queued_cmd *qc = task->uldd_task;
@@ -601,50 +530,6 @@ out:
 }
 
 /* -- SATA -- */
-
-static void sas_get_ata_command_set(struct domain_device *dev)
-{
-	struct dev_to_host_fis *fis =
-		(struct dev_to_host_fis *) dev->frame_rcvd;
-
-	if ((fis->sector_count == 1 && /* ATA */
-	     fis->lbal == 1 &&
-	     fis->lbam == 0 &&
-	     fis->lbah == 0 &&
-	     fis->device == 0)
-	    ||
-	    (fis->sector_count

[PATCH 1/2] libsas: Convert ATA bridge to use new EH

2007-11-12 Thread Darrick J. Wong

Migrate the sas_ata bridge to use the new libata EH strategy, and
finally implement correct software reset.

WARNING WARNING WARNING!  This patch is for experimental use only; it is
nowhere near complete!  Especially the sas_ata_freeze() function.  This
patch may eat your data and kill your trees.

jgarzik: If an ATA command was in-progress at the time of a port freeze,
can complete after thawing?  (Does that even make sense?)

Comments-requested-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_ata.c |   86 ++---
 1 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 0829b55..a9925d5 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -35,6 +35,8 @@
 #include "../scsi_transport_api.h"
 #include <scsi/scsi_eh.h>
 
+static int sas_issue_ata_srst(struct domain_device *dev);
+
 static enum ata_completion_errors sas_to_ata_err(struct task_status_struct *ts)
 {
/* Cheesy attempt to translate SAS errors into ATA.  Hah! */
@@ -233,37 +235,58 @@ static u8 sas_ata_check_status(struct ata_port *ap)
 	return dev->sata_dev.tf.command;
 }
 
-static void sas_ata_phy_reset(struct ata_port *ap)
+static void sas_ata_freeze(struct ata_port *ap)
 {
-	struct domain_device *dev = ap->private_data;
-	struct sas_internal *i =
-		to_sas_internal(dev->port->ha->core.shost->transportt);
-	int res = 0;
+	/* reroute qc_done for all qc's on this port to a dumb free func */
+	/* i wonder if we can get away with throwing out anything that
+	 * completes in this time frame, or if we must find the commands
+	 * that are in progress and cancel only those? */
+	printk(KERN_ERR "%s: STUB\n", __FUNCTION__);
+}
 
-	if (i->dft->lldd_I_T_nexus_reset)
-		res = i->dft->lldd_I_T_nexus_reset(dev);
+static void sas_ata_thaw(struct ata_port *ap)
+{
+	/* empty */
+	printk(KERN_ERR "%s: STUB\n", __FUNCTION__);
+}
 
-	if (res)
-		SAS_DPRINTK("%s: Unable to reset I T nexus?\n", __FUNCTION__);
+static int sas_ata_soft_reset(struct ata_link *link, unsigned int *classes,
+			       unsigned long deadline)
+{
+	struct ata_port *ap = link->ap;
+	struct domain_device *dev = ap->private_data;
+	int res;
 
+	/* Send SRST to device */
+	res = sas_issue_ata_srst(dev);
+	printk(KERN_ERR "srst 0 returns %d\n", res);
+
+	/* Set new device type */
 	switch (dev->sata_dev.command_set) {
 		case ATA_COMMAND_SET:
 			SAS_DPRINTK("%s: Found ATA device.\n", __FUNCTION__);
-			ap->link.device[0].class = ATA_DEV_ATA;
+			*classes = ATA_DEV_ATA;
 			break;
 		case ATAPI_COMMAND_SET:
 			SAS_DPRINTK("%s: Found ATAPI device.\n", __FUNCTION__);
-			ap->link.device[0].class = ATA_DEV_ATAPI;
+			*classes = ATA_DEV_ATAPI;
 			break;
 		default:
 			SAS_DPRINTK("%s: Unknown SATA command set: %d.\n",
 				    __FUNCTION__,
 				    dev->sata_dev.command_set);
-			ap->link.device[0].class = ATA_DEV_UNKNOWN;
-			break;
+			*classes = ATA_DEV_UNKNOWN;
+			break;
 	}
 
-	ap->cbl = ATA_CBL_SATA;
+	/* FIXME: What if SRST fails? */
+	return 0;
+}
+
+static void sas_ata_error_handler(struct ata_port *ap)
+{
+	ata_do_eh(ap, NULL, sas_ata_soft_reset, NULL, NULL);
+	//uh... hopefully there's no commands left in here?
 }
 
 static void sas_ata_post_internal(struct ata_queued_cmd *qc)
@@ -353,7 +376,9 @@ static struct ata_port_operations sas_sata_ops = {
.check_status   = sas_ata_check_status,
.check_altstatus= sas_ata_check_status,
.dev_select = ata_noop_dev_select,
-   .phy_reset  = sas_ata_phy_reset,
+   .error_handler  = sas_ata_error_handler,
+   .freeze = sas_ata_freeze,
+   .thaw   = sas_ata_thaw,
.post_internal_cmd  = sas_ata_post_internal,
.tf_read= sas_ata_tf_read,
.qc_prep= ata_noop_qc_prep,
@@ -658,6 +683,37 @@ out:
return res;
 }
 
+static int sas_issue_ata_srst(struct domain_device *dev)
+{
+	int res = 0;
+	struct sas_task *task;
+	struct dev_to_host_fis *d2h_fis = (struct dev_to_host_fis *)
+		&dev->frame_rcvd[0];
+
+	res = -ENOMEM;
+	task = sas_alloc_task(GFP_KERNEL);
+	if (!task)
+		goto out;
+
+	task->dev = dev;
+
+	task->ata_task.fis.fis_type = 0x27;
+	/* FIXME: What's a good dummy command? */
+	task->ata_task.fis.command

Re: [2.6 patch] scsi/aic94xx/: cleanups

2007-11-05 Thread Darrick J. Wong

On Mon, Nov 05, 2007 at 06:07:29PM +0100, Adrian Bunk wrote:
> This patch contains the following cleanups:
> - static functions in .c files shouldn't be marked inline
> - make needlessly global code static
> - #if 0 unused code

asd_unpause_lseq can be removed; the other if 0'd functions are debug
functions and can probably stay.

Otherwise, ack.

--D

[PATCH 1/2] libsas: Convert sas_proto users to sas_protocol

2007-11-05 Thread Darrick J. Wong

sparse complains about the mixing of enums in libsas.  Since the
underlying numeric values of both enums are the same, combine them
to get rid of the warning.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/aic94xx/aic94xx_dev.c  |6 +++---
 drivers/scsi/aic94xx/aic94xx_dump.c |4 ++--
 drivers/scsi/aic94xx/aic94xx_hwi.c  |2 +-
 drivers/scsi/aic94xx/aic94xx_scb.c  |6 +++---
 drivers/scsi/aic94xx/aic94xx_task.c |   30 +++---
 drivers/scsi/aic94xx/aic94xx_tmf.c  |   12 ++--
 drivers/scsi/libsas/sas_discover.c  |2 +-
 drivers/scsi/libsas/sas_expander.c  |6 +++---
 drivers/scsi/libsas/sas_internal.h  |2 +-
 include/scsi/libsas.h   |   18 +-
 include/scsi/sas.h  |   13 ++---
 include/scsi/scsi_transport_sas.h   |8 +---
 12 files changed, 51 insertions(+), 58 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_dev.c b/drivers/scsi/aic94xx/aic94xx_dev.c
index 3dce618..72042ca 100644
--- a/drivers/scsi/aic94xx/aic94xx_dev.c
+++ b/drivers/scsi/aic94xx/aic94xx_dev.c
@@ -165,7 +165,7 @@ static int asd_init_target_ddb(struct domain_device *dev)
 	if (dev->port->oob_mode != SATA_OOB_MODE) {
 		flags |= OPEN_REQUIRED;
 		if ((dev->dev_type == SATA_DEV) ||
-		    (dev->tproto & SAS_PROTO_STP)) {
+		    (dev->tproto & SAS_PROTOCOL_STP)) {
 			struct smp_resp *rps_resp = &dev->sata_dev.rps_resp;
 			if (rps_resp->frame_type == SMP_RESPONSE &&
 			    rps_resp->function == SMP_REPORT_PHY_SATA &&
@@ -193,7 +193,7 @@ static int asd_init_target_ddb(struct domain_device *dev)
 	asd_ddbsite_write_byte(asd_ha, ddb, DDB_TARG_FLAGS, flags);
 
 	flags = 0;
-	if (dev->tproto & SAS_PROTO_STP)
+	if (dev->tproto & SAS_PROTOCOL_STP)
 		flags |= STP_CL_POL_NO_TX;
 	asd_ddbsite_write_byte(asd_ha, ddb, DDB_TARG_FLAGS2, flags);
 
@@ -201,7 +201,7 @@ static int asd_init_target_ddb(struct domain_device *dev)
 	asd_ddbsite_write_word(asd_ha, ddb, SEND_QUEUE_TAIL, 0xFFFF);
 	asd_ddbsite_write_word(asd_ha, ddb, SISTER_DDB, 0xFFFF);
 
-	if (dev->dev_type == SATA_DEV || (dev->tproto & SAS_PROTO_STP)) {
+	if (dev->dev_type == SATA_DEV || (dev->tproto & SAS_PROTOCOL_STP)) {
 		i = asd_init_sata(dev);
 		if (i < 0) {
 			asd_free_ddb(asd_ha, ddb);
diff --git a/drivers/scsi/aic94xx/aic94xx_dump.c b/drivers/scsi/aic94xx/aic94xx_dump.c
index 6bd8e30..3d8c4ff 100644
--- a/drivers/scsi/aic94xx/aic94xx_dump.c
+++ b/drivers/scsi/aic94xx/aic94xx_dump.c
@@ -903,11 +903,11 @@ void asd_dump_frame_rcvd(struct asd_phy *phy,
 	int i;
 
 	switch ((dl->status_block[1] & 0x70) >> 3) {
-	case SAS_PROTO_STP:
+	case SAS_PROTOCOL_STP:
 		ASD_DPRINTK("STP proto device-to-host FIS:\n");
 		break;
 	default:
-	case SAS_PROTO_SSP:
+	case SAS_PROTOCOL_SSP:
 		ASD_DPRINTK("SAS proto IDENTIFY:\n");
 		break;
 	}
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index fb2be39..940a207 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -90,7 +90,7 @@ static int asd_init_phy(struct asd_phy *phy)
 
 	sas_phy->enabled = 1;
 	sas_phy->class = SAS;
-	sas_phy->iproto = SAS_PROTO_ALL;
+	sas_phy->iproto = SAS_PROTOCOL_ALL;
 	sas_phy->tproto = 0;
 	sas_phy->type = PHY_TYPE_PHYSICAL;
 	sas_phy->role = PHY_ROLE_INITIATOR;
diff --git a/drivers/scsi/aic94xx/aic94xx_scb.c b/drivers/scsi/aic94xx/aic94xx_scb.c
index db6ab1a..0febad4 100644
--- a/drivers/scsi/aic94xx/aic94xx_scb.c
+++ b/drivers/scsi/aic94xx/aic94xx_scb.c
@@ -788,12 +788,12 @@ void asd_build_control_phy(struct asd_ascb *ascb, int phy_id, u8 subfunc)
 
 	/* initiator port settings are in the hi nibble */
 	if (phy->sas_phy.role == PHY_ROLE_INITIATOR)
-		control_phy->port_type = SAS_PROTO_ALL << 4;
+		control_phy->port_type = SAS_PROTOCOL_ALL << 4;
 	else if (phy->sas_phy.role == PHY_ROLE_TARGET)
-		control_phy->port_type = SAS_PROTO_ALL;
+		control_phy->port_type = SAS_PROTOCOL_ALL;
 	else
 		control_phy->port_type =
-			(SAS_PROTO_ALL << 4) | SAS_PROTO_ALL;
+			(SAS_PROTOCOL_ALL << 4) | SAS_PROTOCOL_ALL;
 
 	/* link reset retries, this should be nominal */
 	control_phy->link_reset_retries = 10;
diff --git a/drivers/scsi/aic94xx/aic94xx_task.c b/drivers/scsi/aic94xx/aic94xx_task.c
index e0e58be..68ae5f1 100644
--- a/drivers/scsi/aic94xx/aic94xx_task.c
+++ b/drivers/scsi/aic94xx/aic94xx_task.c
@@ -187,7 +187,7 @@ static void asd_get_response_tasklet(struct

[PATCH 2/2] libsas: Fix various sparse complaints

2007-11-05 Thread Darrick J. Wong

Annotate sas_queuecommand with locking details, and clean up a few
more sparse warnings about static/non-static declarations.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_scsi_host.c |6 +-
 include/scsi/libsas.h   |4 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 0fa0296..c29ba47 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -202,6 +202,10 @@ int sas_queue_up(struct sas_task *task)
  */
 int sas_queuecommand(struct scsi_cmnd *cmd,
 		     void (*scsi_done)(struct scsi_cmnd *))
+	__releases(host->host_lock)
+	__acquires(dev->sata_dev.ap->lock)
+	__releases(dev->sata_dev.ap->lock)
+	__acquires(host->host_lock)
 {
 	int res = 0;
 	struct domain_device *dev = cmd_to_domain_dev(cmd);
@@ -412,7 +416,7 @@ static int sas_recover_I_T(struct domain_device *dev)
 }
 
 /* Find the sas_phy that's attached to this device */
-struct sas_phy *find_local_sas_phy(struct domain_device *dev)
+static struct sas_phy *find_local_sas_phy(struct domain_device *dev)
 {
 	struct domain_device *pdev = dev->parent;
struct ex_phy *exphy = NULL;
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index fe24bbc..cd11fe2 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -563,7 +563,7 @@ struct sas_task {
struct work_struct abort_work;
 };
 
-
+extern struct kmem_cache *sas_task_cache;
 
 #define SAS_TASK_STATE_PENDING  1
 #define SAS_TASK_STATE_DONE 2
@@ -573,7 +573,6 @@ struct sas_task {
 
 static inline struct sas_task *sas_alloc_task(gfp_t flags)
 {
-   extern struct kmem_cache *sas_task_cache;
struct sas_task *task = kmem_cache_zalloc(sas_task_cache, flags);
 
if (task) {
@@ -590,7 +589,6 @@ static inline struct sas_task *sas_alloc_task(gfp_t flags)
 static inline void sas_free_task(struct sas_task *task)
 {
if (task) {
-   extern struct kmem_cache *sas_task_cache;
 		BUG_ON(!list_empty(&task->list));
kmem_cache_free(sas_task_cache, task);
}
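
For readers who have not used sparse's lock annotations, the lock handoff
that the sas_queuecommand() annotations describe can be sketched in
userspace C.  Everything named toy_* or TOY_* below is invented for this
illustration; the context attribute form is, to my knowledge, how the
kernel's __acquires()/__releases() expand when sparse defines __CHECKER__,
and outside sparse the macros expand to nothing.

```c
#include <assert.h>

/* Stand-ins for the kernel's __acquires()/__releases() (assumption). */
#ifdef __CHECKER__
# define TOY_ACQUIRES(x) __attribute__((context(x, 0, 1)))
# define TOY_RELEASES(x) __attribute__((context(x, 1, 0)))
#else
# define TOY_ACQUIRES(x)
# define TOY_RELEASES(x)
#endif

/* Toy lock that only tracks whether it is held. */
struct toy_lock {
	int held;
};

void toy_take(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

void toy_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/*
 * Entered with host_lock held.  Drops it, takes the port lock for the
 * ATA path, then restores the caller's locking state before returning.
 */
int toy_queue_ata(struct toy_lock *host_lock, struct toy_lock *port_lock)
	TOY_RELEASES(host_lock)
	TOY_ACQUIRES(port_lock)
	TOY_RELEASES(port_lock)
	TOY_ACQUIRES(host_lock)
{
	int queued;

	toy_release(host_lock);
	toy_take(port_lock);
	queued = 1;		/* the command would be queued here */
	toy_release(port_lock);
	toy_take(host_lock);

	return queued;
}
```

The annotations do not change the generated code; they only tell the
checker that the function temporarily leaves and re-enters each lock
context, so the caller still holds the host lock on return.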

Re: [PATCH] aic94xx: Use request_firmware() to provide SAS address if the adapter lacks one

2007-10-09 Thread Darrick J. Wong

On Tue, Oct 09, 2007 at 09:41:47AM -0700, Andrew Vasquez wrote:
> On Tue, 09 Oct 2007, James Smart wrote:
> 
> > Why do you prefer request_firmware() vs something over sysfs ?
> > 
> > Does environments like the kdump kernel also have access to data needed
> > by request_firmware() ?

Assuming the driver-loading parts of the kdump kernel's initrd are the
same (udev, bunch of modules, firmwares, etc) as the regular kernel's
initrd, this shouldn't be a problem.

In the specific case of aic94xx, one needs request_firmware() and
associated infrastructure to load firmware blobs into the controller in
order to issue any I/O at all.

> There's already much in the way of automation and infrastructure
> present in supporting the request_firwmare() interfaces (perhaps not
> the best of names) which can provide for a level of flexibility beyond
> a basic 'soft_port_name' interface.
> 
> Though I don't see why both can't coexist cleanly -- I take it the use
> case you are considering is: software recognizes no valid WWPN
> available, query via request_firmware() fails, software halts
> initialization (rather than fail), and awaits the admin to poke
> '0x123456.. > /sys/.../fc_host/soft_port_name', causing a ping to the
> driver and continuation of initialization with requested portname?

Hmm... could we use such a sysfs attribute to reassign adapter WWNs at
arbitrary times?  Is that even a good idea?

--D

[PATCH] aic94xx: Use request_firmware() to provide SAS address if the adapter lacks one

2007-10-08 Thread Darrick J. Wong

If the aic94xx chip doesn't have a SAS address in the chip's flash memory,
use the request_firmware() interface to get one from userspace.  This
way, there's no debate as to who or how an address gets generated--it's
totally up to the administrator to provide it if the card doesn't have one.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/aic94xx/aic94xx.h  |1 -
 drivers/scsi/aic94xx/aic94xx_hwi.c  |   40 +--
 drivers/scsi/aic94xx/aic94xx_init.c |2 --
 3 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx.h b/drivers/scsi/aic94xx/aic94xx.h
index 32f513b..935d558 100644
--- a/drivers/scsi/aic94xx/aic94xx.h
+++ b/drivers/scsi/aic94xx/aic94xx.h
@@ -58,7 +58,6 @@
 
 extern struct kmem_cache *asd_dma_token_cache;
 extern struct kmem_cache *asd_ascb_cache;
-extern char sas_addr_str[2*SAS_ADDR_SIZE + 1];
 
 static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 {
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index 0cd7eed..82a12cc 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -27,6 +27,7 @@
 #include <linux/pci.h>
 #include <linux/delay.h>
 #include <linux/module.h>
+#include <linux/firmware.h>
 
 #include "aic94xx.h"
 #include "aic94xx_reg.h"
@@ -38,16 +39,34 @@ u32 MBAR0_SWB_SIZE;
 
 /* -- Initialization -- */
 
-static void asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
+#define SAS_STRING_ADDR_SIZE	16
+static int asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
 {
-	extern char sas_addr_str[];
-	/* If the user has specified a WWN it overrides other settings
-	 */
-	if (sas_addr_str[0] != '\0')
-		asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr,
-					 sas_addr_str);
-	else if (asd_ha->hw_prof.sas_addr[0] != 0)
-		asd_stringify_sas_addr(sas_addr_str, asd_ha->hw_prof.sas_addr);
+	const struct firmware *fw;
+	int res;
+
+	/* adapter came with a sas address */
+	if (asd_ha->hw_prof.sas_addr[0])
+		return 0;
+
+	ASD_DPRINTK("No address found for %s; asking for one...\n",
+		    pci_name(asd_ha->pcidev));
+
+	/* else go ask userspace */
+	res = request_firmware(&fw, "sas_addr", &asd_ha->pcidev->dev);
+	if (res)
+		return res;
+
+	if (fw->size < SAS_STRING_ADDR_SIZE) {
+		res = -ENODEV;
+		goto out;
+	}
+
+	asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr, fw->data);
+
+out:
+	release_firmware(fw);
+	return res;
 }
 
 static void asd_propagate_sas_addr(struct asd_ha_struct *asd_ha)
@@ -657,8 +676,7 @@ int asd_init_hw(struct asd_ha_struct *asd_ha)
 
 	asd_init_ctxmem(asd_ha);
 
-	asd_get_user_sas_addr(asd_ha);
-	if (!asd_ha->hw_prof.sas_addr[0]) {
+	if (asd_get_user_sas_addr(asd_ha)) {
 		asd_printk("No SAS Address provided for %s\n",
 			   pci_name(asd_ha->pcidev));
 		err = -ENODEV;
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index b70d6e7..5c99f27 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -54,8 +54,6 @@ MODULE_PARM_DESC(collector, "\n"
 	"\tThe aic94xx SAS LLDD supports both modes.\n"
 	"\tDefault: 0 (Direct Mode).\n");
 
-char sas_addr_str[2*SAS_ADDR_SIZE + 1] = "";
-
 static struct scsi_transport_template *aic94xx_transport_template;
 static int asd_scan_finished(struct Scsi_Host *, unsigned long);
 static void asd_scan_start(struct Scsi_Host *);

Re: [PATCH] aic94xx: Use request_firmware() to provide SAS address if the adapter lacks one

2007-10-08 Thread Darrick J. Wong

On Mon, Oct 08, 2007 at 03:48:32PM -0700, Andrew Vasquez wrote:

> So how about factoring that out to a transport-level interface.  How
> about something along the lines of the following patch, whereby the
> software driver upon detecting no valid WWPN, makes an upcall to each
> interface's 'request_wwn()'.  The data passed in from shost_gendev
> should be enough for some helper script to cull relevent device bits
> and perhaps offer some level of persistence...  Off base?

Hrm... jejb made a remark that it might be better to pass the
scsi_host's device into request_firmware() as your example does, so I'll
pitch in a patch to do likewise with libsas--the scsi_host knows the
actual device it's coming from, and userland can sort that all out later
anyway via DEVPATH.

I suppose one could also have multiple scsi_hosts per PCI device, which
means that my first patch would stumble horribly in more than a few
cases.

> Darrick, forgive the FC example, I don't do SAS...

That's ok, I don't do FC. :)  Looks mostly good to me...

--D

[PATCH 1/2] libsas: Provide a transport-level facility to request SAS addrs

2007-10-08 Thread Darrick J. Wong

Use the request_firmware() interface to get a SAS address from userspace.
This way, there's no debate as to who or how an address gets generated;
it's up to the administrator to provide one if the driver can't find one
on its own.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_scsi_host.c |   41 +++
 include/scsi/libsas.h   |3 +++
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 7663841..0fa0296 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -24,6 +24,8 @@
  */
 
 #include <linux/kthread.h>
+#include <linux/firmware.h>
+#include <linux/ctype.h>
 
 #include "sas_internal.h"
 
@@ -1047,6 +1049,45 @@ void sas_target_destroy(struct scsi_target *starget)
return;
 }
 
+static void sas_parse_addr(u8 *sas_addr, const char *p)
+{
+	int i;
+	for (i = 0; i < SAS_ADDR_SIZE; i++) {
+		u8 h, l;
+		if (!*p)
+			break;
+		h = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+		p++;
+		l = isdigit(*p) ? *p-'0' : toupper(*p)-'A'+10;
+		p++;
+		sas_addr[i] = (h<<4) | l;
+	}
+}
+
+#define SAS_STRING_ADDR_SIZE	16
+
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr)
+{
+	int res;
+	const struct firmware *fw;
+
+	res = request_firmware(&fw, "sas_addr", &shost->shost_gendev);
+	if (res)
+		return res;
+
+	if (fw->size < SAS_STRING_ADDR_SIZE) {
+		res = -ENODEV;
+		goto out;
+	}
+
+	sas_parse_addr(addr, fw->data);
+
+out:
+	release_firmware(fw);
+	return res;
+}
+EXPORT_SYMBOL_GPL(sas_request_addr);
+
 EXPORT_SYMBOL_GPL(sas_queuecommand);
 EXPORT_SYMBOL_GPL(sas_target_alloc);
 EXPORT_SYMBOL_GPL(sas_slave_configure);
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 8dda2d6..58aa2aa 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -676,4 +676,7 @@ extern int sas_ioctl(struct scsi_device *sdev, int cmd, void __user *arg);
 
 extern int sas_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 			   struct request *req);
+
+int sas_request_addr(struct Scsi_Host *shost, u8 *addr);
+
 #endif /* _SASLIB_H_ */
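
Since the address arrives as a 16-character hex string from userspace,
here is a userspace C sketch of a stricter parser, in case it helps when
testing a helper script.  This is not the code in the patch:
parse_sas_addr() and hex_nibble() are names invented for this example,
and unlike sas_parse_addr() above the sketch rejects short input and
non-hex characters instead of converting them blindly.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAS_ADDR_BYTES 8	/* 8 bytes, i.e. 16 hex digits */

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_nibble(unsigned char c)
{
	if (isdigit(c))
		return c - '0';
	c = (unsigned char)toupper(c);
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical helper: parse the first 16 hex digits of buf into out.
 * Returns 0 on success, -EINVAL on short or malformed input; out is
 * only written when the whole address is valid.
 */
int parse_sas_addr(uint8_t out[SAS_ADDR_BYTES], const char *buf, size_t len)
{
	uint8_t tmp[SAS_ADDR_BYTES];
	size_t i;

	if (len < 2 * SAS_ADDR_BYTES)
		return -EINVAL;

	for (i = 0; i < SAS_ADDR_BYTES; i++) {
		int h = hex_nibble((unsigned char)buf[2 * i]);
		int l = hex_nibble((unsigned char)buf[2 * i + 1]);

		if (h < 0 || l < 0)
			return -EINVAL;
		tmp[i] = (uint8_t)((h << 4) | l);
	}

	memcpy(out, tmp, sizeof(tmp));
	return 0;
}
```

Parsing into a temporary buffer first means a malformed file cannot leave
a half-written address behind.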

[PATCH 2/2] aic94xx: Use sas_request_addr() to provide SAS addr if the adapter lacks one

2007-10-08 Thread Darrick J. Wong

If the aic94xx chip doesn't have a SAS address in the chip's flash memory,
make libsas get one for us.  Also clean out some old code that had been
used to do this in the past.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/aic94xx/aic94xx.h  |   16 
 drivers/scsi/aic94xx/aic94xx_hwi.c  |   21 ++---
 drivers/scsi/aic94xx/aic94xx_init.c |2 --
 3 files changed, 10 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx.h b/drivers/scsi/aic94xx/aic94xx.h
index 32f513b..aee235f 100644
--- a/drivers/scsi/aic94xx/aic94xx.h
+++ b/drivers/scsi/aic94xx/aic94xx.h
@@ -58,7 +58,6 @@
 
 extern struct kmem_cache *asd_dma_token_cache;
 extern struct kmem_cache *asd_ascb_cache;
-extern char sas_addr_str[2*SAS_ADDR_SIZE + 1];
 
 static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 {
@@ -68,21 +67,6 @@ static inline void asd_stringify_sas_addr(char *p, const u8 *sas_addr)
 	*p = '\0';
 }
 
-static inline void asd_destringify_sas_addr(u8 *sas_addr, const char *p)
-{
-	int i;
-	for (i = 0; i < SAS_ADDR_SIZE; i++) {
-		u8 h, l;
-		if (!*p)
-			break;
-		h = isdigit(*p) ? *p-'0' : *p-'A'+10;
-		p++;
-		l = isdigit(*p) ? *p-'0' : *p-'A'+10;
-		p++;
-		sas_addr[i] = (h<<4) | l;
-	}
-}
-
 struct asd_ha_struct;
 struct asd_ascb;
 
diff --git a/drivers/scsi/aic94xx/aic94xx_hwi.c b/drivers/scsi/aic94xx/aic94xx_hwi.c
index 0cd7eed..1dc5400 100644
--- a/drivers/scsi/aic94xx/aic94xx_hwi.c
+++ b/drivers/scsi/aic94xx/aic94xx_hwi.c
@@ -27,6 +27,7 @@
 #include <linux/pci.h>
 #include <linux/delay.h>
 #include <linux/module.h>
+#include <linux/firmware.h>
 
 #include "aic94xx.h"
 #include "aic94xx_reg.h"
@@ -38,16 +39,14 @@ u32 MBAR0_SWB_SIZE;
 
 /* -- Initialization -- */
 
-static void asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
+static int asd_get_user_sas_addr(struct asd_ha_struct *asd_ha)
 {
-	extern char sas_addr_str[];
-	/* If the user has specified a WWN it overrides other settings
-	 */
-	if (sas_addr_str[0] != '\0')
-		asd_destringify_sas_addr(asd_ha->hw_prof.sas_addr,
-					 sas_addr_str);
-	else if (asd_ha->hw_prof.sas_addr[0] != 0)
-		asd_stringify_sas_addr(sas_addr_str, asd_ha->hw_prof.sas_addr);
+	/* adapter came with a sas address */
+	if (asd_ha->hw_prof.sas_addr[0])
+		return 0;
+
+	return sas_request_addr(asd_ha->sas_ha.core.shost,
+				asd_ha->hw_prof.sas_addr);
 }
 
 static void asd_propagate_sas_addr(struct asd_ha_struct *asd_ha)
@@ -657,8 +657,7 @@ int asd_init_hw(struct asd_ha_struct *asd_ha)
 
 	asd_init_ctxmem(asd_ha);
 
-	asd_get_user_sas_addr(asd_ha);
-	if (!asd_ha->hw_prof.sas_addr[0]) {
+	if (asd_get_user_sas_addr(asd_ha)) {
 		asd_printk("No SAS Address provided for %s\n",
 			   pci_name(asd_ha->pcidev));
 		err = -ENODEV;
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index b70d6e7..5c99f27 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -54,8 +54,6 @@ MODULE_PARM_DESC(collector, "\n"
 	"\tThe aic94xx SAS LLDD supports both modes.\n"
 	"\tDefault: 0 (Direct Mode).\n");
 
-char sas_addr_str[2*SAS_ADDR_SIZE + 1] = "";
-
 static struct scsi_transport_template *aic94xx_transport_template;
 static int asd_scan_finished(struct Scsi_Host *, unsigned long);
 static void asd_scan_start(struct Scsi_Host *);

Re: [patch 16/17] mptbase: reset ioc initiator during PCI resume

2007-10-02 Thread Darrick J. Wong

On Tue, Oct 02, 2007 at 04:51:48PM -0600, Moore, Eric wrote:

> I replied to this thread a couple times last week, and no response from
> Darrick.   I doubt this is required becase the MESSAGE_UNIT_RESET is
> issued from inside mpt_do_ioc_recovery.  I need some logs with debug
> enabled.   Darrick did you see my email?

Yep.  Replied to it, too.  Apparently it never got to you, so I've
attached it below.

--D

-

On Thu, Sep 20, 2007 at 07:06:35PM -0600, Moore, Eric wrote:
> Darrick - MESSAGE_UNIT_RESET is already issued from inside
> mpt_do_ioc_recovery(), so you don't need to send this in advance of
> that.YOu will find that occuring from the function MakeIocReady.
> Anyways... would it be possible for you to enable debug logging so I can
> see what problem your having?   I suggest MPT_DEBUG and MPT_DEBUG_INIT.
> If its possible for you to manually load mptbase, that way you can set
> the command line option.

I took a look at MakeIocReady(), and this section caught my eye:

	/* Is it already READY? */
	if (!statefault && (ioc_state & MPI_IOC_STATE_MASK) == MPI_IOC_STATE_READY)
		return 0;

So I turned on a whole lot more debugging (mpt_debug_level=65535), and
caught this from the dhsprintk() just above that code snippet:

mptbase::MakeIocReady, ioc0 [raw] state=1000

state=1000 seems to correspond with MPI_IOC_STATE_READY, which means
that the adapter isn't getting reset because the chip claims to be
ready.  It doesn't seem to be ready, as demonstrated by the original error
message that I reported with the patch.  I'll append the log entries
pertaining to mpt to the end of this message.
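
To spell out why that early return skips the reset, here is a small
standalone C sketch of the same test.  The IOC_STATE_* values are my
assumption of what the MPI headers define (state in the top nibble of the
doorbell, READY = 0x10000000), and ioc_claims_ready() is a made-up name,
so treat it as an illustration rather than driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror MPI_IOC_STATE_* from the MPI headers. */
#define IOC_STATE_MASK   0xF0000000u
#define IOC_STATE_READY  0x10000000u
#define IOC_STATE_FAULT  0x40000000u

/*
 * Hypothetical helper: true when the doorbell's state field says READY
 * and no fault is flagged, i.e. the case where MakeIocReady() returns
 * early without issuing a reset.
 */
bool ioc_claims_ready(uint32_t doorbell)
{
	uint32_t state = doorbell & IOC_STATE_MASK;

	if (state == IOC_STATE_FAULT)
		return false;
	return state == IOC_STATE_READY;
}
```

Only the state field is examined, so a chip that reports READY after
resume is trusted even if the rest of its setup did not survive the
suspend.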

--D

(Driver sign-on message if you were curious)

[  164.467481] Fusion MPT base driver 3.04.05
[  164.471706] Copyright (c) 1999-2007 LSI Logic Corporation
[  164.492483] Fusion MPT SAS Host driver 3.04.05
[  167.066482] ACPI: PCI Interrupt :0c:03.0[A] - 6ACPI: PCI Interrupt 
:01:00.0[A] - GSI 16 (level, low) - IRQ 16
[  167.066534] mptbase: Initiating ioc0 bringup
[  167.761481] ioc0: LSISAS1064E B0: Capabilities={Initiator}
[  178.681050] scsi6 : ioc0: LSISAS1064E B0, FwRev=00060200h, Ports=1, 
MaxQ=511, IRQ=16
[  178.741821] scsi 6:0:0:0: Direct-Access IBM-ESXS GNA073C3ESTT0Z N BH0C 
PQ: 0 ANSI: 5
[  178.816476] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.825198] sd 6:0:0:0: [sda] Write Protect is off
[  178.830088] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.831204] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, 
supports DPO and FUA
[  178.845101] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.853483] sd 6:0:0:0: [sda] Write Protect is off
[  178.858343] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.859961] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, 
supports DPO and FUA
[  178.869069]  sda: sda1 sda2 sda3 sda4
[  178.877690] sd 6:0:0:0: [sda] Attached SCSI disk
[  178.912356] sd 6:0:0:0: Attached scsi generic sg0 type 0

(put system to sleep)

[  821.678155] mptbase: ioc0: pci-suspend: pdev=0x81003f64a000, 
slot=:01:00.0, Entering operating state [D3]
[  821.678195] mptbase: ioc0: Sending IOC reset(0x40)!
[  821.813585] mptbase: ioc0: WaitForDoorbell ACK (count=16)
[  821.814120] ACPI: PCI interrupt for device :01:00.0 disabled

(wake system up)

[  891.307583] mptbase: ioc0: pci-resume: pdev=0x81003f64a000, 
slot=:01:00.0, Previous operating state [D3]
[  891.431146] PM: Writing back config space on device :01:00.0 at offset 1 
(was 10, writing 100107)
[  891.431174] ACPI: PCI Interrupt :01:00.0[A] - GSI 16 (level, low) - 
IRQ 16
[  891.431179] mptbase: ioc0: pci-resume: ioc-state=0x1,doorbell=0x1000
[  891.431182] mptbase: Initiating ioc0 recovery
[  891.431184] mptbase::MakeIocReady, ioc0 [raw] state=1000
[  891.431187] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.723823] mptbase: ioc0: WaitForDoorbell INT (cnt=412) howlong=5
[  894.723826] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=412
[  894.723830] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.731815] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.731817] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=1
[  894.739806] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.747799] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.755791] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763781] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763784] mptbase: ioc0: Handshake request frame (@810028c81918) header
[  894.763786] mptbase: ioc0: HandShake request post done, WaitCnt=0
[  894.763789] mptbase: ioc0: WaitForDoorbell INT (cnt=0) howlong=5
[  894.771775] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.771778] mptbase: ioc0: WaitCnt=1 First handshake reply word=0300
[  894.779766] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.779769] mptbase: ioc0:

Re: [PATCH] aic94xx: fix SMP request DMA direction

2007-09-30 Thread Darrick J. Wong

On Sat, Sep 29, 2007 at 02:25:33AM -0400, Jeff Garzik wrote:
> Muli Ben-Yehuda wrote:
> > On Fri, Sep 28, 2007 at 04:55:34PM -0700, Darrick J. Wong wrote:
> > > On Thu, Sep 27, 2007 at 10:33:41PM -0400, Jeff Garzik wrote:
> > > > Unless I'm missing something, the SMP request goes /to/ the PCI device
> > > > :)
> > > >
> > > > Signed-off-by: Jeff Garzik [EMAIL PROTECTED]
> > > ACK; builds ok and SMP commands seem to work ok (not that they
> > > didn't before).
> > Could this explain some weirdness we were seeing with aic94xx and
> > Calgary/CalIOC2 enabled, or are SMP commands not likely to be used in
> > normal operation? We map the IOMMU entries differently for FROMDEVICE
> > (RW) and TODEVICE(RO).
>
> SMP == scsi management == not used during normal data transfer.
>
> It could certainly explain flakiness if you have expanders, though

Actually, SMP commands are used during device discovery to find things
attached to expanders, so it seems likely that the "it blows up almost
immediately after loading the module" symptoms are a result of this bug.

That said, the bug that Jeff fixed resulted in extra permissions (+w)
being set for the SMP request buffer, so that's probably why I've never
seen any problems manifesting on x260/x3800 systems.

(Unless the CalIOC2 has a write only mode?)
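
For illustration, here is a minimal userspace sketch of the permission
argument above.  It is not Calgary or aic94xx code: the enum, the
permission bits, and the helper names are all hypothetical, and the
mapping policy (FROMDEVICE mapped read/write, TODEVICE read-only) is
simply taken from Muli's description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PCI DMA direction constants. */
enum dma_dir { DIR_TODEVICE, DIR_FROMDEVICE };

/* Device-side access rights on an IOMMU mapping (made-up bit values). */
#define PERM_DEV_READ  0x1u	/* device may read the buffer */
#define PERM_DEV_WRITE 0x2u	/* device may write the buffer */

/*
 * Assumed policy, per the thread: FROMDEVICE buffers are mapped RW,
 * TODEVICE buffers are mapped RO.
 */
unsigned int iommu_perms_for(enum dma_dir dir)
{
	if (dir == DIR_FROMDEVICE)
		return PERM_DEV_READ | PERM_DEV_WRITE;
	return PERM_DEV_READ;
}

/* An SMP request only needs to be read by the device. */
bool device_read_allowed(enum dma_dir dir)
{
	return (iommu_perms_for(dir) & PERM_DEV_READ) != 0;
}

bool device_write_allowed(enum dma_dir dir)
{
	return (iommu_perms_for(dir) & PERM_DEV_WRITE) != 0;
}
```

Under that assumed policy the old FROMDEVICE direction still lets the
device read the request buffer, which fits with the bug staying hidden;
only an IOMMU that granted write-only access would break.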

--D
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] aic94xx: fix SMP request DMA direction

2007-09-28 Thread Darrick J. Wong

On Thu, Sep 27, 2007 at 10:33:41PM -0400, Jeff Garzik wrote:
 
 Unless I'm missing something, the SMP request goes /to/ the PCI device :)
 
 Signed-off-by: Jeff Garzik [EMAIL PROTECTED]

ACK; builds ok and SMP commands seem to work ok (not that they didn't
before).

--Darrick

 ---
  drivers/scsi/aic94xx/aic94xx_task.c |4 -
  2 files changed, 83 insertions(+), 17 deletions(-)
 
 diff --git a/drivers/scsi/aic94xx/aic94xx_task.c 
 b/drivers/scsi/aic94xx/aic94xx_task.c
 index d5d8cab..ab13824 100644
 --- a/drivers/scsi/aic94xx/aic94xx_task.c
 +++ b/drivers/scsi/aic94xx/aic94xx_task.c
 @@ -451,7 +451,7 @@ static int asd_build_smp_ascb(struct asd_ascb *ascb, 
 struct sas_task *task,
   struct scb *scb;
 
   pci_map_sg(asd_ha->pcidev, &task->smp_task.smp_req, 1,
 -PCI_DMA_FROMDEVICE);
 +PCI_DMA_TODEVICE);
   pci_map_sg(asd_ha->pcidev, &task->smp_task.smp_resp, 1,
  PCI_DMA_FROMDEVICE);
 
 @@ -486,7 +486,7 @@ static void asd_unbuild_smp_ascb(struct asd_ascb *a)
 
   BUG_ON(!task);
   pci_unmap_sg(a->ha->pcidev, &task->smp_task.smp_req, 1,
 -  PCI_DMA_FROMDEVICE);
 +  PCI_DMA_TODEVICE);
   pci_unmap_sg(a->ha->pcidev, &task->smp_task.smp_resp, 1,
    PCI_DMA_FROMDEVICE);
  }

Re: [PATCH] mptbase: Reset ioc initiator during PCI resume

2007-09-24 Thread Darrick J. Wong

On Thu, Sep 20, 2007 at 07:06:35PM -0600, Moore, Eric wrote:
 Darrick - MESSAGE_UNIT_RESET is already issued from inside
 mpt_do_ioc_recovery(), so you don't need to send this in advance of
 that.  You will find that occurring from the function MakeIocReady.
 Anyways... would it be possible for you to enable debug logging so I can
 see what problem you're having?  I suggest MPT_DEBUG and MPT_DEBUG_INIT.
 If it's possible for you to manually load mptbase, that way you can set
 the command line option.

I took a look at MakeIocReady(), and this section caught my eye:

/* Is it already READY? */
if (!statefault && (ioc_state & MPI_IOC_STATE_MASK) == MPI_IOC_STATE_READY)
	return 0;

So I turned on a whole lot more debugging (mpt_debug_level=65535), and
caught this from the dhsprintk() just above that code snippet:

mptbase::MakeIocReady, ioc0 [raw] state=1000

state=1000 seems to correspond with MPI_IOC_STATE_READY, which means
that the adapter isn't getting reset because the chip claims to be
ready.  It doesn't seem to be ready, as demonstrated by the original error
message that I reported with the patch.  I'll append the log entries
pertaining to mpt to the end of this message.
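
A minimal userspace sketch of that early exit, for illustration only.
The constant values are as recalled from the MPI headers and should be
treated as assumptions, the statefault handling is simplified, and the
function names would_reset and ioc_state_field are hypothetical.  It
models a doorbell word whose state field is READY, which is what
ioc-state=0x1 in the pci-resume message suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; check them against the real MPI headers. */
#define MPI_IOC_STATE_MASK   0xF0000000u
#define MPI_IOC_STATE_SHIFT  28
#define MPI_IOC_STATE_READY  0x10000000u
#define MPI_IOC_STATE_FAULT  0x40000000u

/*
 * Hypothetical model of the check in MakeIocReady(): returns 0 when the
 * function would return early without resetting the chip, 1 otherwise.
 */
int would_reset(uint32_t doorbell)
{
	uint32_t state = doorbell & MPI_IOC_STATE_MASK;
	int statefault = (state == MPI_IOC_STATE_FAULT);

	/* Is it already READY? */
	if (!statefault && state == MPI_IOC_STATE_READY)
		return 0;
	return 1;
}

/* Mirrors the shifted "ioc-state=" value printed on resume. */
uint32_t ioc_state_field(uint32_t doorbell)
{
	return (doorbell & MPI_IOC_STATE_MASK) >> MPI_IOC_STATE_SHIFT;
}
```

With a READY doorbell word, would_reset() returns 0, so nothing in this
path forces a reset even if the chip is not actually usable.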

--D

(Driver sign-on message if you were curious)

[  164.467481] Fusion MPT base driver 3.04.05
[  164.471706] Copyright (c) 1999-2007 LSI Logic Corporation
[  164.492483] Fusion MPT SAS Host driver 3.04.05
[  167.066482] ACPI: PCI Interrupt :0c:03.0[A] - 6ACPI: PCI Interrupt 
:01:00.0[A] - GSI 16 (level, low) - IRQ 16
[  167.066534] mptbase: Initiating ioc0 bringup
[  167.761481] ioc0: LSISAS1064E B0: Capabilities={Initiator}
[  178.681050] scsi6 : ioc0: LSISAS1064E B0, FwRev=00060200h, Ports=1, 
MaxQ=511, IRQ=16
[  178.741821] scsi 6:0:0:0: Direct-Access IBM-ESXS GNA073C3ESTT0Z N BH0C 
PQ: 0 ANSI: 5
[  178.816476] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.825198] sd 6:0:0:0: [sda] Write Protect is off
[  178.830088] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.831204] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, 
supports DPO and FUA
[  178.845101] sd 6:0:0:0: [sda] 143374000 512-byte hardware sectors (73407 MB)
[  178.853483] sd 6:0:0:0: [sda] Write Protect is off
[  178.858343] sd 6:0:0:0: [sda] Mode Sense: d3 00 10 08
[  178.859961] sd 6:0:0:0: [sda] Write cache: disabled, read cache: enabled, 
supports DPO and FUA
[  178.869069]  sda: sda1 sda2 sda3 sda4
[  178.877690] sd 6:0:0:0: [sda] Attached SCSI disk
[  178.912356] sd 6:0:0:0: Attached scsi generic sg0 type 0

(put system to sleep)

[  821.678155] mptbase: ioc0: pci-suspend: pdev=0x81003f64a000, 
slot=:01:00.0, Entering operating state [D3]
[  821.678195] mptbase: ioc0: Sending IOC reset(0x40)!
[  821.813585] mptbase: ioc0: WaitForDoorbell ACK (count=16)
[  821.814120] ACPI: PCI interrupt for device :01:00.0 disabled

(wake system up)

[  891.307583] mptbase: ioc0: pci-resume: pdev=0x81003f64a000, 
slot=:01:00.0, Previous operating state [D3]
[  891.431146] PM: Writing back config space on device :01:00.0 at offset 1 
(was 10, writing 100107)
[  891.431174] ACPI: PCI Interrupt :01:00.0[A] - GSI 16 (level, low) - 
IRQ 16
[  891.431179] mptbase: ioc0: pci-resume: ioc-state=0x1,doorbell=0x1000
[  891.431182] mptbase: Initiating ioc0 recovery
[  891.431184] mptbase::MakeIocReady, ioc0 [raw] state=1000
[  891.431187] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.723823] mptbase: ioc0: WaitForDoorbell INT (cnt=412) howlong=5
[  894.723826] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=412
[  894.723830] mptbase: ioc0: Sending get IocFacts request req_sz=12 reply_sz=80
[  894.731815] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.731817] mptbase: ioc0: HandShake request start reqBytes=12, WaitCnt=1
[  894.739806] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.747799] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.755791] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763781] mptbase: ioc0: WaitForDoorbell ACK (count=0)
[  894.763784] mptbase: ioc0: Handshake request frame (@810028c81918) header
[  894.763786] mptbase: ioc0: HandShake request post done, WaitCnt=0
[  894.763789] mptbase: ioc0: WaitForDoorbell INT (cnt=0) howlong=5
[  894.771775] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.771778] mptbase: ioc0: WaitCnt=1 First handshake reply word=0300
[  894.779766] mptbase: ioc0: WaitForDoorbell INT (cnt=1) howlong=5
[  894.779769] mptbase: ioc0: Got Handshake reply:
[  894.779770] mptbase: ioc0: WaitForDoorbell REPLY WaitCnt=1 (sz=1)
[  894.779772] mptbase: ioc0: HandShake reply count=1
[  894.779775] mptbase: ioc0: ERROR - Invalid IOC facts reply, msgLength=0 
offsetof=6!
repeat

[PATCH] mptbase: Reset ioc initiator during PCI resume

2007-09-20 Thread Darrick J. Wong

It appears that the LSI SAS 1064E chip needs to be reset after a
suspend/resume cycle before the driver attempts further communications with
the chip.  Without this patch, resuming the chip results in this error
message being printed repeatedly and no more disk I/O.

mptbase: ioc0: ERROR - Invalid IOC facts reply, msgLength=0 offsetof=6!

So far it seems to fix suspend/resume on all the MPT Fusion cards I have
(SAS and U320 SCSI) but since I don't know the internals of that chip I
can't say for sure if this is a proper fix.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/message/fusion/mptbase.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/message/fusion/mptbase.c b/drivers/message/fusion/mptbase.c
index 414c109..97895bd 100644
--- a/drivers/message/fusion/mptbase.c
+++ b/drivers/message/fusion/mptbase.c
@@ -1772,6 +1772,12 @@ mpt_resume(struct pci_dev *pdev)
(mpt_GetIocState(ioc, 1) >> MPI_IOC_STATE_SHIFT),
CHIPREG_READ32(&ioc->chip->Doorbell));
 
+   /* put ioc into READY_STATE */
+   if(SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP)) {
+   printk(MYIOC_s_ERR_FMT
+   "pci-resume:  IOC msg unit reset failed!\n", ioc->name);
+   }
+
/* bring ioc to operational state */
if ((recovery_state = mpt_do_ioc_recovery(ioc,
MPT_HOSTEVENT_IOC_RECOVER, CAN_SLEEP)) != 0) {

[PATCH] Clean up IOC reset code to obey coding style

2007-09-20 Thread Darrick J. Wong

Randy Dunlap scolded me for introducing poorly styled code.  Since it
was a copy-and-paste block from mpt_suspend(), fix both.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/message/fusion/mptbase.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/message/fusion/mptbase.c b/drivers/message/fusion/mptbase.c
index 40b8b41..2952a54 100644
--- a/drivers/message/fusion/mptbase.c
+++ b/drivers/message/fusion/mptbase.c
@@ -1721,10 +1721,9 @@ mpt_suspend(struct pci_dev *pdev, pm_message_t state)
pci_save_state(pdev);
 
/* put ioc into READY_STATE */
-   if(SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP)) {
+   if (SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP))
printk(MYIOC_s_ERR_FMT
"pci-suspend:  IOC msg unit reset failed!\n", ioc->name);
-   }
 
/* disable interrupts */
CHIPREG_WRITE32(&ioc->chip->IntMask, 0x);
@@ -1773,10 +1772,9 @@ mpt_resume(struct pci_dev *pdev)
CHIPREG_READ32(&ioc->chip->Doorbell));
 
/* put ioc into READY_STATE */
-   if(SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP)) {
+   if (SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, CAN_SLEEP))
printk(MYIOC_s_ERR_FMT
"pci-resume:  IOC msg unit reset failed!\n", ioc->name);
-   }
 
/* bring ioc to operational state */
if ((recovery_state = mpt_do_ioc_recovery(ioc,

[PATCH] libsas: SMP request handler shouldn't crash when rphy is NULL

2007-07-24 Thread Darrick J. Wong

sas_smp_handler crashes when smp utils are used with an aic94xx host
because certain devices (the sas_host itself, specifically) lack rphy
structures.  No rphy means no SMP target support, but we shouldn't crash
here.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_expander.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_expander.c 
b/drivers/scsi/libsas/sas_expander.c
index b500f0c..8603ae6 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -1879,7 +1879,7 @@ int sas_smp_handler(struct Scsi_Host *shost, struct 
sas_rphy *rphy,
struct request *req)
 {
struct domain_device *dev;
-   int ret, type = rphy->identify.device_type;
+   int ret, type;
struct request *rsp = req->next_rq;
 
if (!rsp) {
@@ -1888,12 +1888,13 @@ int sas_smp_handler(struct Scsi_Host *shost, struct 
sas_rphy *rphy,
return -EINVAL;
}
 
-   /* seems aic94xx doesn't support */
+   /* no rphy means no smp target support (ie aic94xx host) */
if (!rphy) {
printk("%s: can we send a smp request to a host?\n",
   __FUNCTION__);
return -EINVAL;
}
+   type = rphy->identify.device_type;
 
if (type != SAS_EDGE_EXPANDER_DEVICE &&
type != SAS_FANOUT_EXPANDER_DEVICE) {
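
The reordering in this patch can be sketched in userspace as follows.
This is only an illustration under stated assumptions: the struct, the
enum values, and smp_handler_stub are hypothetical stand-ins for the
libsas types, and the -EINVAL return for non-expander devices is
assumed rather than taken from the hunk above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the libsas structures. */
enum dev_type { END_DEVICE = 1, EDGE_EXPANDER = 2, FANOUT_EXPANDER = 3 };

struct rphy_stub {
	enum dev_type device_type;
};

/*
 * Reads device_type only after rphy has been checked, so a host with no
 * rphy gets -EINVAL instead of a NULL pointer dereference.
 */
int smp_handler_stub(const struct rphy_stub *rphy)
{
	int type;

	if (!rphy)
		return -EINVAL;	/* no rphy: no SMP target support */

	type = rphy->device_type;
	if (type != EDGE_EXPANDER && type != FANOUT_EXPANDER)
		return -EINVAL;	/* assumed: only expanders take SMP */

	return 0;
}
```

The point is simply that the initializer in the declaration ran before
the NULL check, so the check could never protect it.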

Re: [PATCH] dtc: Coding police and printk levels

2007-06-22 Thread Darrick J. Wong

On Fri, Jun 22, 2007 at 02:26:29PM +0100, Alan Cox wrote:
 @@ -244,7 +242,7 @@
   if (check_signature(base + signatures[sig].offset, signatures[sig].string, strlen(signatures[sig].string))) {
   addr = bases[current_base].address;
  #if (DTCDEBUG & DTCDEBUG_INIT)
 - printk("scsi-dtc : detected board.\n");
 + printk(KERB_DEBUG "scsi-dtc : detected board.\n");

I think you meant KERN_DEBUG ?

--D

Re: Patch added to scsi-pending-2.6: [SCSI] libsas: convert to use the data buffer accessors

2007-05-29 Thread Darrick J. Wong

On Sun, May 27, 2007 at 05:37:43PM +, James Bottomley wrote:
 [SCSI] libsas: convert to use the data buffer accessors
<snip>
 This patch is pending because it requires ACKs from:
 
 Darrick J. Wong [EMAIL PROTECTED]

ACK.

--D

[PATCH] aic94xx: asd_clear_nexus should fail if the cleared task does not complete

2007-05-16 Thread Darrick J. Wong

Every so often, the driver will call asd_clear_nexus to clean out a task.
It is supposed to be the case that the CLEAR NEXUS does not go on the done
list until after the task itself has been put on the done list, but for
some reason this doesn't always happen.  Thus, the
wait_for_completion_timeout call times out, and we return success.  This
makes libsas free the task even though the task hasn't completed, leading
to a BUG_ON message from aic94xx_hwi.c around line 341.  We should return
failure from asd_clear_nexus so that libsas tries again; at a bare minimum
it shouldn't be freeing active tasks.  I _think_ this will fix one of
the SCB timeout crash problems (though I've not been able to reproduce
it lately...)

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/aic94xx/aic94xx_tmf.c |   14 ++
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c 
b/drivers/scsi/aic94xx/aic94xx_tmf.c
index 9a14a6d..c0d0b7d 100644
--- a/drivers/scsi/aic94xx/aic94xx_tmf.c
+++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
@@ -290,6 +290,7 @@ static void asd_tmf_tasklet_complete(str
 static inline int asd_clear_nexus(struct sas_task *task)
 {
int res = TMF_RESP_FUNC_FAILED;
+   int leftover;
struct asd_ascb *tascb = task->lldd_task;
unsigned long flags;
 
@@ -298,10 +299,12 @@ static inline int asd_clear_nexus(struct
res = asd_clear_nexus_tag(task);
else
res = asd_clear_nexus_index(task);
-   wait_for_completion_timeout(&tascb->completion,
-   AIC94XX_SCB_TIMEOUT);
+   leftover = wait_for_completion_timeout(&tascb->completion,
+  AIC94XX_SCB_TIMEOUT);
ASD_DPRINTK("came back from clear nexus\n");
spin_lock_irqsave(&task->task_state_lock, flags);
+   if (leftover < 1)
+   res = TMF_RESP_FUNC_FAILED;
if (task->task_state_flags & SAS_TASK_STATE_DONE)
res = TMF_RESP_FUNC_COMPLETE;
spin_unlock_irqrestore(&task->task_state_lock, flags);
@@ -350,6 +353,7 @@ int asd_abort_task(struct sas_task *task
unsigned long flags;
struct asd_ascb *ascb = NULL;
struct scb *scb;
+   int leftover;
 
spin_lock_irqsave(&task->task_state_lock, flags);
if (task->task_state_flags & SAS_TASK_STATE_DONE) {
@@ -455,9 +459,11 @@ int asd_abort_task(struct sas_task *task
break;
case TF_TMF_TASK_DONE + 0xFF00: /* done but not reported yet */
res = TMF_RESP_FUNC_FAILED;
-   wait_for_completion_timeout(&tascb->completion,
-   AIC94XX_SCB_TIMEOUT);
+   leftover = wait_for_completion_timeout(&tascb->completion,
+  AIC94XX_SCB_TIMEOUT);
spin_lock_irqsave(&task->task_state_lock, flags);
+   if (leftover < 1)
+   res = TMF_RESP_FUNC_FAILED;
if (task->task_state_flags & SAS_TASK_STATE_DONE)
res = TMF_RESP_FUNC_COMPLETE;
spin_unlock_irqrestore(&task->task_state_lock, flags);
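
The decision logic this patch adds can be sketched outside the kernel as
below.  It is a simplified model, not the driver code: the function
clear_nexus_result and the TASK_STATE_DONE bit are hypothetical, the TMF
response values are placeholders, and the only kernel behaviour relied
on is that wait_for_completion_timeout() returns 0 when it times out.

```c
#include <assert.h>

/* Placeholder response codes for the sketch. */
#define TMF_RESP_FUNC_COMPLETE 0x00
#define TMF_RESP_FUNC_FAILED   0x05

#define TASK_STATE_DONE 0x2u	/* hypothetical flag bit */

/*
 * issue_res:   result of issuing the CLEAR NEXUS itself
 * leftover:    value returned by the timed wait (0 means it timed out)
 * state_flags: task state flags, sampled under the task lock
 */
int clear_nexus_result(int issue_res, unsigned long leftover,
		       unsigned int state_flags)
{
	int res = issue_res;

	if (leftover < 1)
		res = TMF_RESP_FUNC_FAILED;	/* timed out: not a success */
	if (state_flags & TASK_STATE_DONE)
		res = TMF_RESP_FUNC_COMPLETE;	/* task really finished */
	return res;
}
```

A timeout with the task still not done now reports failure, so the
caller has a chance to retry instead of freeing a live task.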

Re: Kernel crash with AIC94xx (one step forward, hope it's lucky)

2007-05-01 Thread Darrick J. Wong

Constantin Teodorescu wrote:

 03:02:15 kernel: [ cut here ]
 03:02:15 kernel: kernel BUG at drivers/scsi/aic94xx/aic94xx_hwi.h:354!

On the odd chance you still have this controller (and have the time to
test out patches), would you mind applying this patch:

http://sweaglesw.net/~djwong/docs/17-aic94xx-hwi-bugon_1.patch

and reporting back to me what happens?

--D

Re: [PATCH] aacraid: superfluous adapter reset for IBM 8 series ServeRAID controllers

2007-05-01 Thread Darrick J. Wong

Salyzyn, Mark wrote:
 The kexec patch introduced a superfluous (and otherwise inert) reset of
 some adapters. The register can have a hardware default value that has
 zeros for the undefined interrupts. This patch refines the test of the
 interrupt enable register to focus on only the interrupts that affect
 the driver in order to detect if an incomplete shutdown of the Adapter
 had occurred (kdump).

Tests out ok on the affected machines, so:

Acked-by: Darrick J. Wong [EMAIL PROTECTED]

--D

Re: [PATCH] aacraid: Initialize rx/rkt function pointers before calling them

2007-04-27 Thread Darrick J. Wong

Salyzyn, Mark wrote:

 In my unit tests of aacraid_kexec_5.patch, restart was not called for
 normal operations. If you are just doing a normal boot, what conditions
 are causing restart to be called in your case? Is it a warm restart?
 Some kind of operation that leaves the Adapter in an initialized state,
 or a bug in the driver making sure that interrupts are disabled when
 shut down. Inquiring minds want to know!

This is a normal boot of a Serveraid 8k-l on an IBM x3550.  One
wrinkle in the configuration is that the system is booted off the
network, though I don't see how that would affect the aacraid's state.
It looks like the MUnit.OIMR test just after the "Failure to reset here
is an option..." comment is succeeding.  The crash seems to happen
regardless of whether we had just done a warm or cold boot.  The option
ROM had run during POST, if that makes any difference.  No kexec/kdump
has been configured.  For that matter, neither kexec nor kdump has
ever been run in the lifetime of the machine.

Also observed on an IBM x3650.

--D

Re: [PATCH] aacraid: Initialize rx/rkt function pointers before calling them

2007-04-27 Thread Darrick J. Wong

Salyzyn, Mark wrote:
 As an option for a patch (later), what was the actual value of the
 Munit.OIMR register (on the x3550 and the x3650 please, just in case)?

0xF.

--D

Re: Kernel crash with AIC94xx (one step forward, hope it's lucky)

2007-04-26 Thread Darrick J. Wong

Constantin Teodorescu wrote:

 So ... should I ask for other controller quotation ?
 Could you recommend me a good SAS controller, with 8 internal ports,
 supporting Linux , with 99.% reliability ? :-)
 
 I have the following options : Intel® RAID Controller SRCSAS18E
 (Parowan)  and   LSI MegaRAID SAS 8408E
 
 so ... your bet ? :-)

I don't know anything about either of those controllers, though the LSI
1068E has worked quite reliably for me.  I decline to make any
statements about 99.% reliability, however.

--D


[PATCH] aacraid: Initialize rx/rkt function pointers before calling them

2007-04-26 Thread Darrick J. Wong

Commit 8418852d11f0bbaeebeedd4243560d8fdc85410d to scsi-misc resulted in
the substitution of calls to rx_sync_cmd with a function pointer
abstraction.  aac_rx_restart_adapter requires a pointer to a sync_cmd
function, which is not set up before its first invocation.  That causes
the driver to crash at startup.  Move the initializers (we need both
rx_sync_cmd and enable_int pointers) further up to precede the
restart_adapter call.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/aacraid/rx.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
index 0c71315..b7810d6 100644
--- a/drivers/scsi/aacraid/rx.c
+++ b/drivers/scsi/aacraid/rx.c
@@ -537,6 +537,8 @@ int _aac_rx_init(struct aac_dev *dev)
printk(KERN_WARNING "%s: unable to map adapter.\n", name);
goto error_iounmap;
}
+   dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
+   aac_adapter_comm(dev, AAC_COMM_PRODUCER);
 
/* Failure to reset here is an option ... */
dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
@@ -598,7 +600,6 @@ int _aac_rx_init(struct aac_dev *dev)
dev->a_ops.adapter_interrupt = aac_rx_interrupt_adapter;
dev->a_ops.adapter_disable_int = aac_rx_disable_interrupt;
dev->a_ops.adapter_notify = aac_rx_notify_adapter;
-   dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
dev->a_ops.adapter_check_health = aac_rx_check_health;
dev->a_ops.adapter_restart = aac_rx_restart_adapter;
 
@@ -606,7 +607,6 @@ int _aac_rx_init(struct aac_dev *dev)
 *  First clear out all interrupts.  Then enable the one's that we
 *  can handle.
 */
-   aac_adapter_comm(dev, AAC_COMM_PRODUCER);
aac_adapter_disable_int(dev);
rx_writel(dev, MUnit.ODR, 0x);
aac_adapter_enable_int(dev);
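
The ordering problem can be sketched in userspace like this.  It is an
illustrative model only: the structs and functions are hypothetical
stand-ins for the aacraid ones, and a -1 return stands in for the crash
that a call through an unset function pointer would cause.

```c
#include <assert.h>
#include <stddef.h>

struct adapter;

/* Hypothetical ops table with a single hook. */
struct adapter_ops {
	int (*sync_cmd)(struct adapter *dev);
};

struct adapter {
	struct adapter_ops ops;
	int needs_restart;
};

int stub_sync_cmd(struct adapter *dev)
{
	(void)dev;
	return 0;
}

/* Calls through the ops table; -1 models the crash on a NULL hook. */
int restart_adapter(struct adapter *dev)
{
	if (!dev->ops.sync_cmd)
		return -1;
	return dev->ops.sync_cmd(dev);
}

/* init_ops_first != 0 models the patched ordering. */
int adapter_init(struct adapter *dev, int init_ops_first)
{
	int rc = 0;

	if (init_ops_first)
		dev->ops.sync_cmd = stub_sync_cmd;

	if (dev->needs_restart)
		rc = restart_adapter(dev);

	dev->ops.sync_cmd = stub_sync_cmd;	/* original, late assignment */
	return rc;
}
```

Setting the hook before the restart path runs is the whole fix; the
late assignment in the sketch just mirrors where it used to live.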

Re: aic94xx driver woes

2007-03-31 Thread Darrick J. Wong

Douglas Gilbert wrote:

 So that is almost 12 months that I have been reporting
 this driver as broken. Is it just me or my hardware?

I seem to recall you saying that the LSI Fusion card was plugged into
the same expander as the 48300?  If so, does unplugging the Fusion card
from the expander make it work?

 aic94xx: Found sequencer Firmware version 1.1 (V17/10c6)

Have you tried the V30 sequencer?

--D

[PATCH 1/2] sas_ata: Rename ata_queued_cmd->lldd_task to driver_data

2007-02-22 Thread Darrick J. Wong

Per Tejun's request, rename the lldd_task field and add comments about it.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_ata.c |8 
 include/linux/libata.h|4 +++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 2db2589..c92f4b6 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -122,7 +122,7 @@ static void sas_ata_task_done(struct sas
}
}
 
-   qc->lldd_task = NULL;
+   qc->driver_data = NULL;
if (qc->scsicmd)
ASSIGN_SAS_TASK(qc->scsicmd, NULL);
ata_qc_complete(qc);
@@ -192,7 +192,7 @@ static unsigned int sas_ata_qc_issue(str
task->scatter = qc->__sg;
task->ata_task.retry_count = 1;
task->task_state_flags = SAS_TASK_STATE_PENDING;
-   qc->lldd_task = task;
+   qc->driver_data = task;
 
switch (qc->tf.protocol) {
case ATA_PROT_NCQ:
@@ -276,10 +276,10 @@ static void sas_ata_post_internal(struct
 * bother with sas_ata_task_done.  But we still
 * ought to abort the task.
 */
-   struct sas_task *task = qc->lldd_task;
+   struct sas_task *task = qc->driver_data;
unsigned long flags;
 
-   qc->lldd_task = NULL;
+   qc->driver_data = NULL;
if (task) {
/* Should this be a AT(API) device reset? */
spin_lock_irqsave(&task->task_state_lock, flags);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index a20646c..a8eafc7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -445,7 +445,9 @@ struct ata_queued_cmd {
ata_qc_cb_t complete_fn;
 
void *private_data;
-   void *lldd_task;
+
+   /* This is owned by a low level libata client */
+   void *driver_data;
 };
 
 struct ata_port_stats {

[PATCH 2/2]: sas_ata: Don't reset the phy in post_internal_command

2007-02-22 Thread Darrick J. Wong

We don't need to reset the SAS phy in sas_ata_post_internal; all
that is necessary is to clear out the task from the SAS HA.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_ata.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index c92f4b6..d91c5ba 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -281,11 +281,6 @@ static void sas_ata_post_internal(struct
 
qc->driver_data = NULL;
if (task) {
-   /* Should this be a AT(API) device reset? */
-   spin_lock_irqsave(&task->task_state_lock, flags);
-   task->task_state_flags |= SAS_TASK_NEED_DEV_RESET;
-   spin_unlock_irqrestore(&task->task_state_lock, flags);
-
task->uldd_task = NULL;
__sas_task_abort(task);
}

Re: Please help if u can.

2007-02-21 Thread Darrick J. Wong

John Scarpa wrote:
 First a very big thanks to all of u! I have been suffering a serious
 lack of sleep problem lately..  i should have noticed that by whom has
 been submitting the past 500 fixes and updates!
 
 Quick question, is the driver still consider experimental??

Very much so.  The SAS bits are fairly stable nowadays, but the rest is
still YMWV. :)

 the guys i
 work with say it doesn't support sata drives and it's still experimental

SATA support is under development.  Patches exist in the git tree here:
http://www.kernel.org/git/?p=linux/kernel/git/jejb/aic94xx-sas-2.6.git;a=summary

 so don't use it.  And i can't find anything on the state of this driver.
 
 PS.  I should have said i dropped that aic94xx-seq.fw in
 /lib,/lib/firmware,/lib64,/lib64/firmware  (still have yet to get this
 sucker to work)

Yes, you need a udev that's new enough to know how to handle the
firmware loading interface.  Typically, udev will load firmware from
/lib/firmware, though I suppose that depends on the distribution.  Not
sure if RH/Fedora support fw loading, newer Ubuntu-E and SuSE do...

--D

Re: BUG in libata from ata_sas_port_alloc

2007-02-15 Thread Darrick J. Wong

James Bottomley wrote:

 The problem is that memory obtained by devm_kzalloc() cannot be returned
 by kfree() ... they come from different allocation lists.  The solution
 is probably to have a corresponding ata_probe_ent_free(), I just don't
 exactly see how to tell if the object came from the devm_kzalloc or not
 (unless it gets marked).

Just a shot in the dark, but could we simply make whatever changes are
necessary to make all sas-ata LLDDs managed and then use devm_kzalloc?
Though (and I may be totally wrong here) if it's the case that
devres_head is made (or not made) to be part of a list _only_ before we
reach ata_probe_ent_alloc, we could put a similar if check into the free
function.
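
One way to picture the "mark it" idea is the userspace sketch below.  It
is purely hypothetical: probe_ent, its managed flag, and both helpers
are invented for illustration, calloc/free stand in for the kernel
allocators, and the managed pool itself is not modelled (a counter just
records entries left for it).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical probe entry that remembers how it was allocated. */
struct probe_ent {
	int managed;	/* 1: owned by a managed pool, 0: plain heap */
};

/* Counts entries left for the (unmodelled) managed pool to release. */
int deferred_to_pool;

struct probe_ent *probe_ent_alloc(int managed)
{
	struct probe_ent *ent = calloc(1, sizeof(*ent));

	if (ent)
		ent->managed = managed;
	return ent;
}

/* Returns 1 if the entry was freed here, 0 if it was left to the pool. */
int probe_ent_free(struct probe_ent *ent)
{
	if (!ent)
		return 0;
	if (ent->managed) {
		deferred_to_pool++;	/* the pool would free it later */
		return 0;
	}
	free(ent);
	return 1;
}
```

The free path then mirrors whatever check the alloc path used, so the
two can never disagree about which allocator owns the object.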

--D

Re: Any multipath SAS support in Linux?

2007-02-14 Thread Darrick J. Wong

Orion Poplawski wrote:
 I'm thinking about trying to setup a two node HA storage cluster
 connected to an external SAS box.  Is such a thing possible at this time?

I've had success with aic94xx + dm_multipath before.  There has recently
been a bug in the multipath tools wherein it fails to detect the disk type
due to the removal of a bus attribute in sysfs, and I don't know if
that's been fixed (aside from building your own with the sysfs part
removed).

Note that success == I set it up, started I/O, yanked some cables, and
it kept chugging. :)

--D

Re: aic79xx noise on hot insert in 2.6.19.x and 2.6.20

2007-02-07 Thread Darrick J. Wong

Mark Rustad wrote:
 I have systems with Supermicro X6-class (Nocona/Lindenhurst)
 motherboards with Adaptec SCSI and SAFTE backplanes running software
 RAID-1 (md) on a pair of drives. When I hot-insert a drive, I get a lot
 of noise from the kernel apparently due to lack of handling something in
 the interrupt routine. So far, life seems to go on after the event, but
 not knowing anything about the internals of the driver, I am concerned
 enough to want to ask about it.

I see the noise too, though I don't know enough about aic79xx (and am
too busy with aic94xx) to do anything about it.  As far as I can tell,
the driver's just being a little trigger happy with the DUMP CARD
STATE routine.  But that's my totally unqualified opinion. :)

--D

Re: [PATCH] scsi: Update Aic94xx SAS/SATA Linux open source devicedriver for new sequence firmware.

2007-02-07 Thread Darrick J. Wong

Wu, Gilbert wrote:
 Hi James,
 
   We are investigating this issue here. We will update the status when
 we can duplicate the problem here and root cause.

FWIW,

v17 looks good for both SAS/SATA load testing.  The 24-disk x260 seems
to have crapped out after about 800 rounds of load/unload due to the
phys reporting devices, then no devices about 10s later, and then having
the module unloaded before the dead SAS commands finished returning.
Not sure what that's about, though I might also have borked the x260 :(

Though who really is going to reboot the machine 800 times in rapid-fire
succession???

(I'm not trying to slam the v28 sequencer, I'm merely providing a
baseline for comparison between the two.  It may very well be the case
that all the bugs we used to observe with v17 were merely a result of us
poking the sequencer the wrong way)

--D

Re: [PATCH 12/12] sas_ata: Make this a module separate from libsas

2007-02-04 Thread Darrick J. Wong

James Bottomley wrote:
 On Tue, 2007-01-30 at 01:19 -0800, Darrick J. Wong wrote:
 Break out sas_ata as a free-standing module that provides a SATA
 Translation Layer (SATL) for libsas.  This patch requires the libsas
 SATL registration patch; the changes to sas_ata itself are rather
 minor.
 
 Right at the moment, this doesn't work.  The dependency of ATA_AVAILABLE
 on SCSI_SAS_SATL forces libsas to require sas_ata if you select it as a
 module (i.e. they're not truly independent).
 
 How about this solution to untangle them?

ACK.

--D

Re: [PATCH 00/12] Roll-up of sas_ata patches

2007-02-04 Thread Darrick J. Wong

James Bottomley wrote:

 There's a problem somewhere with your error handler changes (which I
 picked up thanks to the problems with the V28 firmware).  What I see
 without your changes is that for a directly attached SATA device, when
 the firmware begins its death spiral, the commands all return and
 eventually send I/O errors to the filesystem.  With your patch series
 applied, it just loops forever giving messages like:
 
 Feb  3 12:07:06 localhost kernel: aic94xx: escb_tasklet_complete: phy5: LINK_RESET_ERROR
 Feb  3 12:07:06 localhost kernel: aic94xx: phy5: Receive FIS timeout
 Feb  3 12:07:06 localhost kernel: aic94xx: phy5: retries:0 performing link reset seq
 Feb  3 12:07:06 localhost kernel: sas: --- Exit sas_scsi_recover_host
 Feb  3 12:07:06 localhost kernel: aic94xx: control_phy_tasklet_complete: phy5, lrate:0x8, proto:0xe
 Feb  3 12:07:06 localhost kernel: sas: Enter sas_scsi_recover_host
 Feb  3 12:07:06 localhost kernel: sas: --- Exit sas_scsi_recover_host
 Feb  3 12:07:06 localhost kernel: sas: Enter sas_scsi_recover_host
 Feb  3 12:07:06 localhost kernel: sas: --- Exit sas_scsi_recover_host
 Feb  3 12:07:06 localhost kernel: sas: Enter sas_scsi_recover_host
 Feb  3 12:07:06 localhost kernel: sas: --- Exit sas_scsi_recover_host

Interesting, since the opposite happens with SAS disks. :)

The infinite loop is usually what happens if a scsi_cmnd gets pulled off
the eh queue without being scsi_eh_finish_cmnd()'d.  Can you send me the
whole dmesg?  It's possible that we're trying to abort a command, which
of course fails for a SATA disk, so we try bigger and bigger hammers
and the big hammers don't call scsi_eh_finish_cmnd().
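
As a toy illustration of that failure mode (a userspace model, not the
real SCSI EH code; struct toy_host, the toy_* functions, and the
MAX_PASSES cap are all made up for this sketch): recovery is re-entered
for as long as the failed count is nonzero, and only the "finish" step
ever lowers that count.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PASSES 5	/* arbitrary cap so the toy model terminates */

struct toy_host {
	int failed;	/* commands still awaiting error handling */
	int passes;	/* how many times recovery was entered */
};

/* Toy analogue of finishing a command: only this lowers the failed count. */
static void toy_finish_cmd(struct toy_host *h)
{
	assert(h->failed > 0);
	h->failed--;
}

/* One recovery pass; a buggy "big hammer" forgets to finish the command. */
static void toy_recover_pass(struct toy_host *h, bool hammer_finishes)
{
	h->passes++;
	if (hammer_finishes)
		toy_finish_cmd(h);
}

/* Re-enter recovery until nothing is failed; false means we hit the cap. */
static bool toy_run_eh(struct toy_host *h, bool hammer_finishes)
{
	while (h->failed > 0) {
		if (h->passes >= MAX_PASSES)
			return false;
		toy_recover_pass(h, hammer_finishes);
	}
	return true;
}
```

With hammer_finishes set, one pass clears the failed command.  Without
it, every pass enters and exits with nothing changed, which is roughly
the Enter/Exit pattern in the log above (the model stops at the cap; the
real thing would not).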

Did these SATA link reset errors only start showing up after the v28
firmware patch, or has this always happened?  I've noticed lately that I
get link reset errors if I run a short exercise on an ext3 filesystem on
a SATA disk, yet dd exercise runs just fine.  But I had also thought
that it was just my flaky hardware. :)

--D

[PATCH 00/12] Roll-up of sas_ata patches

2007-01-30 Thread Darrick J. Wong

Hi all,

This is a roll-up of all of my ATA related uncommitted patches against
libsas and aic94xx to date.  Per James Bottomley's request, I'm pushing
these patches out for further review in aic94xx-sas.  The big changes in
this patch set are a lot of bug and locking fixes, the conversion of the
EH routines to interact with the SAS EH strategy routines, and of course
the separation of the SATL code into a separate module.

These patches should apply in number order cleanly against 2.6.20-rc6 +
scsi_misc + scsi-rc-fixes + aic94xx-sas.  They've been fairly well tested
on a bunch of SATA disks in an x206m, though the ATAPI support is not so
well tested.  However, I have run these patches under other loads for a while.
Hopefully these patches are ready for more widespread testing in
scsi-misc, and thank you for any comments or feedback that you provide.

(Apologies for any stgit mail misconfiguration on my part.)

--D

[PATCH 01/12] sas_ata: Require CONFIG_ATA in Kconfig

2007-01-30 Thread Darrick J. Wong


Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/libsas/Kconfig b/drivers/scsi/libsas/Kconfig
index aafdc92..b64e391 100644
--- a/drivers/scsi/libsas/Kconfig
+++ b/drivers/scsi/libsas/Kconfig
@@ -24,7 +24,7 @@ #
 
 config SCSI_SAS_LIBSAS
 	tristate "SAS Domain Transport Attributes"
-	depends on SCSI
+	depends on SCSI && ATA
 	select SCSI_SAS_ATTRS
 	help
 	  This provides transport specific helpers for SAS drivers which

[PATCH 02/12] sas_ata: Satisfy libata qc function locking requirements

2007-01-30 Thread Darrick J. Wong


ata_qc_complete and ata_sas_queuecmd require that the port lock be held
when they are called.  sas_ata doesn't do this, leading to BUG messages
about newly allocated qc tags already being in use.  This patch
fixes the locking, which should clean up the rest of those messages.

So far I've tested this against an IBM x206m with two SATA disks with no
BUG messages and no other signs of things going wrong, and the machine
finally passed the pounder stress test.

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
---

 drivers/scsi/libsas/sas_ata.c       |    4 ++++
 drivers/scsi/libsas/sas_scsi_host.c |    4 ++++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index de42b5b..0bb1a14 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -92,7 +92,9 @@ static void sas_ata_task_done(struct sas
 	struct task_status_struct *stat = &task->task_status;
 	struct ata_task_resp *resp = (struct ata_task_resp *)stat->buf;
 	enum ata_completion_errors ac;
+	unsigned long flags;
 
+	spin_lock_irqsave(dev->sata_dev.ap->lock, flags);
 	if (stat->stat == SAS_PROTO_RESPONSE) {
 		ata_tf_from_fis(resp->ending_fis, &dev->sata_dev.tf);
 		qc->err_mask |= ac_err_mask(dev->sata_dev.tf.command);
@@ -113,6 +115,8 @@ static void sas_ata_task_done(struct sas
 	}
 
 	ata_qc_complete(qc);
+	spin_unlock_irqrestore(dev->sata_dev.ap->lock, flags);
+
 	list_del_init(&task->list);
 	sas_free_task(task);
 }
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 2cd478a..fee9c10 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -213,8 +213,12 @@ int sas_queuecommand(struct scsi_cmnd *c
 		struct sas_task *task;
 
 		if (dev_is_sata(dev)) {
+			unsigned long flags;
+
+			spin_lock_irqsave(dev->sata_dev.ap->lock, flags);
 			res = ata_sas_queuecmd(cmd, scsi_done,
 					       dev->sata_dev.ap);
+			spin_unlock_irqrestore(dev->sata_dev.ap->lock, flags);
 			goto out;
 		}
 