subject:"regression introduced by block\: Add support for DAX reads\/writes to block devices"

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-14 Thread Dan Williams

On Thu, Aug 13, 2015 at 12:32 PM, Wilcox, Matthew R
 wrote:
> I liked the patch you were pushing to request the *page* containing the 
> requested bytes instead of the *block* containing the requested bytes.
>
> For the misaligned partition problem, I was thinking we should change the 
> direct_access API to return a phys_addr_t instead of a pfn.  That way we can 
> return something that isn't actually page aligned, and DAX can take care of 
> making sure it doesn't overshoot the end.


If you go that route please make it __pfn_t + offset rather than
phys_addr_t so we can communicate the PFN_DEV flag among others.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-14 Thread Dan Williams

On Thu, Aug 13, 2015 at 12:32 PM, Wilcox, Matthew R
matthew.r.wil...@intel.com wrote:
 I liked the patch you were pushing to request the *page* containing the 
 requested bytes instead of the *block* containing the requested bytes.

 For the misaligned partition problem, I was thinking we should change the 
 direct_access API to return a phys_addr_t instead of a pfn.  That way we can 
 return something that isn't actually page aligned, and DAX can take care of 
 making sure it doesn't overshoot the end.


If you go that route please make it __pfn_t + offset rather than
phys_addr_t so we can communicate the PFN_DEV flag among others.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Wilcox, Matthew R

I liked the patch you were pushing to request the *page* containing the 
requested bytes instead of the *block* containing the requested bytes.

For the misaligned partition problem, I was thinking we should change the 
direct_access API to return a phys_addr_t instead of a pfn.  That way we can 
return something that isn't actually page aligned, and DAX can take care of 
making sure it doesn't overshoot the end.

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Thursday, August 13, 2015 11:19 AM
To: Linda Knippers
Cc: Boaz Harrosh; Wilcox, Matthew R; linux-kernel@vger.kernel.org; 
linux-fsde...@vger.kernel.org; Verma, Vishal L
Subject: Re: regression introduced by "block: Add support for DAX reads/writes 
to block devices"

Linda Knippers  writes:

>>> It causes the physical block size to be PAGE_SIZE but the
>>> logical block size is still 512.  However, the minimum_io_size
>>> is now 4096 (same as physical block size, I assume).  The
>>> optimal_io_size is still 0.  What does that mean?
>> 
>> physical block size - device's internal block size
>> logical block size - addressable unit
>
> Right, but it's still reported as 512 and that doesn't work.

Understood.  :)

>> optimal io size - device's preferred unit for streaming
>
> So 0 is ok.

Correct.

>> We can change the block device to export logical/physical block sizes of
>> PAGE_SIZE.  However, when persistent memory support comes to platforms
>> that support page sizes > 32k, xfs will again run into problems (Dave
>> Chinner mentioned that xfs can't deal with logical block sizes >32k.)
>> Arguably, you can use pmem and dax on such platforms using RAM today for
>> testing.  Do we care about breaking that?
>
> I would think so.  AARCH64 uses 64k pages today.

So does powerpc, but I guess nobody cares about that anymore.  ;-) If
the logical block size is smaller than the page size, we're going to
have to deal with sub-page I/O.  For now, we can do as Boaz suggested,
and just turn off dax for those configurations.  We could also just
revert the patch that introduced this problem.  I really don't know who
is going to care about O_DIRECT I/O performance to a persistent memory
block device.

Willy?  What was the real motivation there?

> I think Documentation/filesystems/dax.txt could use a little update
> too.  It has a section "Implementation Tips for Block Driver Writers"
> that makes it sound easy but now I wonder if it even works with the
> example ram drivers.  Should we be able to read any 512 byte
> "sector"?

If the logical block size is 512 bytes, then you have to be able to do
(direct) I/O to any 512 byte sector.  Simple as that.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Jeff Moyer

Linda Knippers  writes:

>>> It causes the physical block size to be PAGE_SIZE but the
>>> logical block size is still 512.  However, the minimum_io_size
>>> is now 4096 (same as physical block size, I assume).  The
>>> optimal_io_size is still 0.  What does that mean?
>> 
>> physical block size - device's internal block size
>> logical block size - addressable unit
>
> Right, but it's still reported as 512 and that doesn't work.

Understood.  :)

>> optimal io size - device's preferred unit for streaming
>
> So 0 is ok.

Correct.

>> We can change the block device to export logical/physical block sizes of
>> PAGE_SIZE.  However, when persistent memory support comes to platforms
>> that support page sizes > 32k, xfs will again run into problems (Dave
>> Chinner mentioned that xfs can't deal with logical block sizes >32k.)
>> Arguably, you can use pmem and dax on such platforms using RAM today for
>> testing.  Do we care about breaking that?
>
> I would think so.  AARCH64 uses 64k pages today.

So does powerpc, but I guess nobody cares about that anymore.  ;-) If
the logical block size is smaller than the page size, we're going to
have to deal with sub-page I/O.  For now, we can do as Boaz suggested,
and just turn off dax for those configurations.  We could also just
revert the patch that introduced this problem.  I really don't know who
is going to care about O_DIRECT I/O performance to a persistent memory
block device.

Willy?  What was the real motivation there?

> I think Documentation/filesystems/dax.txt could use a little update
> too.  It has a section "Implementation Tips for Block Driver Writers"
> that makes it sound easy but now I wonder if it even works with the
> example ram drivers.  Should we be able to read any 512 byte
> "sector"?

If the logical block size is 512 bytes, then you have to be able to do
(direct) I/O to any 512 byte sector.  Simple as that.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Linda Knippers

On 8/13/2015 1:14 PM, Jeff Moyer wrote:
> Linda Knippers  writes:
> 
>>> I'd be fine with changing the persistent memory block device to only
>>> support 4k logical, 4k physical block size.  That probably makes the
>>> most sense.
>>
>> If that's what we want, the current patch doesn't do that.
>> https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
>>
>> It causes the physical block size to be PAGE_SIZE but the
>> logical block size is still 512.  However, the minimum_io_size
>> is now 4096 (same as physical block size, I assume).  The
>> optimal_io_size is still 0.  What does that mean?
> 
> physical block size - device's internal block size
> logical block size - addressable unit

Right, but it's still reported as 512 and that doesn't work.

> optimal io size - device's preferred unit for streaming

So 0 is ok.

> minimum io size - device’s preferred minimum unit for random I/O
> 
> See Martin Petersen's "Linux & Advanced Storage Interfaces" document for
> more information.
> 
>> Whatever we go with, we should do something because 4.2rc6 is still
>> broken, unable to create a xfs file system on a pmem device, ever
>> since the change to use DAX on block devices with O_DIRECT.
> 
> We can change the block device to export logical/physical block sizes of
> PAGE_SIZE.  However, when persistent memory support comes to platforms
> that support page sizes > 32k, xfs will again run into problems (Dave
> Chinner mentioned that xfs can't deal with logical block sizes >32k.)
> Arguably, you can use pmem and dax on such platforms using RAM today for
> testing.  Do we care about breaking that?

I would think so.  AARCH64 uses 64k pages today.

I think Documentation/filesystems/dax.txt could use a little update
too.  It has a section "Implementation Tips for Block Driver Writers"
that makes it sound easy but now I wonder if it even works with the
example ram drivers.  Should we be able to read any 512 byte
"sector"?

-- ljk
> 
> Cheers,
> Jeff
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Jeff Moyer

Linda Knippers  writes:

>> I'd be fine with changing the persistent memory block device to only
>> support 4k logical, 4k physical block size.  That probably makes the
>> most sense.
>
> If that's what we want, the current patch doesn't do that.
> https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
>
> It causes the physical block size to be PAGE_SIZE but the
> logical block size is still 512.  However, the minimum_io_size
> is now 4096 (same as physical block size, I assume).  The
> optimal_io_size is still 0.  What does that mean?

physical block size - device's internal block size
logical block size - addressable unit
optimal io size - device's preferred unit for streaming
minimum io size - device’s preferred minimum unit for random I/O

See Martin Petersen's "Linux & Advanced Storage Interfaces" document for
more information.

> Whatever we go with, we should do something because 4.2rc6 is still
> broken, unable to create a xfs file system on a pmem device, ever
> since the change to use DAX on block devices with O_DIRECT.

We can change the block device to export logical/physical block sizes of
PAGE_SIZE.  However, when persistent memory support comes to platforms
that support page sizes > 32k, xfs will again run into problems (Dave
Chinner mentioned that xfs can't deal with logical block sizes >32k.)
Arguably, you can use pmem and dax on such platforms using RAM today for
testing.  Do we care about breaking that?

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Linda Knippers

On 8/13/2015 10:00 AM, Jeff Moyer wrote:
> Boaz Harrosh  writes:
> 
>> On 08/13/2015 12:11 AM, Jeff Moyer wrote:
>>> Boaz Harrosh  writes:
>>>
 On 08/07/2015 11:41 PM, Jeff Moyer wrote:
 <>
>
>> We need to cope with the case where the end of a partition isn't on a
>> page boundary though.
>
> Well, that's usually done by falling back to buffered I/O.  I gave that
> a try and panicked the box.  :)  I'll keep looking into it, but probably
> won't have another patch until next week.
>

 lets slow down for a sec, please.

 We have all established that an unaligned partition start is BAD and not 
 supported?
>>>
>>> No.  Unaligned partitions on RAID arrays or 512e devices are bad because
>>> they result in suboptimal performance.  They are most certainly still
>>> supported, though.
>>>
>>
>> What ?
>>
>> I meant for dax on pmem or brd. I meant that we *do not* support dax access
>> on an unaligned partition start. (None dax is perfectly supported)
> 
> Sorry, I thought your statement was broader than that.
> 
>> We did it this way because of the direct_access API that returns a pfn
>> with is PAGE_SIZE. We could introduce a pfn+offset but we thought it is
>> not worth it, and that dax should be page aligned for code simplicity
> 
> I'd be fine with changing the persistent memory block device to only
> support 4k logical, 4k physical block size.  That probably makes the
> most sense.

If that's what we want, the current patch doesn't do that.
https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html

It causes the physical block size to be PAGE_SIZE but the
logical block size is still 512.  However, the minimum_io_size
is now 4096 (same as physical block size, I assume).  The
optimal_io_size is still 0.  What does that mean?

Whatever we go with, we should do something because 4.2rc6 is still
broken, unable to create a xfs file system on a pmem device, ever
since the change to use DAX on block devices with O_DIRECT.

-- ljk
> 
> Cheers,
> Jeff
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Jeff Moyer

Boaz Harrosh  writes:

> On 08/13/2015 12:11 AM, Jeff Moyer wrote:
>> Boaz Harrosh  writes:
>> 
>>> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
>>> <>

> We need to cope with the case where the end of a partition isn't on a
> page boundary though.

 Well, that's usually done by falling back to buffered I/O.  I gave that
 a try and panicked the box.  :)  I'll keep looking into it, but probably
 won't have another patch until next week.

>>>
>>> lets slow down for a sec, please.
>>>
>>> We have all established that an unaligned partition start is BAD and not 
>>> supported?
>> 
>> No.  Unaligned partitions on RAID arrays or 512e devices are bad because
>> they result in suboptimal performance.  They are most certainly still
>> supported, though.
>> 
>
> What ?
>
> I meant for dax on pmem or brd. I meant that we *do not* support dax access
> on an unaligned partition start. (None dax is perfectly supported)

Sorry, I thought your statement was broader than that.

> We did it this way because of the direct_access API that returns a pfn
> with is PAGE_SIZE. We could introduce a pfn+offset but we thought it is
> not worth it, and that dax should be page aligned for code simplicity

I'd be fine with changing the persistent memory block device to only
support 4k logical, 4k physical block size.  That probably makes the
most sense.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-13 Thread Jeff Moyer

Boaz Harrosh b...@plexistor.com writes:

 On 08/13/2015 12:11 AM, Jeff Moyer wrote:
 Boaz Harrosh b...@plexistor.com writes:
 
 On 08/07/2015 11:41 PM, Jeff Moyer wrote:
 

 We need to cope with the case where the end of a partition isn't on a
 page boundary though.

 Well, that's usually done by falling back to buffered I/O.  I gave that
 a try and panicked the box.  :)  I'll keep looking into it, but probably
 won't have another patch until next week.


 lets slow down for a sec, please.

 We have all established that an unaligned partition start is BAD and not 
 supported?
 
 No.  Unaligned partitions on RAID arrays or 512e devices are bad because
 they result in suboptimal performance.  They are most certainly still
 supported, though.
 

 What ?

 I meant for dax on pmem or brd. I meant that we *do not* support dax access
 on an unaligned partition start. (None dax is perfectly supported)

Sorry, I thought your statement was broader than that.

 We did it this way because of the direct_access API that returns a pfn
 with is PAGE_SIZE. We could introduce a pfn+offset but we thought it is
 not worth it, and that dax should be page aligned for code simplicity

I'd be fine with changing the persistent memory block device to only
support 4k logical, 4k physical block size.  That probably makes the
most sense.

Cheers,
Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-13 Thread Linda Knippers

On 8/13/2015 10:00 AM, Jeff Moyer wrote:
 Boaz Harrosh b...@plexistor.com writes:
 
 On 08/13/2015 12:11 AM, Jeff Moyer wrote:
 Boaz Harrosh b...@plexistor.com writes:

 On 08/07/2015 11:41 PM, Jeff Moyer wrote:
 

 We need to cope with the case where the end of a partition isn't on a
 page boundary though.

 Well, that's usually done by falling back to buffered I/O.  I gave that
 a try and panicked the box.  :)  I'll keep looking into it, but probably
 won't have another patch until next week.


 lets slow down for a sec, please.

 We have all established that an unaligned partition start is BAD and not 
 supported?

 No.  Unaligned partitions on RAID arrays or 512e devices are bad because
 they result in suboptimal performance.  They are most certainly still
 supported, though.


 What ?

 I meant for dax on pmem or brd. I meant that we *do not* support dax access
 on an unaligned partition start. (None dax is perfectly supported)
 
 Sorry, I thought your statement was broader than that.
 
 We did it this way because of the direct_access API that returns a pfn
 with is PAGE_SIZE. We could introduce a pfn+offset but we thought it is
 not worth it, and that dax should be page aligned for code simplicity
 
 I'd be fine with changing the persistent memory block device to only
 support 4k logical, 4k physical block size.  That probably makes the
 most sense.

If that's what we want, the current patch doesn't do that.
https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html

It causes the physical block size to be PAGE_SIZE but the
logical block size is still 512.  However, the minimum_io_size
is now 4096 (same as physical block size, I assume).  The
optimal_io_size is still 0.  What does that mean?

Whatever we go with, we should do something because 4.2rc6 is still
broken, unable to create a xfs file system on a pmem device, ever
since the change to use DAX on block devices with O_DIRECT.

-- ljk
 
 Cheers,
 Jeff
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-13 Thread Linda Knippers

On 8/13/2015 1:14 PM, Jeff Moyer wrote:
 Linda Knippers linda.knipp...@hp.com writes:
 
 I'd be fine with changing the persistent memory block device to only
 support 4k logical, 4k physical block size.  That probably makes the
 most sense.

 If that's what we want, the current patch doesn't do that.
 https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html

 It causes the physical block size to be PAGE_SIZE but the
 logical block size is still 512.  However, the minimum_io_size
 is now 4096 (same as physical block size, I assume).  The
 optimal_io_size is still 0.  What does that mean?
 
 physical block size - device's internal block size
 logical block size - addressable unit

Right, but it's still reported as 512 and that doesn't work.

 optimal io size - device's preferred unit for streaming

So 0 is ok.

 minimum io size - device’s preferred minimum unit for random I/O
 
 See Martin Petersen's Linux  Advanced Storage Interfaces document for
 more information.
 
 Whatever we go with, we should do something because 4.2rc6 is still
 broken, unable to create a xfs file system on a pmem device, ever
 since the change to use DAX on block devices with O_DIRECT.
 
 We can change the block device to export logical/physical block sizes of
 PAGE_SIZE.  However, when persistent memory support comes to platforms
 that support page sizes  32k, xfs will again run into problems (Dave
 Chinner mentioned that xfs can't deal with logical block sizes 32k.)
 Arguably, you can use pmem and dax on such platforms using RAM today for
 testing.  Do we care about breaking that?

I would think so.  AARCH64 uses 64k pages today.

I think Documentation/filesystems/dax.txt could use a little update
too.  It has a section Implementation Tips for Block Driver Writers
that makes it sound easy but now I wonder if it even works with the
example ram drivers.  Should we be able to read any 512 byte
sector?

-- ljk
 
 Cheers,
 Jeff
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-13 Thread Jeff Moyer

Linda Knippers linda.knipp...@hp.com writes:

 I'd be fine with changing the persistent memory block device to only
 support 4k logical, 4k physical block size.  That probably makes the
 most sense.

 If that's what we want, the current patch doesn't do that.
 https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html

 It causes the physical block size to be PAGE_SIZE but the
 logical block size is still 512.  However, the minimum_io_size
 is now 4096 (same as physical block size, I assume).  The
 optimal_io_size is still 0.  What does that mean?

physical block size - device's internal block size
logical block size - addressable unit
optimal io size - device's preferred unit for streaming
minimum io size - device’s preferred minimum unit for random I/O

See Martin Petersen's Linux  Advanced Storage Interfaces document for
more information.

 Whatever we go with, we should do something because 4.2rc6 is still
 broken, unable to create a xfs file system on a pmem device, ever
 since the change to use DAX on block devices with O_DIRECT.

We can change the block device to export logical/physical block sizes of
PAGE_SIZE.  However, when persistent memory support comes to platforms
that support page sizes  32k, xfs will again run into problems (Dave
Chinner mentioned that xfs can't deal with logical block sizes 32k.)
Arguably, you can use pmem and dax on such platforms using RAM today for
testing.  Do we care about breaking that?

Cheers,
Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-13 Thread Jeff Moyer

Linda Knippers linda.knipp...@hp.com writes:

 It causes the physical block size to be PAGE_SIZE but the
 logical block size is still 512.  However, the minimum_io_size
 is now 4096 (same as physical block size, I assume).  The
 optimal_io_size is still 0.  What does that mean?
 
 physical block size - device's internal block size
 logical block size - addressable unit

 Right, but it's still reported as 512 and that doesn't work.

Understood.  :)

 optimal io size - device's preferred unit for streaming

 So 0 is ok.

Correct.

 We can change the block device to export logical/physical block sizes of
 PAGE_SIZE.  However, when persistent memory support comes to platforms
 that support page sizes  32k, xfs will again run into problems (Dave
 Chinner mentioned that xfs can't deal with logical block sizes 32k.)
 Arguably, you can use pmem and dax on such platforms using RAM today for
 testing.  Do we care about breaking that?

 I would think so.  AARCH64 uses 64k pages today.

So does powerpc, but I guess nobody cares about that anymore.  ;-) If
the logical block size is smaller than the page size, we're going to
have to deal with sub-page I/O.  For now, we can do as Boaz suggested,
and just turn off dax for those configurations.  We could also just
revert the patch that introduced this problem.  I really don't know who
is going to care about O_DIRECT I/O performance to a persistent memory
block device.

Willy?  What was the real motivation there?

 I think Documentation/filesystems/dax.txt could use a little update
 too.  It has a section Implementation Tips for Block Driver Writers
 that makes it sound easy but now I wonder if it even works with the
 example ram drivers.  Should we be able to read any 512 byte
 sector?

If the logical block size is 512 bytes, then you have to be able to do
(direct) I/O to any 512 byte sector.  Simple as that.

Cheers,
Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-13 Thread Wilcox, Matthew R

I liked the patch you were pushing to request the *page* containing the 
requested bytes instead of the *block* containing the requested bytes.

For the misaligned partition problem, I was thinking we should change the 
direct_access API to return a phys_addr_t instead of a pfn.  That way we can 
return something that isn't actually page aligned, and DAX can take care of 
making sure it doesn't overshoot the end.

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Thursday, August 13, 2015 11:19 AM
To: Linda Knippers
Cc: Boaz Harrosh; Wilcox, Matthew R; linux-kernel@vger.kernel.org; 
linux-fsde...@vger.kernel.org; Verma, Vishal L
Subject: Re: regression introduced by block: Add support for DAX reads/writes 
to block devices

Linda Knippers linda.knipp...@hp.com writes:

 It causes the physical block size to be PAGE_SIZE but the
 logical block size is still 512.  However, the minimum_io_size
 is now 4096 (same as physical block size, I assume).  The
 optimal_io_size is still 0.  What does that mean?
 
 physical block size - device's internal block size
 logical block size - addressable unit

 Right, but it's still reported as 512 and that doesn't work.

Understood.  :)

 optimal io size - device's preferred unit for streaming

 So 0 is ok.

Correct.

 We can change the block device to export logical/physical block sizes of
 PAGE_SIZE.  However, when persistent memory support comes to platforms
 that support page sizes  32k, xfs will again run into problems (Dave
 Chinner mentioned that xfs can't deal with logical block sizes 32k.)
 Arguably, you can use pmem and dax on such platforms using RAM today for
 testing.  Do we care about breaking that?

 I would think so.  AARCH64 uses 64k pages today.

So does powerpc, but I guess nobody cares about that anymore.  ;-) If
the logical block size is smaller than the page size, we're going to
have to deal with sub-page I/O.  For now, we can do as Boaz suggested,
and just turn off dax for those configurations.  We could also just
revert the patch that introduced this problem.  I really don't know who
is going to care about O_DIRECT I/O performance to a persistent memory
block device.

Willy?  What was the real motivation there?

 I think Documentation/filesystems/dax.txt could use a little update
 too.  It has a section Implementation Tips for Block Driver Writers
 that makes it sound easy but now I wonder if it even works with the
 example ram drivers.  Should we be able to read any 512 byte
 sector?

If the logical block size is 512 bytes, then you have to be able to do
(direct) I/O to any 512 byte sector.  Simple as that.

Cheers,
Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-12 Thread Boaz Harrosh

On 08/13/2015 12:11 AM, Jeff Moyer wrote:
> Boaz Harrosh  writes:
> 
>> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
>> <>
>>>
 We need to cope with the case where the end of a partition isn't on a
 page boundary though.
>>>
>>> Well, that's usually done by falling back to buffered I/O.  I gave that
>>> a try and panicked the box.  :)  I'll keep looking into it, but probably
>>> won't have another patch until next week.
>>>
>>
>> lets slow down for a sec, please.
>>
>> We have all established that an unaligned partition start is BAD and not 
>> supported?
> 
> No.  Unaligned partitions on RAID arrays or 512e devices are bad because
> they result in suboptimal performance.  They are most certainly still
> supported, though.
> 

What ?

I meant for dax on pmem or brd. I meant that we *do not* support dax access
on an unaligned partition start. (None dax is perfectly supported)

We did it this way because of the direct_access API that returns a pfn
with is PAGE_SIZE. We could introduce a pfn+offset but we thought it is
not worth it, and that dax should be page aligned for code simplicity

Cheers
Boaz

> -Jeff
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-12 Thread Jeff Moyer

Boaz Harrosh  writes:

> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
> <>
>> 
>>> We need to cope with the case where the end of a partition isn't on a
>>> page boundary though.
>> 
>> Well, that's usually done by falling back to buffered I/O.  I gave that
>> a try and panicked the box.  :)  I'll keep looking into it, but probably
>> won't have another patch until next week.
>> 
>
> lets slow down for a sec, please.
>
> We have all established that an unaligned partition start is BAD and not 
> supported?

No.  Unaligned partitions on RAID arrays or 512e devices are bad because
they result in suboptimal performance.  They are most certainly still
supported, though.

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-12 Thread Jeff Moyer

Boaz Harrosh b...@plexistor.com writes:

 On 08/07/2015 11:41 PM, Jeff Moyer wrote:
 
 
 We need to cope with the case where the end of a partition isn't on a
 page boundary though.
 
 Well, that's usually done by falling back to buffered I/O.  I gave that
 a try and panicked the box.  :)  I'll keep looking into it, but probably
 won't have another patch until next week.
 

 lets slow down for a sec, please.

 We have all established that an unaligned partition start is BAD and not 
 supported?

No.  Unaligned partitions on RAID arrays or 512e devices are bad because
they result in suboptimal performance.  They are most certainly still
supported, though.

-Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-12 Thread Boaz Harrosh

On 08/13/2015 12:11 AM, Jeff Moyer wrote:
 Boaz Harrosh b...@plexistor.com writes:
 
 On 08/07/2015 11:41 PM, Jeff Moyer wrote:
 

 We need to cope with the case where the end of a partition isn't on a
 page boundary though.

 Well, that's usually done by falling back to buffered I/O.  I gave that
 a try and panicked the box.  :)  I'll keep looking into it, but probably
 won't have another patch until next week.


 lets slow down for a sec, please.

 We have all established that an unaligned partition start is BAD and not 
 supported?
 
 No.  Unaligned partitions on RAID arrays or 512e devices are bad because
 they result in suboptimal performance.  They are most certainly still
 supported, though.
 

What ?

I meant for dax on pmem or brd. I meant that we *do not* support dax access
on an unaligned partition start. (None dax is perfectly supported)

We did it this way because of the direct_access API that returns a pfn
with is PAGE_SIZE. We could introduce a pfn+offset but we thought it is
not worth it, and that dax should be page aligned for code simplicity

Cheers
Boaz

 -Jeff
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-10 Thread Linda Knippers

On 8/10/2015 5:27 PM, Dave Chinner wrote:
> On Mon, Aug 10, 2015 at 12:32:08PM -0400, Linda Knippers wrote:
>> On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
>>> On 08/06/2015 11:34 PM, Dave Chinner wrote:
 On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> <>
>
> I sat down with Linda to look into it, and the problem is that 
> mkfs.xfs
> sets the blocksize of the device to 512 (via BLKBSZSET), and then 
> reads
> from the last sector of the device.  This results in dax_io trying to 
> do
> a page-sized I/O at 512 bytes from the end of the device.

>
> This part I do not understand. how is mkfs.xfs reading the sector?
> Is it through open(/dev/pmem0,...) ? O_DIRECT?

 mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
 told that it is working on an image file does it fall back to
 buffered IO. All of the XFS userspace tools work this way to prevent
 page cache pollution issues with read-once or write-once data during
 operation.
> 
>> That patch does cause 'mkfs -t xfs' to work.
>>
>> Before:
>> $ sudo mkfs -t xfs -f /dev/pmem3
>> meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
>>  =   sectsz=512   attr=2, projid32bit=1
>^^
> 
> 
>> $ sudo mkfs -t xfs -f /dev/pmem3
>> meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
>>  =   sectsz=4096  attr=2, projid32bit=1
>^^^
> 
> So in the after case, mkfs.xfs is behaving differently and not
> exercising the bug. It's seen the:
> 
>> $ cat /sys/block/pmem3/queue/logical_block_size
>> 512
>> $ cat /sys/block/pmem3/queue/physical_block_size
>> 4096
>   
> 
> 4k physical block size, and hence configured the filesystem with a
> 4k sector size so all IO it issues is physicallly aligned. IOWs,
> mkfs.xfs's last sector read is 4k aligned and sized, and therefore
> the test has not confirmed that the patch fixes the 512 byte last
> sector read is fixed at all.

That is true.  All I reported is that mkfs.xfs now works.  I
had some questions about whether that was right.

The underlying problem is still there, as I can demonstrate with
a simple reproducer that just does what mkfs.xfs would have done
before.

> Isn't there a regression test suite that covers basic block device
> functionality that you can use to test these simple corner cases?

If there is, seems like DAX adds a few twists.

-- ljk
> 
> Cheers,
> 
> Dave.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-10 Thread Dave Chinner

On Mon, Aug 10, 2015 at 12:32:08PM -0400, Linda Knippers wrote:
> On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
> > On 08/06/2015 11:34 PM, Dave Chinner wrote:
> >> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
> >>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>  On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> > On 08/05/2015 06:01 PM, Dave Chinner wrote:
> >> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> >>> <>
> >>>
> >>> I sat down with Linda to look into it, and the problem is that 
> >>> mkfs.xfs
> >>> sets the blocksize of the device to 512 (via BLKBSZSET), and then 
> >>> reads
> >>> from the last sector of the device.  This results in dax_io trying to 
> >>> do
> >>> a page-sized I/O at 512 bytes from the end of the device.
> >>
> >>>
> >>> This part I do not understand. how is mkfs.xfs reading the sector?
> >>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
> >>
> >> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
> >> told that it is working on an image file does it fall back to
> >> buffered IO. All of the XFS userspace tools work this way to prevent
> >> page cache pollution issues with read-once or write-once data during
> >> operation.

> That patch does cause 'mkfs -t xfs' to work.
> 
> Before:
> $ sudo mkfs -t xfs -f /dev/pmem3
> meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
>  =   sectsz=512   attr=2, projid32bit=1
   ^^


> $ sudo mkfs -t xfs -f /dev/pmem3
> meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
>  =   sectsz=4096  attr=2, projid32bit=1
   ^^^

So in the after case, mkfs.xfs is behaving differently and not
exercising the bug. It's seen the:

> $ cat /sys/block/pmem3/queue/logical_block_size
> 512
> $ cat /sys/block/pmem3/queue/physical_block_size
> 4096
  

4k physical block size, and hence configured the filesystem with a
4k sector size so all IO it issues is physicallly aligned. IOWs,
mkfs.xfs's last sector read is 4k aligned and sized, and therefore
the test has not confirmed that the patch fixes the 512 byte last
sector read is fixed at all.

Isn't there a regression test suite that covers basic block device
functionality that you can use to test these simple corner cases?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-10 Thread Linda Knippers

On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
> On 08/06/2015 11:34 PM, Dave Chinner wrote:
>> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
>>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>>> <>
>>>
>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>>> from the last sector of the device.  This results in dax_io trying to do
>>> a page-sized I/O at 512 bytes from the end of the device.
>>
>>>
>>> This part I do not understand. how is mkfs.xfs reading the sector?
>>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
>>
>> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
>> told that it is working on an image file does it fall back to
>> buffered IO. All of the XFS userspace tools work this way to prevent
>> page cache pollution issues with read-once or write-once data during
>> operation.
>>
> 
> Thanks, yes makes sense. This is a bug at the DAX implementation of
> bdev. Since as you know with DAX there is no difference between
> O_DIRECT and buffered, we must support any aligned IO. I bet it
> should be something with bdev not giving 4K buffer-heads to dax.c.
> 
> Or ... It might just be the infamous bug where the actual partition
> they used was not 4k aligned on its start sector. So the last sector IO
> after partition translation came out wrong. This bug then should be
> fixed by: https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
> by:Vishal Verma
> 
> Vishal I think we should add CC: sta...@vger.kernel.org to your patch
> because of these fdisk bugs.

That patch does cause 'mkfs -t xfs' to work.

Before:
$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

After:

$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
 =   sectsz=4096  attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

$ cat /sys/block/pmem3/queue/logical_block_size
512
$ cat /sys/block/pmem3/queue/physical_block_size
4096
$ cat /sys/block/pmem3/queue/hw_sector_size
512
$ cat /sys/block/pmem3/queue/minimum_io_size
4096

Previously physical_block_size was 512 and minimum_io_size was 0.
What about logical_block_size and hw_sector_size still being 512?

So do we want to change pmem rather than changing DAX?

-- ljk
> 
>> Cheers,
>> Dave.
> 
> Thanks
> Boaz
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-10 Thread Boaz Harrosh

On 08/07/2015 11:41 PM, Jeff Moyer wrote:
<>
> 
>> We need to cope with the case where the end of a partition isn't on a
>> page boundary though.
> 
> Well, that's usually done by falling back to buffered I/O.  I gave that
> a try and panicked the box.  :)  I'll keep looking into it, but probably
> won't have another patch until next week.
> 

lets slow down for a sec, please.

We have all established that an unaligned partition start is BAD and not 
supported?
(If we want to also support this, which is possible then all the below is mute)

Well we do know that any real pmem device will actually be 128M aligned because 
of
how the DIMM thing work. So any start-aligned partition means length aligned as 
well.
(Emulated pmem is 4k aligned as well)

That said, you might want to protect against unaligned start / length. Even 
though
we have the 4k physical sector patch, a forced fdisk could produce such a 
partition.

I would suggest that for such un-aligned partitions the code in bdev that sets
IS_DAX on the bdev-inode should only do so if the start / length is aligned.
Else it becomes a !IS_DAX inode and all is fine.
Users will learn soon enough that for dax you need to keep the recommended 4k
alignments.

In the DAX filesystem case the start-un-aligned case is fatal and the FS would
not mount (dax enabled) . The length-un-aligned case is not a problem because
the 4k block size mandatory for dax will trim the device length to an FS block
boundary.

So it is only a problem with RAW bdev inodes and I would just not let it be
IS_DAX in the un-aligned case. (or trim its size)

Thanks
Boaz

> Cheers,
> Jeff

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-10 Thread Linda Knippers

On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
 On 08/06/2015 11:34 PM, Dave Chinner wrote:
 On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
 On 08/06/2015 06:24 AM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
 On 08/05/2015 06:01 PM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
 

 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.


 This part I do not understand. how is mkfs.xfs reading the sector?
 Is it through open(/dev/pmem0,...) ? O_DIRECT?

 mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
 told that it is working on an image file does it fall back to
 buffered IO. All of the XFS userspace tools work this way to prevent
 page cache pollution issues with read-once or write-once data during
 operation.

 
 Thanks, yes makes sense. This is a bug at the DAX implementation of
 bdev. Since as you know with DAX there is no difference between
 O_DIRECT and buffered, we must support any aligned IO. I bet it
 should be something with bdev not giving 4K buffer-heads to dax.c.
 
 Or ... It might just be the infamous bug where the actual partition
 they used was not 4k aligned on its start sector. So the last sector IO
 after partition translation came out wrong. This bug then should be
 fixed by: https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
 by:Vishal Verma
 
 Vishal I think we should add CC: sta...@vger.kernel.org to your patch
 because of these fdisk bugs.

That patch does cause 'mkfs -t xfs' to work.

Before:
$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

After:

$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
 =   sectsz=4096  attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

$ cat /sys/block/pmem3/queue/logical_block_size
512
$ cat /sys/block/pmem3/queue/physical_block_size
4096
$ cat /sys/block/pmem3/queue/hw_sector_size
512
$ cat /sys/block/pmem3/queue/minimum_io_size
4096

Previously physical_block_size was 512 and minimum_io_size was 0.
What about logical_block_size and hw_sector_size still being 512?

So do we want to change pmem rather than changing DAX?

-- ljk
 
 Cheers,
 Dave.
 
 Thanks
 Boaz
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-10 Thread Dave Chinner

On Mon, Aug 10, 2015 at 12:32:08PM -0400, Linda Knippers wrote:
 On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
  On 08/06/2015 11:34 PM, Dave Chinner wrote:
  On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
  On 08/06/2015 06:24 AM, Dave Chinner wrote:
  On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
  On 08/05/2015 06:01 PM, Dave Chinner wrote:
  On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
  
 
  I sat down with Linda to look into it, and the problem is that 
  mkfs.xfs
  sets the blocksize of the device to 512 (via BLKBSZSET), and then 
  reads
  from the last sector of the device.  This results in dax_io trying to 
  do
  a page-sized I/O at 512 bytes from the end of the device.
 
 
  This part I do not understand. how is mkfs.xfs reading the sector?
  Is it through open(/dev/pmem0,...) ? O_DIRECT?
 
  mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
  told that it is working on an image file does it fall back to
  buffered IO. All of the XFS userspace tools work this way to prevent
  page cache pollution issues with read-once or write-once data during
  operation.

 That patch does cause 'mkfs -t xfs' to work.
 
 Before:
 $ sudo mkfs -t xfs -f /dev/pmem3
 meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
  =   sectsz=512   attr=2, projid32bit=1
   ^^


 $ sudo mkfs -t xfs -f /dev/pmem3
 meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
  =   sectsz=4096  attr=2, projid32bit=1
   ^^^

So in the after case, mkfs.xfs is behaving differently and not
exercising the bug. It's seen the:

 $ cat /sys/block/pmem3/queue/logical_block_size
 512
 $ cat /sys/block/pmem3/queue/physical_block_size
 4096
  

4k physical block size, and hence configured the filesystem with a
4k sector size so all IO it issues is physicallly aligned. IOWs,
mkfs.xfs's last sector read is 4k aligned and sized, and therefore
the test has not confirmed that the patch fixes the 512 byte last
sector read is fixed at all.

Isn't there a regression test suite that covers basic block device
functionality that you can use to test these simple corner cases?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-10 Thread Linda Knippers

On 8/10/2015 5:27 PM, Dave Chinner wrote:
 On Mon, Aug 10, 2015 at 12:32:08PM -0400, Linda Knippers wrote:
 On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
 On 08/06/2015 11:34 PM, Dave Chinner wrote:
 On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
 On 08/06/2015 06:24 AM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
 On 08/05/2015 06:01 PM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
 

 I sat down with Linda to look into it, and the problem is that 
 mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then 
 reads
 from the last sector of the device.  This results in dax_io trying to 
 do
 a page-sized I/O at 512 bytes from the end of the device.


 This part I do not understand. how is mkfs.xfs reading the sector?
 Is it through open(/dev/pmem0,...) ? O_DIRECT?

 mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
 told that it is working on an image file does it fall back to
 buffered IO. All of the XFS userspace tools work this way to prevent
 page cache pollution issues with read-once or write-once data during
 operation.
 
 That patch does cause 'mkfs -t xfs' to work.

 Before:
 $ sudo mkfs -t xfs -f /dev/pmem3
 meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
  =   sectsz=512   attr=2, projid32bit=1
^^
 
 
 $ sudo mkfs -t xfs -f /dev/pmem3
 meta-data=/dev/pmem3 isize=256agcount=4, agsize=524288 blks
  =   sectsz=4096  attr=2, projid32bit=1
^^^
 
 So in the after case, mkfs.xfs is behaving differently and not
 exercising the bug. It's seen the:
 
 $ cat /sys/block/pmem3/queue/logical_block_size
 512
 $ cat /sys/block/pmem3/queue/physical_block_size
 4096
   
 
 4k physical block size, and hence configured the filesystem with a
 4k sector size so all IO it issues is physicallly aligned. IOWs,
 mkfs.xfs's last sector read is 4k aligned and sized, and therefore
 the test has not confirmed that the patch fixes the 512 byte last
 sector read is fixed at all.

That is true.  All I reported is that mkfs.xfs now works.  I
had some questions about whether that was right.

The underlying problem is still there, as I can demonstrate with
a simple reproducer that just does what mkfs.xfs would have done
before.

 Isn't there a regression test suite that covers basic block device
 functionality that you can use to test these simple corner cases?

If there is, seems like DAX adds a few twists.

-- ljk
 
 Cheers,
 
 Dave.
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-10 Thread Boaz Harrosh

On 08/07/2015 11:41 PM, Jeff Moyer wrote:

 
 We need to cope with the case where the end of a partition isn't on a
 page boundary though.
 
 Well, that's usually done by falling back to buffered I/O.  I gave that
 a try and panicked the box.  :)  I'll keep looking into it, but probably
 won't have another patch until next week.
 

lets slow down for a sec, please.

We have all established that an unaligned partition start is BAD and not 
supported?
(If we want to also support this, which is possible then all the below is mute)

Well we do know that any real pmem device will actually be 128M aligned because 
of
how the DIMM thing work. So any start-aligned partition means length aligned as 
well.
(Emulated pmem is 4k aligned as well)

That said, you might want to protect against unaligned start / length. Even 
though
we have the 4k physical sector patch, a forced fdisk could produce such a 
partition.

I would suggest that for such un-aligned partitions the code in bdev that sets
IS_DAX on the bdev-inode should only do so if the start / length is aligned.
Else it becomes a !IS_DAX inode and all is fine.
Users will learn soon enough that for dax you need to keep the recommended 4k
alignments.

In the DAX filesystem case the start-un-aligned case is fatal and the FS would
not mount (dax enabled) . The length-un-aligned case is not a problem because
the 4k block size mandatory for dax will trim the device length to an FS block
boundary.

So it is only a problem with RAW bdev inodes and I would just not let it be
IS_DAX in the un-aligned case. (or trim its size)

Thanks
Boaz

 Cheers,
 Jeff


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-09 Thread Boaz Harrosh

On 08/06/2015 11:34 PM, Dave Chinner wrote:
> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
 On 08/05/2015 06:01 PM, Dave Chinner wrote:
> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>> <>
>>
>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>> from the last sector of the device.  This results in dax_io trying to do
>> a page-sized I/O at 512 bytes from the end of the device.
>
>>
>> This part I do not understand. how is mkfs.xfs reading the sector?
>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
> 
> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
> told that it is working on an image file does it fall back to
> buffered IO. All of the XFS userspace tools work this way to prevent
> page cache pollution issues with read-once or write-once data during
> operation.
> 

Thanks, yes makes sense. This is a bug at the DAX implementation of
bdev. Since as you know with DAX there is no difference between
O_DIRECT and buffered, we must support any aligned IO. I bet it
should be something with bdev not giving 4K buffer-heads to dax.c.

Or ... It might just be the infamous bug where the actual partition
they used was not 4k aligned on its start sector. So the last sector IO
after partition translation came out wrong. This bug then should be
fixed by: https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
by:Vishal Verma

Vishal I think we should add CC: sta...@vger.kernel.org to your patch
because of these fdisk bugs.

> Cheers,
> Dave.

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-09 Thread Boaz Harrosh

On 08/06/2015 11:34 PM, Dave Chinner wrote:
 On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
 On 08/06/2015 06:24 AM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
 On 08/05/2015 06:01 PM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
 

 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.


 This part I do not understand. how is mkfs.xfs reading the sector?
 Is it through open(/dev/pmem0,...) ? O_DIRECT?
 
 mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
 told that it is working on an image file does it fall back to
 buffered IO. All of the XFS userspace tools work this way to prevent
 page cache pollution issues with read-once or write-once data during
 operation.
 

Thanks, yes makes sense. This is a bug at the DAX implementation of
bdev. Since as you know with DAX there is no difference between
O_DIRECT and buffered, we must support any aligned IO. I bet it
should be something with bdev not giving 4K buffer-heads to dax.c.

Or ... It might just be the infamous bug where the actual partition
they used was not 4k aligned on its start sector. So the last sector IO
after partition translation came out wrong. This bug then should be
fixed by: https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
by:Vishal Verma

Vishal I think we should add CC: sta...@vger.kernel.org to your patch
because of these fdisk bugs.

 Cheers,
 Dave.

Thanks
Boaz

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-07 Thread Jeff Moyer

"Wilcox, Matthew R"  writes:

> So ... what you're doing here is if somebody requests the last 512
> bytes, you're asking for the last page.  That's going to work as long
> as the partition is a multiple of PAGE_SIZE, but not if the partition
> happens to be misaligned.  It's an improvement, although I'd be
> tempted to do this as:
>
>   if (pos == max) {
>   unsigned blkbits = inode->i_blkbits;
> - sector_t block = pos >> blkbits;
> + long page = pos >> PAGE_SHIFT;
> + sector_t block = page << (PAGE_SHIFT - blkbits);
>   unsigned first = pos - (block << blkbits);
>   long size;

Yeah, that's easier to read.

> We need to cope with the case where the end of a partition isn't on a
> page boundary though.

Well, that's usually done by falling back to buffered I/O.  I gave that
a try and panicked the box.  :)  I'll keep looking into it, but probably
won't have another patch until next week.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-07 Thread Wilcox, Matthew R

So ... what you're doing here is if somebody requests the last 512 bytes, 
you're asking for the last page.  That's going to work as long as the partition 
is a multiple of PAGE_SIZE, but not if the partition happens to be misaligned.  
It's an improvement, although I'd be tempted to do this as:

if (pos == max) {
unsigned blkbits = inode->i_blkbits;
-   sector_t block = pos >> blkbits;
+   long page = pos >> PAGE_SHIFT;
+   sector_t block = page << (PAGE_SHIFT - blkbits);
unsigned first = pos - (block << blkbits);
long size;

We need to cope with the case where the end of a partition isn't on a page 
boundary though.

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Thursday, August 06, 2015 2:31 PM
To: Wilcox, Matthew R
Cc: linda.knipp...@hp.com; linux-kernel@vger.kernel.org; 
linux-fsde...@vger.kernel.org
Subject: Re: regression introduced by "block: Add support for DAX reads/writes 
to block devices"

"Wilcox, Matthew R"  writes:

> I think I see the problem.  I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

This works for me.  If it looks okay to others, I'll submit a properly
signed-off patch.

Cheers,
Jeff

diff --git a/fs/dax.c b/fs/dax.c
index a7f77e1..b6c4f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -98,6 +98,10 @@ static bool buffer_size_valid(struct buffer_head *bh)
return bh->b_state != 0;
 }
 
+/*
+ * This function assumes file system block size (represented by
+ * inode->i_blkbits) is less than or equal to the system page size.
+ */
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
  loff_t start, loff_t end, get_block_t get_block,
  struct buffer_head *bh)
@@ -117,9 +121,15 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
if (pos == max) {
unsigned blkbits = inode->i_blkbits;
sector_t block = pos >> blkbits;
-   unsigned first = pos - (block << blkbits);
+   long page = pos >> PAGE_SHIFT;
+   unsigned first; /* byte offset into block */
long size;
 
+   /* we can only map entire pages */
+   if (pos & (PAGE_SIZE-1))
+   block = page << (PAGE_SHIFT - blkbits);
+   first = pos - (block << blkbits);
+
if (pos == bh_max) {
bh->b_size = PAGE_ALIGN(end - pos);
bh->b_state = 0;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-07 Thread Jeff Moyer

Wilcox, Matthew R matthew.r.wil...@intel.com writes:

 So ... what you're doing here is if somebody requests the last 512
 bytes, you're asking for the last page.  That's going to work as long
 as the partition is a multiple of PAGE_SIZE, but not if the partition
 happens to be misaligned.  It's an improvement, although I'd be
 tempted to do this as:

   if (pos == max) {
   unsigned blkbits = inode-i_blkbits;
 - sector_t block = pos  blkbits;
 + long page = pos  PAGE_SHIFT;
 + sector_t block = page  (PAGE_SHIFT - blkbits);
   unsigned first = pos - (block  blkbits);
   long size;

Yeah, that's easier to read.

 We need to cope with the case where the end of a partition isn't on a
 page boundary though.

Well, that's usually done by falling back to buffered I/O.  I gave that
a try and panicked the box.  :)  I'll keep looking into it, but probably
won't have another patch until next week.

Cheers,
Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-07 Thread Wilcox, Matthew R

So ... what you're doing here is if somebody requests the last 512 bytes, 
you're asking for the last page.  That's going to work as long as the partition 
is a multiple of PAGE_SIZE, but not if the partition happens to be misaligned.  
It's an improvement, although I'd be tempted to do this as:

if (pos == max) {
unsigned blkbits = inode-i_blkbits;
-   sector_t block = pos  blkbits;
+   long page = pos  PAGE_SHIFT;
+   sector_t block = page  (PAGE_SHIFT - blkbits);
unsigned first = pos - (block  blkbits);
long size;

We need to cope with the case where the end of a partition isn't on a page 
boundary though.

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Thursday, August 06, 2015 2:31 PM
To: Wilcox, Matthew R
Cc: linda.knipp...@hp.com; linux-kernel@vger.kernel.org; 
linux-fsde...@vger.kernel.org
Subject: Re: regression introduced by block: Add support for DAX reads/writes 
to block devices

Wilcox, Matthew R matthew.r.wil...@intel.com writes:

 I think I see the problem.  I'm kind of wrapped up in other things
 right now; can you try replacing the line in dax_io():

 - bh-b_size = PAGE_ALIGN(end - pos);
 + bh-b_size = ALIGN(end - pos, 1  blkbits);

This works for me.  If it looks okay to others, I'll submit a properly
signed-off patch.

Cheers,
Jeff

diff --git a/fs/dax.c b/fs/dax.c
index a7f77e1..b6c4f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -98,6 +98,10 @@ static bool buffer_size_valid(struct buffer_head *bh)
return bh-b_state != 0;
 }
 
+/*
+ * This function assumes file system block size (represented by
+ * inode-i_blkbits) is less than or equal to the system page size.
+ */
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
  loff_t start, loff_t end, get_block_t get_block,
  struct buffer_head *bh)
@@ -117,9 +121,15 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
if (pos == max) {
unsigned blkbits = inode-i_blkbits;
sector_t block = pos  blkbits;
-   unsigned first = pos - (block  blkbits);
+   long page = pos  PAGE_SHIFT;
+   unsigned first; /* byte offset into block */
long size;
 
+   /* we can only map entire pages */
+   if (pos  (PAGE_SIZE-1))
+   block = page  (PAGE_SHIFT - blkbits);
+   first = pos - (block  blkbits);
+
if (pos == bh_max) {
bh-b_size = PAGE_ALIGN(end - pos);
bh-b_state = 0;

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-06 Thread Jeff Moyer

"Wilcox, Matthew R"  writes:

> I think I see the problem.  I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

This works for me.  If it looks okay to others, I'll submit a properly
signed-off patch.

Cheers,
Jeff

diff --git a/fs/dax.c b/fs/dax.c
index a7f77e1..b6c4f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -98,6 +98,10 @@ static bool buffer_size_valid(struct buffer_head *bh)
return bh->b_state != 0;
 }
 
+/*
+ * This function assumes file system block size (represented by
+ * inode->i_blkbits) is less than or equal to the system page size.
+ */
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
  loff_t start, loff_t end, get_block_t get_block,
  struct buffer_head *bh)
@@ -117,9 +121,15 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
if (pos == max) {
unsigned blkbits = inode->i_blkbits;
sector_t block = pos >> blkbits;
-   unsigned first = pos - (block << blkbits);
+   long page = pos >> PAGE_SHIFT;
+   unsigned first; /* byte offset into block */
long size;
 
+   /* we can only map entire pages */
+   if (pos & (PAGE_SIZE-1))
+   block = page << (PAGE_SHIFT - blkbits);
+   first = pos - (block << blkbits);
+
if (pos == bh_max) {
bh->b_size = PAGE_ALIGN(end - pos);
bh->b_state = 0;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-06 Thread Dave Chinner

On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
> On 08/06/2015 06:24 AM, Dave Chinner wrote:
> > On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> >> On 08/05/2015 06:01 PM, Dave Chinner wrote:
> >>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> <>
> 
>  I sat down with Linda to look into it, and the problem is that mkfs.xfs
>  sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>  from the last sector of the device.  This results in dax_io trying to do
>  a page-sized I/O at 512 bytes from the end of the device.
> >>>
> 
> This part I do not understand. how is mkfs.xfs reading the sector?
> Is it through open(/dev/pmem0,...) ? O_DIRECT?

mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
told that it is working on an image file does it fall back to
buffered IO. All of the XFS userspace tools work this way to prevent
page cache pollution issues with read-once or write-once data during
operation.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-06 Thread Wilcox, Matthew R

Yes, that's the result I want.  Fundamentally, I think DAX should be able to 
support devices that are not multiples of PAGE_SIZE in size.

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Thursday, August 06, 2015 8:34 AM
To: Wilcox, Matthew R
Cc: linda.knipp...@hp.com; linux-kernel@vger.kernel.org; 
linux-fsde...@vger.kernel.org
Subject: Re: regression introduced by "block: Add support for DAX reads/writes 
to block devices"

"Wilcox, Matthew R"  writes:

> I think I see the problem.  I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

That's not gonna work either.  :)  You'll end up with -EINVAL since
bdev_direct_access wants the sector to be aligned to a page:

   if (sector % (PAGE_SIZE / 512))
return -EINVAL;

I think you really want to call direct_access with the full page, and
then tease out the part you want up in dax_io, right?  I'll take a crack
at it if you're busy.

Cheers,
Jeff

> -Original Message-
> From: Jeff Moyer [mailto:jmo...@redhat.com] 
> Sent: Wednesday, August 05, 2015 1:19 PM
> To: Wilcox, Matthew R; linda.knipp...@hp.com
> Cc: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
> Subject: regression introduced by "block: Add support for DAX reads/writes to 
> block devices"
>
> Hi, Matthew,
>
> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>
> # mkfs -t xfs -f /dev/pmem0
> meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
>  =   sectsz=512   attr=2, projid32bit=1
>  =   crc=0finobt=0
> data =   bsize=4096   blocks=2097152, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> log  =internal log   bsize=4096   blocks=2560, version=2
>  =   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> mkfs.xfs: read failed: Numerical result out of range
>
> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> from the last sector of the device.  This results in dax_io trying to do
> a page-sized I/O at 512 bytes from the end of the device.
> bdev_direct_access, receiving this bogus pos/size combo, returns
> -ERANGE:
>
>   if ((sector + DIV_ROUND_UP(size, 512)) >
>   part_nr_sects_read(bdev->bd_part))
>   return -ERANGE;
>
> Given that file systems supporting dax refuse to mount with a blocksize
> != page size, I'm guessing this is sort of expected behavior.  However,
> we really shouldn't be breaking direct I/O on pmem devices.
>
> So, what do you want to do?  We could make the pmem device's logical
> block size fixed at the sytem page size.  Or, we could modify the dax
> code to work with blocksize < pagesize.  Or, we could continue using the
> direct I/O codepath for direct block device access.  What do you think?
>
> Thaks,
> Jeff and Linda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-06 Thread Jeff Moyer

"Wilcox, Matthew R"  writes:

> I think I see the problem.  I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

That's not gonna work either.  :)  You'll end up with -EINVAL since
bdev_direct_access wants the sector to be aligned to a page:

   if (sector % (PAGE_SIZE / 512))
return -EINVAL;

I think you really want to call direct_access with the full page, and
then tease out the part you want up in dax_io, right?  I'll take a crack
at it if you're busy.

Cheers,
Jeff

> -Original Message-
> From: Jeff Moyer [mailto:jmo...@redhat.com] 
> Sent: Wednesday, August 05, 2015 1:19 PM
> To: Wilcox, Matthew R; linda.knipp...@hp.com
> Cc: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
> Subject: regression introduced by "block: Add support for DAX reads/writes to 
> block devices"
>
> Hi, Matthew,
>
> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>
> # mkfs -t xfs -f /dev/pmem0
> meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
>  =   sectsz=512   attr=2, projid32bit=1
>  =   crc=0finobt=0
> data =   bsize=4096   blocks=2097152, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> log  =internal log   bsize=4096   blocks=2560, version=2
>  =   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> mkfs.xfs: read failed: Numerical result out of range
>
> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> from the last sector of the device.  This results in dax_io trying to do
> a page-sized I/O at 512 bytes from the end of the device.
> bdev_direct_access, receiving this bogus pos/size combo, returns
> -ERANGE:
>
>   if ((sector + DIV_ROUND_UP(size, 512)) >
>   part_nr_sects_read(bdev->bd_part))
>   return -ERANGE;
>
> Given that file systems supporting dax refuse to mount with a blocksize
> != page size, I'm guessing this is sort of expected behavior.  However,
> we really shouldn't be breaking direct I/O on pmem devices.
>
> So, what do you want to do?  We could make the pmem device's logical
> block size fixed at the sytem page size.  Or, we could modify the dax
> code to work with blocksize < pagesize.  Or, we could continue using the
> direct I/O codepath for direct block device access.  What do you think?
>
> Thaks,
> Jeff and Linda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-06 Thread Wilcox, Matthew R

I think I see the problem.  I'm kind of wrapped up in other things right now; 
can you try replacing the line in dax_io():

-   bh->b_size = PAGE_ALIGN(end - pos);
+   bh->b_size = ALIGN(end - pos, 1 << blkbits);

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Wednesday, August 05, 2015 1:19 PM
To: Wilcox, Matthew R; linda.knipp...@hp.com
Cc: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: regression introduced by "block: Add support for DAX reads/writes to 
block devices"

Hi, Matthew,

Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

# mkfs -t xfs -f /dev/pmem0
meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

I sat down with Linda to look into it, and the problem is that mkfs.xfs
sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
from the last sector of the device.  This results in dax_io trying to do
a page-sized I/O at 512 bytes from the end of the device.
bdev_direct_access, receiving this bogus pos/size combo, returns
-ERANGE:

if ((sector + DIV_ROUND_UP(size, 512)) >
part_nr_sects_read(bdev->bd_part))
return -ERANGE;

Given that file systems supporting dax refuse to mount with a blocksize
!= page size, I'm guessing this is sort of expected behavior.  However,
we really shouldn't be breaking direct I/O on pmem devices.

So, what do you want to do?  We could make the pmem device's logical
block size fixed at the sytem page size.  Or, we could modify the dax
code to work with blocksize < pagesize.  Or, we could continue using the
direct I/O codepath for direct block device access.  What do you think?

Thaks,
Jeff and Linda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-06 Thread Boaz Harrosh

On 08/06/2015 06:24 AM, Dave Chinner wrote:
> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
<>

 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.
>>>

This part I do not understand. how is mkfs.xfs reading the sector?
Is it through open(/dev/pmem0,...) ? O_DIRECT?

If so then yes the inode of /dev/pmem0 is IS_DAX() and will try
to use the dax.c stuff. (I think, which Kernel?)

Which means this is a bug.

>>> Right - we have to be able to do IO to that last sector, so this is
>>> a sanity check to tell if the block dev is large enough. The XFS
>>> kernel code does the same end-of-device sector read when the
>>> filesystem is mounted, too.
>>>
 bdev_direct_access, receiving this bogus pos/size combo, returns
 -ERANGE:

if ((sector + DIV_ROUND_UP(size, 512)) >
part_nr_sects_read(bdev->bd_part))
return -ERANGE;

 Given that file systems supporting dax refuse to mount with a blocksize
 != page size, I'm guessing this is sort of expected behavior.  However,
 we really shouldn't be breaking direct I/O on pmem devices.
>>>

No this is a BUG. read/write buffered/direct to an IS_DAX() inode should
be able to be of any alignment size. Since with DAX buffered/direct is
exact same code path and buffered IO expects any size IO.

This is probably a bug in the DAX handling of the bdev-inode. Let me
test this. I will send a fix ASAP.

<>
>>> the output of:
>>>
>>> /sys/block/pmem0/queue/logical_block_size
>> 512
>>
>>> /sys/block/pmem0/queue/physical_block_size
>> 512
>>

There is a pending fix for this.
Do you need it sent to stable ?

>>> /sys/block/pmem0/queue/hw_sector_size
>> 512
>>
>>> /sys/block/pmem0/queue/minimum_io_size
>> 512
>>
>>> /sys/block/pmem0/queue/optimal_io_size
>> 0

Thanks
Boaz


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-06 Thread Jeff Moyer

Wilcox, Matthew R matthew.r.wil...@intel.com writes:

 I think I see the problem.  I'm kind of wrapped up in other things
 right now; can you try replacing the line in dax_io():

 - bh-b_size = PAGE_ALIGN(end - pos);
 + bh-b_size = ALIGN(end - pos, 1  blkbits);

This works for me.  If it looks okay to others, I'll submit a properly
signed-off patch.

Cheers,
Jeff

diff --git a/fs/dax.c b/fs/dax.c
index a7f77e1..b6c4f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -98,6 +98,10 @@ static bool buffer_size_valid(struct buffer_head *bh)
return bh-b_state != 0;
 }
 
+/*
+ * This function assumes file system block size (represented by
+ * inode-i_blkbits) is less than or equal to the system page size.
+ */
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
  loff_t start, loff_t end, get_block_t get_block,
  struct buffer_head *bh)
@@ -117,9 +121,15 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
if (pos == max) {
unsigned blkbits = inode-i_blkbits;
sector_t block = pos  blkbits;
-   unsigned first = pos - (block  blkbits);
+   long page = pos  PAGE_SHIFT;
+   unsigned first; /* byte offset into block */
long size;
 
+   /* we can only map entire pages */
+   if (pos  (PAGE_SIZE-1))
+   block = page  (PAGE_SHIFT - blkbits);
+   first = pos - (block  blkbits);
+
if (pos == bh_max) {
bh-b_size = PAGE_ALIGN(end - pos);
bh-b_state = 0;

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-06 Thread Dave Chinner

On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
 On 08/06/2015 06:24 AM, Dave Chinner wrote:
  On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
  On 08/05/2015 06:01 PM, Dave Chinner wrote:
  On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
 
 
  I sat down with Linda to look into it, and the problem is that mkfs.xfs
  sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
  from the last sector of the device.  This results in dax_io trying to do
  a page-sized I/O at 512 bytes from the end of the device.
 
 
 This part I do not understand. how is mkfs.xfs reading the sector?
 Is it through open(/dev/pmem0,...) ? O_DIRECT?

mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
told that it is working on an image file does it fall back to
buffered IO. All of the XFS userspace tools work this way to prevent
page cache pollution issues with read-once or write-once data during
operation.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-06 Thread Wilcox, Matthew R

I think I see the problem.  I'm kind of wrapped up in other things right now; 
can you try replacing the line in dax_io():

-   bh-b_size = PAGE_ALIGN(end - pos);
+   bh-b_size = ALIGN(end - pos, 1  blkbits);

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Wednesday, August 05, 2015 1:19 PM
To: Wilcox, Matthew R; linda.knipp...@hp.com
Cc: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
Subject: regression introduced by block: Add support for DAX reads/writes to 
block devices

Hi, Matthew,

Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

# mkfs -t xfs -f /dev/pmem0
meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

I sat down with Linda to look into it, and the problem is that mkfs.xfs
sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
from the last sector of the device.  This results in dax_io trying to do
a page-sized I/O at 512 bytes from the end of the device.
bdev_direct_access, receiving this bogus pos/size combo, returns
-ERANGE:

if ((sector + DIV_ROUND_UP(size, 512)) 
part_nr_sects_read(bdev-bd_part))
return -ERANGE;

Given that file systems supporting dax refuse to mount with a blocksize
!= page size, I'm guessing this is sort of expected behavior.  However,
we really shouldn't be breaking direct I/O on pmem devices.

So, what do you want to do?  We could make the pmem device's logical
block size fixed at the sytem page size.  Or, we could modify the dax
code to work with blocksize  pagesize.  Or, we could continue using the
direct I/O codepath for direct block device access.  What do you think?

Thaks,
Jeff and Linda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-06 Thread Boaz Harrosh

On 08/06/2015 06:24 AM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
 On 08/05/2015 06:01 PM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:


 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.


This part I do not understand. how is mkfs.xfs reading the sector?
Is it through open(/dev/pmem0,...) ? O_DIRECT?

If so then yes the inode of /dev/pmem0 is IS_DAX() and will try
to use the dax.c stuff. (I think, which Kernel?)

Which means this is a bug.

 Right - we have to be able to do IO to that last sector, so this is
 a sanity check to tell if the block dev is large enough. The XFS
 kernel code does the same end-of-device sector read when the
 filesystem is mounted, too.

 bdev_direct_access, receiving this bogus pos/size combo, returns
 -ERANGE:

if ((sector + DIV_ROUND_UP(size, 512)) 
part_nr_sects_read(bdev-bd_part))
return -ERANGE;

 Given that file systems supporting dax refuse to mount with a blocksize
 != page size, I'm guessing this is sort of expected behavior.  However,
 we really shouldn't be breaking direct I/O on pmem devices.


No this is a BUG. read/write buffered/direct to an IS_DAX() inode should
be able to be of any alignment size. Since with DAX buffered/direct is
exact same code path and buffered IO expects any size IO.

This is probably a bug in the DAX handling of the bdev-inode. Let me
test this. I will send a fix ASAP.


 the output of:

 /sys/block/pmem0/queue/logical_block_size
 512

 /sys/block/pmem0/queue/physical_block_size
 512


There is a pending fix for this.
Do you need it sent to stable ?

 /sys/block/pmem0/queue/hw_sector_size
 512

 /sys/block/pmem0/queue/minimum_io_size
 512

 /sys/block/pmem0/queue/optimal_io_size
 0

Thanks
Boaz


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-06 Thread Jeff Moyer

Wilcox, Matthew R matthew.r.wil...@intel.com writes:

 I think I see the problem.  I'm kind of wrapped up in other things
 right now; can you try replacing the line in dax_io():

 - bh-b_size = PAGE_ALIGN(end - pos);
 + bh-b_size = ALIGN(end - pos, 1  blkbits);

That's not gonna work either.  :)  You'll end up with -EINVAL since
bdev_direct_access wants the sector to be aligned to a page:

   if (sector % (PAGE_SIZE / 512))
return -EINVAL;

I think you really want to call direct_access with the full page, and
then tease out the part you want up in dax_io, right?  I'll take a crack
at it if you're busy.

Cheers,
Jeff

 -Original Message-
 From: Jeff Moyer [mailto:jmo...@redhat.com] 
 Sent: Wednesday, August 05, 2015 1:19 PM
 To: Wilcox, Matthew R; linda.knipp...@hp.com
 Cc: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
 Subject: regression introduced by block: Add support for DAX reads/writes to 
 block devices

 Hi, Matthew,

 Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

 # mkfs -t xfs -f /dev/pmem0
 meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
  =   sectsz=512   attr=2, projid32bit=1
  =   crc=0finobt=0
 data =   bsize=4096   blocks=2097152, imaxpct=25
  =   sunit=0  swidth=0 blks
 naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
 log  =internal log   bsize=4096   blocks=2560, version=2
  =   sectsz=512   sunit=0 blks, lazy-count=1
 realtime =none   extsz=4096   blocks=0, rtextents=0
 mkfs.xfs: read failed: Numerical result out of range

 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.
 bdev_direct_access, receiving this bogus pos/size combo, returns
 -ERANGE:

   if ((sector + DIV_ROUND_UP(size, 512)) 
   part_nr_sects_read(bdev-bd_part))
   return -ERANGE;

 Given that file systems supporting dax refuse to mount with a blocksize
 != page size, I'm guessing this is sort of expected behavior.  However,
 we really shouldn't be breaking direct I/O on pmem devices.

 So, what do you want to do?  We could make the pmem device's logical
 block size fixed at the sytem page size.  Or, we could modify the dax
 code to work with blocksize  pagesize.  Or, we could continue using the
 direct I/O codepath for direct block device access.  What do you think?

 Thaks,
 Jeff and Linda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-06 Thread Wilcox, Matthew R

Yes, that's the result I want.  Fundamentally, I think DAX should be able to 
support devices that are not multiples of PAGE_SIZE in size.

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Thursday, August 06, 2015 8:34 AM
To: Wilcox, Matthew R
Cc: linda.knipp...@hp.com; linux-kernel@vger.kernel.org; 
linux-fsde...@vger.kernel.org
Subject: Re: regression introduced by block: Add support for DAX reads/writes 
to block devices

Wilcox, Matthew R matthew.r.wil...@intel.com writes:

 I think I see the problem.  I'm kind of wrapped up in other things
 right now; can you try replacing the line in dax_io():

 - bh-b_size = PAGE_ALIGN(end - pos);
 + bh-b_size = ALIGN(end - pos, 1  blkbits);

That's not gonna work either.  :)  You'll end up with -EINVAL since
bdev_direct_access wants the sector to be aligned to a page:

   if (sector % (PAGE_SIZE / 512))
return -EINVAL;

I think you really want to call direct_access with the full page, and
then tease out the part you want up in dax_io, right?  I'll take a crack
at it if you're busy.

Cheers,
Jeff

 -Original Message-
 From: Jeff Moyer [mailto:jmo...@redhat.com] 
 Sent: Wednesday, August 05, 2015 1:19 PM
 To: Wilcox, Matthew R; linda.knipp...@hp.com
 Cc: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org
 Subject: regression introduced by block: Add support for DAX reads/writes to 
 block devices

 Hi, Matthew,

 Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

 # mkfs -t xfs -f /dev/pmem0
 meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
  =   sectsz=512   attr=2, projid32bit=1
  =   crc=0finobt=0
 data =   bsize=4096   blocks=2097152, imaxpct=25
  =   sunit=0  swidth=0 blks
 naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
 log  =internal log   bsize=4096   blocks=2560, version=2
  =   sectsz=512   sunit=0 blks, lazy-count=1
 realtime =none   extsz=4096   blocks=0, rtextents=0
 mkfs.xfs: read failed: Numerical result out of range

 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.
 bdev_direct_access, receiving this bogus pos/size combo, returns
 -ERANGE:

   if ((sector + DIV_ROUND_UP(size, 512)) 
   part_nr_sects_read(bdev-bd_part))
   return -ERANGE;

 Given that file systems supporting dax refuse to mount with a blocksize
 != page size, I'm guessing this is sort of expected behavior.  However,
 we really shouldn't be breaking direct I/O on pmem devices.

 So, what do you want to do?  We could make the pmem device's logical
 block size fixed at the sytem page size.  Or, we could modify the dax
 code to work with blocksize  pagesize.  Or, we could continue using the
 direct I/O codepath for direct block device access.  What do you think?

 Thaks,
 Jeff and Linda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-05 Thread Dave Chinner

On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> On 08/05/2015 06:01 PM, Dave Chinner wrote:
> > On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> >> Hi, Matthew,
> >>
> >> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
> >>
> >> # mkfs -t xfs -f /dev/pmem0
> >> meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
> >>  =   sectsz=512   attr=2, projid32bit=1
> >>  =   crc=0finobt=0
> >> data =   bsize=4096   blocks=2097152, imaxpct=25
> >>  =   sunit=0  swidth=0 blks
> >> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> >> log  =internal log   bsize=4096   blocks=2560, version=2
> >>  =   sectsz=512   sunit=0 blks, lazy-count=1
> >> realtime =none   extsz=4096   blocks=0, rtextents=0
> >> mkfs.xfs: read failed: Numerical result out of range
> >>
> >> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> >> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> >> from the last sector of the device.  This results in dax_io trying to do
> >> a page-sized I/O at 512 bytes from the end of the device.
> > 
> > Right - we have to be able to do IO to that last sector, so this is
> > a sanity check to tell if the block dev is large enough. The XFS
> > kernel code does the same end-of-device sector read when the
> > filesystem is mounted, too.
> > 
> >> bdev_direct_access, receiving this bogus pos/size combo, returns
> >> -ERANGE:
> >>
> >>if ((sector + DIV_ROUND_UP(size, 512)) >
> >>part_nr_sects_read(bdev->bd_part))
> >>return -ERANGE;
> >>
> >> Given that file systems supporting dax refuse to mount with a blocksize
> >> != page size, I'm guessing this is sort of expected behavior.  However,
> >> we really shouldn't be breaking direct I/O on pmem devices.
> > 
> > If the device is advertising 512 byte sector size support, then this
> > needs to work, especially as DAX is completely transparent on the
> > block device. Remember that DAX through a filesystem works on
> > filesystem data block size boundaries, so a 512 byte sector/4k block
> > size filesystem will be able to use DAX for mmapped files just fine.
> > 
> >> So, what do you want to do?  We could make the pmem device's logical
> >> block size fixed at the sytem page size.  Or, we could modify the dax
> >> code to work with blocksize < pagesize.  Or, we could continue using the
> >> direct I/O codepath for direct block device access.  What do you think?
> > 
> > I don't know how the pmem device sets up it's limits. Can you post
> > the output of:
> > 
> > /sys/block/pmem0/queue/logical_block_size
> 512
> 
> > /sys/block/pmem0/queue/physical_block_size
> 512
> 
> > /sys/block/pmem0/queue/hw_sector_size
> 512
> 
> > /sys/block/pmem0/queue/minimum_io_size
> 512
> 
> > /sys/block/pmem0/queue/optimal_io_size
> 0

Ok, so the pmem device is advertising 512 bytes for both
physical and logical sector sizes. That means mkfs.xfs is not doing
anything wrong. i.e. ERANGE on w read of the last sector of the
block device is a bug in the block device code.

It is not at all obvious from these sector sizes that the block
device is DAX enabled. I'd suggest that you probably want to make
the physical sector size 4k on x86-64 to indicate to filesystem
utilities that 4k alignment of the filesystem is preferred, even if
512 byte IO can be supported in a less efficient manner (i.e.
equivalent of a 512e hard drive)

You can't really make the logical sector size = PAGE_SIZE, because
on 64k page size machines that will make the sector size larger than
many filesystems support. e.g. XFS only supports sector sizes up to
32k at the moment...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-05 Thread Linda Knippers

On 08/05/2015 06:01 PM, Dave Chinner wrote:
> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>> Hi, Matthew,
>>
>> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>>
>> # mkfs -t xfs -f /dev/pmem0
>> meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
>>  =   sectsz=512   attr=2, projid32bit=1
>>  =   crc=0finobt=0
>> data =   bsize=4096   blocks=2097152, imaxpct=25
>>  =   sunit=0  swidth=0 blks
>> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
>> log  =internal log   bsize=4096   blocks=2560, version=2
>>  =   sectsz=512   sunit=0 blks, lazy-count=1
>> realtime =none   extsz=4096   blocks=0, rtextents=0
>> mkfs.xfs: read failed: Numerical result out of range
>>
>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>> from the last sector of the device.  This results in dax_io trying to do
>> a page-sized I/O at 512 bytes from the end of the device.
> 
> Right - we have to be able to do IO to that last sector, so this is
> a sanity check to tell if the block dev is large enough. The XFS
> kernel code does the same end-of-device sector read when the
> filesystem is mounted, too.
> 
>> bdev_direct_access, receiving this bogus pos/size combo, returns
>> -ERANGE:
>>
>>  if ((sector + DIV_ROUND_UP(size, 512)) >
>>  part_nr_sects_read(bdev->bd_part))
>>  return -ERANGE;
>>
>> Given that file systems supporting dax refuse to mount with a blocksize
>> != page size, I'm guessing this is sort of expected behavior.  However,
>> we really shouldn't be breaking direct I/O on pmem devices.
> 
> If the device is advertising 512 byte sector size support, then this
> needs to work, especially as DAX is completely transparent on the
> block device. Remember that DAX through a filesystem works on
> filesystem data block size boundaries, so a 512 byte sector/4k block
> size filesystem will be able to use DAX for mmapped files just fine.
> 
>> So, what do you want to do?  We could make the pmem device's logical
>> block size fixed at the sytem page size.  Or, we could modify the dax
>> code to work with blocksize < pagesize.  Or, we could continue using the
>> direct I/O codepath for direct block device access.  What do you think?
> 
> I don't know how the pmem device sets up it's limits. Can you post
> the output of:
> 
>   /sys/block/pmem0/queue/logical_block_size
512

>   /sys/block/pmem0/queue/physical_block_size
512

>   /sys/block/pmem0/queue/hw_sector_size
512

>   /sys/block/pmem0/queue/minimum_io_size
512

>   /sys/block/pmem0/queue/optimal_io_size
0

Let me know if you need anything else.

-- ljk


> As these all affect how mkfs.xfs configures the filesystem being
> made and so influences the size and alignment of the IO is does
> 
> Cheers,
> 
> Dave.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-05 Thread Dave Chinner

On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> Hi, Matthew,
> 
> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
> 
> # mkfs -t xfs -f /dev/pmem0
> meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
>  =   sectsz=512   attr=2, projid32bit=1
>  =   crc=0finobt=0
> data =   bsize=4096   blocks=2097152, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> log  =internal log   bsize=4096   blocks=2560, version=2
>  =   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> mkfs.xfs: read failed: Numerical result out of range
> 
> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> from the last sector of the device.  This results in dax_io trying to do
> a page-sized I/O at 512 bytes from the end of the device.

Right - we have to be able to do IO to that last sector, so this is
a sanity check to tell if the block dev is large enough. The XFS
kernel code does the same end-of-device sector read when the
filesystem is mounted, too.

> bdev_direct_access, receiving this bogus pos/size combo, returns
> -ERANGE:
> 
>   if ((sector + DIV_ROUND_UP(size, 512)) >
>   part_nr_sects_read(bdev->bd_part))
>   return -ERANGE;
> 
> Given that file systems supporting dax refuse to mount with a blocksize
> != page size, I'm guessing this is sort of expected behavior.  However,
> we really shouldn't be breaking direct I/O on pmem devices.

If the device is advertising 512 byte sector size support, then this
needs to work, especially as DAX is completely transparent on the
block device. Remember that DAX through a filesystem works on
filesystem data block size boundaries, so a 512 byte sector/4k block
size filesystem will be able to use DAX for mmapped files just fine.

> So, what do you want to do?  We could make the pmem device's logical
> block size fixed at the sytem page size.  Or, we could modify the dax
> code to work with blocksize < pagesize.  Or, we could continue using the
> direct I/O codepath for direct block device access.  What do you think?

I don't know how the pmem device sets up it's limits. Can you post
the output of:

/sys/block/pmem0/queue/logical_block_size
/sys/block/pmem0/queue/physical_block_size
/sys/block/pmem0/queue/hw_sector_size
/sys/block/pmem0/queue/minimum_io_size
/sys/block/pmem0/queue/optimal_io_size

As these all affect how mkfs.xfs configures the filesystem being
made and so influences the size and alignment of the IO is does

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-05 Thread Jeff Moyer

Hi, Matthew,

Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

# mkfs -t xfs -f /dev/pmem0
meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

I sat down with Linda to look into it, and the problem is that mkfs.xfs
sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
from the last sector of the device.  This results in dax_io trying to do
a page-sized I/O at 512 bytes from the end of the device.
bdev_direct_access, receiving this bogus pos/size combo, returns
-ERANGE:

if ((sector + DIV_ROUND_UP(size, 512)) >
part_nr_sects_read(bdev->bd_part))
return -ERANGE;

Given that file systems supporting dax refuse to mount with a blocksize
!= page size, I'm guessing this is sort of expected behavior.  However,
we really shouldn't be breaking direct I/O on pmem devices.

So, what do you want to do?  We could make the pmem device's logical
block size fixed at the sytem page size.  Or, we could modify the dax
code to work with blocksize < pagesize.  Or, we could continue using the
direct I/O codepath for direct block device access.  What do you think?

Thaks,
Jeff and Linda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-05 Thread Jeff Moyer

Hi, Matthew,

Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

# mkfs -t xfs -f /dev/pmem0
meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

I sat down with Linda to look into it, and the problem is that mkfs.xfs
sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
from the last sector of the device.  This results in dax_io trying to do
a page-sized I/O at 512 bytes from the end of the device.
bdev_direct_access, receiving this bogus pos/size combo, returns
-ERANGE:

if ((sector + DIV_ROUND_UP(size, 512)) 
part_nr_sects_read(bdev-bd_part))
return -ERANGE;

Given that file systems supporting dax refuse to mount with a blocksize
!= page size, I'm guessing this is sort of expected behavior.  However,
we really shouldn't be breaking direct I/O on pmem devices.

So, what do you want to do?  We could make the pmem device's logical
block size fixed at the sytem page size.  Or, we could modify the dax
code to work with blocksize  pagesize.  Or, we could continue using the
direct I/O codepath for direct block device access.  What do you think?

Thaks,
Jeff and Linda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-05 Thread Dave Chinner

On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
 Hi, Matthew,
 
 Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
 
 # mkfs -t xfs -f /dev/pmem0
 meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
  =   sectsz=512   attr=2, projid32bit=1
  =   crc=0finobt=0
 data =   bsize=4096   blocks=2097152, imaxpct=25
  =   sunit=0  swidth=0 blks
 naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
 log  =internal log   bsize=4096   blocks=2560, version=2
  =   sectsz=512   sunit=0 blks, lazy-count=1
 realtime =none   extsz=4096   blocks=0, rtextents=0
 mkfs.xfs: read failed: Numerical result out of range
 
 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.

Right - we have to be able to do IO to that last sector, so this is
a sanity check to tell if the block dev is large enough. The XFS
kernel code does the same end-of-device sector read when the
filesystem is mounted, too.

 bdev_direct_access, receiving this bogus pos/size combo, returns
 -ERANGE:
 
   if ((sector + DIV_ROUND_UP(size, 512)) 
   part_nr_sects_read(bdev-bd_part))
   return -ERANGE;
 
 Given that file systems supporting dax refuse to mount with a blocksize
 != page size, I'm guessing this is sort of expected behavior.  However,
 we really shouldn't be breaking direct I/O on pmem devices.

If the device is advertising 512 byte sector size support, then this
needs to work, especially as DAX is completely transparent on the
block device. Remember that DAX through a filesystem works on
filesystem data block size boundaries, so a 512 byte sector/4k block
size filesystem will be able to use DAX for mmapped files just fine.

 So, what do you want to do?  We could make the pmem device's logical
 block size fixed at the sytem page size.  Or, we could modify the dax
 code to work with blocksize  pagesize.  Or, we could continue using the
 direct I/O codepath for direct block device access.  What do you think?

I don't know how the pmem device sets up it's limits. Can you post
the output of:

/sys/block/pmem0/queue/logical_block_size
/sys/block/pmem0/queue/physical_block_size
/sys/block/pmem0/queue/hw_sector_size
/sys/block/pmem0/queue/minimum_io_size
/sys/block/pmem0/queue/optimal_io_size

As these all affect how mkfs.xfs configures the filesystem being
made and so influences the size and alignment of the IO is does

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-05 Thread Dave Chinner

On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
 On 08/05/2015 06:01 PM, Dave Chinner wrote:
  On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
  Hi, Matthew,
 
  Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
 
  # mkfs -t xfs -f /dev/pmem0
  meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
   =   sectsz=512   attr=2, projid32bit=1
   =   crc=0finobt=0
  data =   bsize=4096   blocks=2097152, imaxpct=25
   =   sunit=0  swidth=0 blks
  naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
  log  =internal log   bsize=4096   blocks=2560, version=2
   =   sectsz=512   sunit=0 blks, lazy-count=1
  realtime =none   extsz=4096   blocks=0, rtextents=0
  mkfs.xfs: read failed: Numerical result out of range
 
  I sat down with Linda to look into it, and the problem is that mkfs.xfs
  sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
  from the last sector of the device.  This results in dax_io trying to do
  a page-sized I/O at 512 bytes from the end of the device.
  
  Right - we have to be able to do IO to that last sector, so this is
  a sanity check to tell if the block dev is large enough. The XFS
  kernel code does the same end-of-device sector read when the
  filesystem is mounted, too.
  
  bdev_direct_access, receiving this bogus pos/size combo, returns
  -ERANGE:
 
 if ((sector + DIV_ROUND_UP(size, 512)) 
 part_nr_sects_read(bdev-bd_part))
 return -ERANGE;
 
  Given that file systems supporting dax refuse to mount with a blocksize
  != page size, I'm guessing this is sort of expected behavior.  However,
  we really shouldn't be breaking direct I/O on pmem devices.
  
  If the device is advertising 512 byte sector size support, then this
  needs to work, especially as DAX is completely transparent on the
  block device. Remember that DAX through a filesystem works on
  filesystem data block size boundaries, so a 512 byte sector/4k block
  size filesystem will be able to use DAX for mmapped files just fine.
  
  So, what do you want to do?  We could make the pmem device's logical
  block size fixed at the sytem page size.  Or, we could modify the dax
  code to work with blocksize  pagesize.  Or, we could continue using the
  direct I/O codepath for direct block device access.  What do you think?
  
  I don't know how the pmem device sets up it's limits. Can you post
  the output of:
  
  /sys/block/pmem0/queue/logical_block_size
 512
 
  /sys/block/pmem0/queue/physical_block_size
 512
 
  /sys/block/pmem0/queue/hw_sector_size
 512
 
  /sys/block/pmem0/queue/minimum_io_size
 512
 
  /sys/block/pmem0/queue/optimal_io_size
 0

Ok, so the pmem device is advertising 512 bytes for both
physical and logical sector sizes. That means mkfs.xfs is not doing
anything wrong. i.e. ERANGE on w read of the last sector of the
block device is a bug in the block device code.

It is not at all obvious from these sector sizes that the block
device is DAX enabled. I'd suggest that you probably want to make
the physical sector size 4k on x86-64 to indicate to filesystem
utilities that 4k alignment of the filesystem is preferred, even if
512 byte IO can be supported in a less efficient manner (i.e.
equivalent of a 512e hard drive)

You can't really make the logical sector size = PAGE_SIZE, because
on 64k page size machines that will make the sector size larger than
many filesystems support. e.g. XFS only supports sector sizes up to
32k at the moment...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression introduced by block: Add support for DAX reads/writes to block devices

2015-08-05 Thread Linda Knippers

On 08/05/2015 06:01 PM, Dave Chinner wrote:
 On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
 Hi, Matthew,

 Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

 # mkfs -t xfs -f /dev/pmem0
 meta-data=/dev/pmem0 isize=256agcount=4, agsize=524288 blks
  =   sectsz=512   attr=2, projid32bit=1
  =   crc=0finobt=0
 data =   bsize=4096   blocks=2097152, imaxpct=25
  =   sunit=0  swidth=0 blks
 naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
 log  =internal log   bsize=4096   blocks=2560, version=2
  =   sectsz=512   sunit=0 blks, lazy-count=1
 realtime =none   extsz=4096   blocks=0, rtextents=0
 mkfs.xfs: read failed: Numerical result out of range

 I sat down with Linda to look into it, and the problem is that mkfs.xfs
 sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
 from the last sector of the device.  This results in dax_io trying to do
 a page-sized I/O at 512 bytes from the end of the device.
 
 Right - we have to be able to do IO to that last sector, so this is
 a sanity check to tell if the block dev is large enough. The XFS
 kernel code does the same end-of-device sector read when the
 filesystem is mounted, too.
 
 bdev_direct_access, receiving this bogus pos/size combo, returns
 -ERANGE:

  if ((sector + DIV_ROUND_UP(size, 512)) 
  part_nr_sects_read(bdev-bd_part))
  return -ERANGE;

 Given that file systems supporting dax refuse to mount with a blocksize
 != page size, I'm guessing this is sort of expected behavior.  However,
 we really shouldn't be breaking direct I/O on pmem devices.
 
 If the device is advertising 512 byte sector size support, then this
 needs to work, especially as DAX is completely transparent on the
 block device. Remember that DAX through a filesystem works on
 filesystem data block size boundaries, so a 512 byte sector/4k block
 size filesystem will be able to use DAX for mmapped files just fine.
 
 So, what do you want to do?  We could make the pmem device's logical
 block size fixed at the sytem page size.  Or, we could modify the dax
 code to work with blocksize  pagesize.  Or, we could continue using the
 direct I/O codepath for direct block device access.  What do you think?
 
 I don't know how the pmem device sets up it's limits. Can you post
 the output of:
 
   /sys/block/pmem0/queue/logical_block_size
512

   /sys/block/pmem0/queue/physical_block_size
512

   /sys/block/pmem0/queue/hw_sector_size
512

   /sys/block/pmem0/queue/minimum_io_size
512

   /sys/block/pmem0/queue/optimal_io_size
0

Let me know if you need anything else.

-- ljk


 As these all affect how mkfs.xfs configures the filesystem being
 made and so influences the size and alignment of the IO is does
 
 Cheers,
 
 Dave.
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

52 matches

Mail list logo