Re: [developer] DRAID impressions

2022-12-07 Thread Richard Elling
draid uses the raidz allocator, so it is subject to the same allocation
efficiencies as raidz, as described here:
https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
The short answer is: avoid using raidz with a small recordsize or
volblocksize relative to the minimum physically allocatable size of
the disks. Too small is bad and too big is bad; somewhere in the middle is
a reasonable trade-off between performance and space efficiency.
 -- richard
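
To put rough numbers on that trade-off, here is a small standalone sketch of the
parity-plus-padding arithmetic described in the linked post (illustrative C only,
not OpenZFS source; the function and parameter names are made up). It assumes
allocations are rounded up to whole sectors, charged nparity parity sectors per
row of data columns, and then padded up to a multiple of (nparity + 1) sectors:

#include <stdio.h>
#include <stdint.h>

static uint64_t
raidz_asize(uint64_t ndisks, uint64_t nparity, uint64_t ashift, uint64_t psize)
{
	uint64_t sector = 1ULL << ashift;
	uint64_t dcols = ndisks - nparity;                        /* data columns */
	uint64_t dsect = (psize + sector - 1) / sector;           /* data sectors */
	uint64_t psect = nparity * ((dsect + dcols - 1) / dcols); /* parity sectors */
	uint64_t total = dsect + psect;

	/* pad the allocation up to a multiple of (nparity + 1) sectors */
	total = ((total + nparity) / (nparity + 1)) * (nparity + 1);
	return (total * sector);
}

int
main(void)
{
	/* 5-wide 3d+2p with 4K sectors (ashift=12), like the draid groups below */
	for (uint64_t bs = 4096; bs <= 131072; bs *= 2) {
		uint64_t asize = raidz_asize(5, 2, 12, bs);
		printf("blocksize %6llu -> allocated %7llu (%3.0f%% overhead)\n",
		    (unsigned long long)bs, (unsigned long long)asize,
		    100.0 * (asize - bs) / bs);
	}
	return (0);
}

Under those assumptions, a 3d+2p group with ashift=12 allocates 12K for a 4K
block (200% overhead) but only about 69% extra for a 128K block, which is the
shape of the curve the post describes.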



On Tue, Dec 6, 2022 at 6:59 PM Phil Harman  wrote:

> Hi Richard, a very long time indeed!
>
> I was thinking more of the accumulated allocation (not instantaneous IOPS).
>
> The pool was created with both DRAID vdevs from the get-go. We had 32x
> 12TB and 12x 14TB drives to repurpose, and wanted to build two identical
> pools.
>
> Our other two servers have 16x new 18TB. Interestingly, (and more by luck
> than design) the usable space across all four servers is the same.
>
> In this pool, both vdevs use a 3d+2p RAID scheme. But whereas
> draid2:3d:16c:1s-0 has three 3d+2p, plus one spare (16x
> 12TB), draid2:3d:6c:1s-1 has just one 3d+2p, plus one spare (6x 14TB
> drives).
>
> The disappointing thing is the lop-sided allocation,
> such that draid2:3d:6c:1s-1 has 50.6T allocated and 3.96T free (i.e. 92%
> allocated), whereas draid2:3d:16c:1s-0 only has 128T allocated, and 63.1T
> free (i.e. about 50% allocated).
>
> Phil
>
> On Tue, 6 Dec 2022 at 13:10, Richard Elling <
> richard.ell...@richardelling.com> wrote:
>
>> Hi Phil, long time, no see. comments embedded below
>>
>>
>> On Tue, Dec 6, 2022 at 3:35 AM Phil Harman  wrote:
>>
>>> I have a number of "ZFS backup" servers (about 1PB split between four
>>> machines).
>>>
>>> Some of them have 16x 18TB drives, but a couple have a mix of 12TB and
>>> 14TB drives (because that's what we had).
>>>
>>> All are running Ubuntu 20.04 LTS (with a snapshot build) or 22.04 LTS
>>> (bundled).
>>>
>>> We're just doing our first replace (actually swapping in a 16TB drive
>>> for a 12TB because that's all we have).
>>>
>>> root@hqs1:~# zpool version
>>> zfs-2.1.99-784_gae07fc139
>>> zfs-kmod-2.1.99-784_gae07fc139
>>> root@hqs1:~# zpool status
>>>   pool: hqs1p1
>>>  state: DEGRADED
>>> status: One or more devices is currently being resilvered.  The pool will
>>> continue to function, possibly in a degraded state.
>>> action: Wait for the resilver to complete.
>>>   scan: resilver in progress since Mon Dec  5 15:19:31 2022
>>> 177T scanned at 2.49G/s, 175T issued at 2.46G/s, 177T total
>>> 7.95T resilvered, 98.97% done, 00:12:39 to go
>>> config:
>>>
>>> NAME                                        STATE     READ WRITE CKSUM
>>> hqs1p1                                      DEGRADED     0     0     0
>>>   draid2:3d:16c:1s-0                        DEGRADED     0     0     0
>>>     780530e1-d2e4-0040-aa8b-8c7bed75a14a    ONLINE       0     0     0  (resilvering)
>>>     9c4428e8-d16f-3849-97d9-22fc441750dc    ONLINE       0     0     0  (resilvering)
>>>     0e148b1d-69a3-3345-9478-343ecf6b855d    ONLINE       0     0     0  (resilvering)
>>>     98208ffe-4b31-564f-832d-5744c809f163    ONLINE       0     0     0  (resilvering)
>>>     3ac46b0a-9c46-e14f-8137-69227f3a890a    ONLINE       0     0     0  (resilvering)
>>>     44e8f62f-5d49-c345-9c89-ac82926d42b7    ONLINE       0     0     0  (resilvering)
>>>     968dbacd-1d85-0b40-a1fc-977a09ac5aaa    ONLINE       0     0     0  (resilvering)
>>>     e7ca2666-1067-f54c-b723-b464fb0a5fa3    ONLINE       0     0     0  (resilvering)
>>>     318ff075-8860-e84e-8063-f5f57a2d        ONLINE       0     0     0  (resilvering)
>>>     replacing-9                             DEGRADED     0     0     0
>>>       2888151727045752617                   UNAVAIL      0     0     0  was /dev/disk/by-partuuid/85fa9347-8359-4942-a20d-da1f6016ea48
>>>       sdd                                   ONLINE       0     0     0  (resilvering)
>>>     fd69f284-d05d-f145-9bdb-0da8a72bf311    ONLINE       0     0     0  (resilvering)
>>>     f40f997a-33a1-2a4e-bb8d-64223c441f0f    ONLINE       0     0     0  (resilvering)
>>>     dbc35ea9-95d1-bd40-b79e-90d8a37079a6    ONLINE       0     0     0  (resilvering)
>>>     ac62bf3e-517e-a444-ae4f-a784b81cd14c    ONLINE       0     0     0  (resilvering)

Re: [developer] DRAID impressions

2022-12-06 Thread Richard Elling
Hi Phil, long time, no see. comments embedded below


On Tue, Dec 6, 2022 at 3:35 AM Phil Harman  wrote:

> I have a number of "ZFS backup" servers (about 1PB split between four
> machines).
>
> Some of them have 16x 18TB drives, but a couple have a mix of 12TB and
> 14TB drives (because that's what we had).
>
> All are running Ubuntu 20.04 LTS (with a snapshot build) or 22.04 LTS
> (bundled).
>
> We're just doing our first replace (actually swapping in a 16TB drive for
> a 12TB because that's all we have).
>
> root@hqs1:~# zpool version
> zfs-2.1.99-784_gae07fc139
> zfs-kmod-2.1.99-784_gae07fc139
> root@hqs1:~# zpool status
>   pool: hqs1p1
>  state: DEGRADED
> status: One or more devices is currently being resilvered.  The pool will
> continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Mon Dec  5 15:19:31 2022
> 177T scanned at 2.49G/s, 175T issued at 2.46G/s, 177T total
> 7.95T resilvered, 98.97% done, 00:12:39 to go
> config:
>
> NAME                                        STATE     READ WRITE CKSUM
> hqs1p1                                      DEGRADED     0     0     0
>   draid2:3d:16c:1s-0                        DEGRADED     0     0     0
>     780530e1-d2e4-0040-aa8b-8c7bed75a14a    ONLINE       0     0     0  (resilvering)
>     9c4428e8-d16f-3849-97d9-22fc441750dc    ONLINE       0     0     0  (resilvering)
>     0e148b1d-69a3-3345-9478-343ecf6b855d    ONLINE       0     0     0  (resilvering)
>     98208ffe-4b31-564f-832d-5744c809f163    ONLINE       0     0     0  (resilvering)
>     3ac46b0a-9c46-e14f-8137-69227f3a890a    ONLINE       0     0     0  (resilvering)
>     44e8f62f-5d49-c345-9c89-ac82926d42b7    ONLINE       0     0     0  (resilvering)
>     968dbacd-1d85-0b40-a1fc-977a09ac5aaa    ONLINE       0     0     0  (resilvering)
>     e7ca2666-1067-f54c-b723-b464fb0a5fa3    ONLINE       0     0     0  (resilvering)
>     318ff075-8860-e84e-8063-f5f57a2d        ONLINE       0     0     0  (resilvering)
>     replacing-9                             DEGRADED     0     0     0
>       2888151727045752617                   UNAVAIL      0     0     0  was /dev/disk/by-partuuid/85fa9347-8359-4942-a20d-da1f6016ea48
>       sdd                                   ONLINE       0     0     0  (resilvering)
>     fd69f284-d05d-f145-9bdb-0da8a72bf311    ONLINE       0     0     0  (resilvering)
>     f40f997a-33a1-2a4e-bb8d-64223c441f0f    ONLINE       0     0     0  (resilvering)
>     dbc35ea9-95d1-bd40-b79e-90d8a37079a6    ONLINE       0     0     0  (resilvering)
>     ac62bf3e-517e-a444-ae4f-a784b81cd14c    ONLINE       0     0     0  (resilvering)
>     d211031c-54d4-2443-853c-7e5c075b28ab    ONLINE       0     0     0  (resilvering)
>     06ba16e5-05cf-9b45-a267-510bfe98ceb1    ONLINE       0     0     0  (resilvering)
>   draid2:3d:6c:1s-1                         ONLINE       0     0     0
>     be297802-095c-7d43-9132-360627ba8ceb    ONLINE       0     0     0
>     e849981c-7316-cb47-b926-61d444790518    ONLINE       0     0     0
>     bbc6d66d-38e1-c448-9d00-10ba7adcd371    ONLINE       0     0     0
>     9fb44c95-5ea6-2347-ae97-38de283f45bf    ONLINE       0     0     0
>     b212cae5-5068-8740-b120-0618ad459c1f    ONLINE       0     0     0
>     8c771f6b-7d48-e744-9e25-847230fd2fdd    ONLINE       0     0     0
> spares
>   draid2-0-0                                AVAIL
>   draid2-1-0                                AVAIL
>
> errors: No known data errors
> root@hqs1:~#
>
> Here's a snippet of zpool iostat -v 1 ...
>
>                                             capacity     operations     bandwidth
> pool                                      alloc   free   read  write   read  write
> ----------------------------------------  -----  -----  -----  -----  -----  -----
> hqs1p1                                     178T  67.1T  5.68K    259   585M   144M
>   draid2:3d:16c:1s-0                       128T  63.1T  5.66K    259   585M   144M
>     780530e1-d2e4-0040-aa8b-8c7bed75a14a      -      -    208      0  25.4M      0
>     9c4428e8-d16f-3849-97d9-22fc441750dc      -      -   1010      2   102M  23.7K
>     0e148b1d-69a3-3345-9478-343ecf6b855d      -      -    145      0  34.1M  15.8K
>     98208ffe-4b31-564f-832d-5744c809f163      -      -    101      1  28.3M  7.90K
>     3ac46b0a-9c46-e14f-8137-69227f3a890a      -      -    511      0  53.5M      0
>     44e8f62f-5d49-c345-9c89-ac82926d42b7      -      -     12      0  4.82M      0
>     968dbacd-1d85-0b40-a1fc-977a09ac5aaa      -      -     22      0  5.43M  15.8K
>     e7ca2666-1067-f54c-b723-b464fb0a5fa3      -      -    227      2  36.7M  23.7K
>     318ff075-8860-e84e-8063-f5f57a2d          -      -    999      1  83.1M  7.90K
>     replacing-9                                -      -      0    243      0   144M
>       2888151727045752617                      -      -      0      0      0      0
>       sdd                                      -      -      0    243      0   144M
> 

Re: [developer] What vdev is picked as leading in mirror resilver proces?

2021-12-23 Thread Richard Elling
In looking at your pastebin...
Scenario 1 deliberately causes data loss as a result of the forced import
[*]. No surprise there.
Scenario 2 also works as designed because you're forcing the data loss on
vda
Scenario 2 again causes data loss because of the forced import. Again, no
surprise there.
Scenario 3 I don't follow your process, can you share the full procedure?
Scenario 4 I don't understand why you create a pool named blacksite and
then import -f blacksite.
   The pool should already be imported. While it is possible to
import multiple pools with
   the same name, each pool needs to have a different guid.
Otherwise, there might be a
   bug lurking here.

[*] by forcing an import in read/write mode you've effectively forked the
pool into two pools,
each with their own timeline of changes -- but without changing the pool
guid. It will not be
possible to re-integrate the datasets on these two pools using zpool mirror
attach/detach.

Methinks you might be trying to do the old "split mirror for offsite
disaster recovery" plan
for which you should consider using `zpool split`
 -- richard


On Mon, Dec 20, 2021 at 11:06 PM openzfs via openzfs-developer <
developer@lists.open-zfs.org> wrote:

> Thanks Richard for explaining this. I have done a couple of tests some
> time ago (you can find these in this pastebin: https://8n1.org/20024/2eb4).
> Results from these tests were that even for vdevs that have newer data (and
> other data for that matter) than other vdevs, it does not guarantee that
> vdev will be picked as the source. And in a  2-way mirror there is only one
> vdev with the latest data / checksums. So, still not sure why this happens.
>

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T7209b2fe172e98f1-M2439279d1c81e6c12c853d7f
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] What vdev is picked as leading in mirror resilver proces?

2021-12-20 Thread Richard Elling
First, resilvering is done at the dataset and snapshot layer (DSL) and not
the vdev layer.
Each txg commit has a monotonically increasing counter. So the dataset
knows what data
is written when. The resilver begins temporally at the oldest common time
(as determined by the
txg commit in the vdev's label) and progresses to the present. This is
commonly called "time
based resilvering" and is a good optimization for providing increased
reliability.

The source of the data (eg which vdev in a 3-way mirror) is determined by
the normal spreading
logic with the requirement that the blocks being resilvered have correct
checksums. So, unlike some
brain-dead RAID implementations (eg md-raid), there is no ambiguity about
where the correct
data lives.
 -- richard
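
A tiny sketch of the temporal part of that (conceptual C only, not OpenZFS
source; every name here is hypothetical): blocks carry the txg in which they
were born, and only the ones born after the returning vdev's last recorded txg
need to be repaired.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct blk {
	uint64_t id;
	uint64_t birth_txg;	/* txg in which the block was written */
};

static int
needs_resilver(const struct blk *b, uint64_t vdev_last_txg)
{
	/* anything written while the vdev was absent must be rewritten */
	return (b->birth_txg > vdev_last_txg);
}

int
main(void)
{
	struct blk blocks[] = {
		{ .id = 1, .birth_txg = 100 },
		{ .id = 2, .birth_txg = 205 },
		{ .id = 3, .birth_txg = 310 },
	};
	uint64_t vdev_last_txg = 200;	/* from the returning vdev's label */

	for (size_t i = 0; i < sizeof (blocks) / sizeof (blocks[0]); i++) {
		if (needs_resilver(&blocks[i], vdev_last_txg))
			printf("repair block %llu (born in txg %llu)\n",
			    (unsigned long long)blocks[i].id,
			    (unsigned long long)blocks[i].birth_txg);
	}
	return (0);
}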


On Mon, Dec 20, 2021 at 3:23 AM openzfs via openzfs-developer <
developer@lists.open-zfs.org> wrote:

> I have a question regarding ZFS mirror resilvering. Consider a mirror with
> two vdevs: disk A and B. Disk B is hot-removed. Data is written to mirror
> (only written to disk A). The host is powered off. Disk B is cold-attached
> to the host and booted again. Zpool is imported. What vdev of the mirror is
> leading in the resilver process? i.e. what data will prevail? Is this
> deterministic? ZFS version: zfs-0.8.3 (Ubuntu 20.04).
>

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T7209b2fe172e98f1-M9287fdd0fc18a6b32792194e
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] ZFS Guide links on Wikipedia are broken

2020-07-21 Thread Richard Elling
We took a look at those docs about a year or two ago. Many are very outdated, 
even as they relate to Solaris.  It isn’t clear to me if they are a good 
starting point for newer docs. Any other opinions?

  -- richard



> On Jul 21, 2020, at 7:58 PM, Oskar Sharipov  wrote:
> 
> Hello ZFS devs,
> 
> I had registered on openzfs.org/wiki but that doesn't give an
> opportunity to edit a page, I viewed possible links to get required
> rights or make editors informed but I couldn't have found any other ways
> to mark links on Wiki as broken. So I'm writing here.
> 
> At https://openzfs.org/wiki/System_Administration there are three
> elements:
> * ZFS Best Practices Guide
> * ZFS Configuration Guide
> * ZFS Evil Tuning Guide
> and all of them are hosted on solarisinternals.com which is down. I
> found copies on Internet Archive and articles appeared useful so here
> are links Wiki editors could use[1] instead:
> 
> * Applicable to most platforms, but maybe outdated:
> ** 
> [https://web.archive.org/web/20140907010548/http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
>  ZFS Best Practices Guide]
> ** 
> [https://web.archive.org/web/20150523190059/http://www.solarisinternals.com/wiki/index.php/ZFS_Configuration_Guide
>  ZFS Configuration Guide]
> ** 
> [https://web.archive.org/web/20140802153903/http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
>  ZFS Evil Tuning Guide]
> 
> [1] https://openzfs.org/w/index.php?title=System_Administration&action=edit
> 
> --
> Oskar Sharipov
> site (might be unpaid and cancelled): oskarsh.ru
> e-mail (replace asterisk with dot): oskarsh at riseup * net
> secondary e-mail (same): oskar * sharipov at tutanota * org
> gpg fingerprint: BAC3 F049 748A D098 A144  BA89 0DC4 EA75 714C 75B5

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T3d54b7d46c59aa48-M1560cbf02cd737689e42d26a
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] ZFS atomics, 64-bit on 32-bit platforms

2019-10-11 Thread Richard Elling


> On Oct 11, 2019, at 4:47 PM, Garrett D'Amore  wrote:
> 
> 
>> On 10/11/2019 4:32:11 PM, Richard Elling  
>> wrote:
>> 
>> 
>> 
>>> On Oct 11, 2019, at 2:50 PM, Garrett D'Amore >> <mailto:garr...@damore.org>> wrote:
>>> 
>>> The issue is that you can't just arbitrarily throw a mutex out there -- you 
>>> have to have a place to *store* that, and you can't fit it inside the 
>>> 64-bit value.  With a 64-bit ISA this isn't usually a problem, but with 
>>> 32-bit ISAs it is.
>> 
>> I'm not sure how this affects the compiler builtin atomics since they don't 
>> add mutexes.
> Garrett D'Amore: 
> 
> You're missing the point.  If you have a 32-bit ISA that doesn't offer a 
> 64-bit atomic operation, then you have to fabricate one.  Fabricating one 
> requires a mutex, spinlock, or some other value. 

Or a 64-bit memory barrier. No need for 64-bit ALU.
 -- richard
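
For reference, here is a minimal sketch of the builtin route being discussed (an
illustrative example, not ZFS code). On a 32-bit target that lacks native 64-bit
atomics, GCC and Clang typically lower these builtins to calls into libatomic or
an equivalent runtime helper, which is the fabricate-it-in-software path Garrett
describes:

#include <stdint.h>
#include <stdio.h>

static uint64_t counter;

static inline void
counter_add(uint64_t val)
{
	(void) __atomic_add_fetch(&counter, val, __ATOMIC_RELAXED);
}

static inline uint64_t
counter_load(void)
{
	/* a plain 64-bit load could tear on a 32-bit ISA; the builtin cannot */
	return (__atomic_load_n(&counter, __ATOMIC_RELAXED));
}

int
main(void)
{
	counter_add(42);
	printf("%llu\n", (unsigned long long)counter_load());
	return (0);
}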

> 
> What this means is that compiler builtins can't be used to solve the problem 
> of supporting a 64-bit atomic when the underlying platform lacks support. So 
> you have to solve it in software typically, which is what I think this whole 
> discussion is about.
> 
>  - Garrett
> 
>>  -- richard
>> 
>>> 
>>> The only way to store the mutex (which could just be a spinlock) is to have 
>>> some other place that has it -- typically in library code.  Allocation of 
>>> other objects like that normally falls outside the scope of a compiler 
>>> builtin (modulo bringing in a separate runtime object file, which can work 
>>> for user programs but generally not for kernels.)
>>>> On 10/11/2019 11:38:57 AM, Richard Elling 
>>>> >>> <mailto:richard.ell...@richardelling.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Oct 9, 2019, at 11:41 AM, Garrett D'Amore >>>> <mailto:garr...@damore.org>> wrote:
>>>>> 
>>>>> I don't think 32-bit compilers generally offer builtins for 64-bit 
>>>>> atomics.  Frankly, they can't really unless the underlying ISA provides 
>>>>> some additional support for this in particular.
>>>> 
>>>> Yes, that is why the builtins exist... the underlying ISA may have a 
>>>> method that is not part of the C language.
>>>> 
>>>> Worst case, mutex protection will work... slowly.
>>>>  -- richard
>>>> 
>>>>> 
>>>>> To implement a 64-bit atomic on a 32-bit architecture, you generally 
>>>>> needs some additional state somewhere -- typically some sort of mutex or 
>>>>> spinlock.  That has to live somewhere.  (You *might* be able to have a 
>>>>> compiler builtin that provides this along with a compiler runtime which 
>>>>> provides an instance of the spinlock somewhere in the program's data 
>>>>> section.  I think this sort of "builtin" (which isn't really builtin at 
>>>>> all) generally can't be used in operating system kernels -- e.g. with 
>>>>> --freestanding.)
>>>>>> On 10/9/2019 11:26:40 AM, Richard Elling 
>>>>>> >>>>> <mailto:richard.ell...@richardelling.com>> wrote:
>>>>>> 
>>>>>> If it is possible to specify a compiler version, it might be easier to 
>>>>>> use the compiler 
>>>>>> builtin atomics. Just sayin' 
>>>>>> -- richard 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> openzfs: openzfs-developer 
>>>>>> Permalink: 
>>>>>> https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-Mabd6346845b79e16d16f57c2
>>>>>>  
>>>>>> <https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-Mabd6346845b79e16d16f57c2>
>>>>>> Delivery options: 
>>>>>> https://openzfs.topicbox.com/groups/developer/subscription 
>>>>>> <https://openzfs.topicbox.com/groups/developer/subscription>
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-M5de9d8b8d3f1acdecbed9ddc
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] ZFS atomics, 64-bit on 32-bit platforms

2019-10-11 Thread Richard Elling


> On Oct 11, 2019, at 2:50 PM, Garrett D'Amore  wrote:
> 
> The issue is that you can't just arbitrarily throw a mutex out there -- you 
> have to have a place to *store* that, and you can't fit it inside the 64-bit 
> value.  With a 64-bit ISA this isn't usually a problem, but with 32-bit ISAs 
> it is.

I'm not sure how this affects the compiler builtin atomics since they don't add 
mutexes.
 -- richard

> 
> The only way to store the mutex (which could just be a spinlock) is to have 
> some other place that has it -- typically in library code.  Allocation of 
> other objects like that normally falls outside the scope of a compiler 
> builtin (modulo bringing in a separate runtime object file, which can work 
> for user programs but generally not for kernels.)
>> On 10/11/2019 11:38:57 AM, Richard Elling  
>> wrote:
>> 
>> 
>> 
>>> On Oct 9, 2019, at 11:41 AM, Garrett D'Amore >> <mailto:garr...@damore.org>> wrote:
>>> 
>>> I don't think 32-bit compilers generally offer builtins for 64-bit atomics. 
>>>  Frankly, they can't really unless the underlying ISA provides some 
>>> additional support for this in particular.
>> 
>> Yes, that is why the builtins exist... the underlying ISA may have a method 
>> that is not part of the C language.
>> 
>> Worst case, mutex protection will work... slowly.
>>  -- richard
>> 
>>> 
>>> To implement a 64-bit atomic on a 32-bit architecture, you generally needs 
>>> some additional state somewhere -- typically some sort of mutex or 
>>> spinlock.  That has to live somewhere.  (You *might* be able to have a 
>>> compiler builtin that provides this along with a compiler runtime which 
>>> provides an instance of the spinlock somewhere in the program's data 
>>> section.  I think this sort of "builtin" (which isn't really builtin at 
>>> all) generally can't be used in operating system kernels -- e.g. with 
>>> --freestanding.)
>>>> On 10/9/2019 11:26:40 AM, Richard Elling >>> <mailto:richard.ell...@richardelling.com>> wrote:
>>>> 
>>>> If it is possible to specify a compiler version, it might be easier to use 
>>>> the compiler 
>>>> builtin atomics. Just sayin' 
>>>> -- richard 
>>>> 
>>>> 
>>>> -- 
>>>> openzfs: openzfs-developer 
>>>> Permalink: 
>>>> https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-Mabd6346845b79e16d16f57c2
>>>>  
>>>> <https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-Mabd6346845b79e16d16f57c2>
>>>> Delivery options: 
>>>> https://openzfs.topicbox.com/groups/developer/subscription 
>>>> <https://openzfs.topicbox.com/groups/developer/subscription>
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-M51858eebfe98f256ff2d7f42
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] ZFS atomics, 64-bit on 32-bit platforms

2019-10-11 Thread Richard Elling


> On Oct 9, 2019, at 11:41 AM, Garrett D'Amore  wrote:
> 
> I don't think 32-bit compilers generally offer builtins for 64-bit atomics.  
> Frankly, they can't really unless the underlying ISA provides some additional 
> support for this in particular.

Yes, that is why the builtins exist... the underlying ISA may have a method 
that is not part of the C language.

Worst case, mutex protection will work... slowly.
 -- richard

> 
> To implement a 64-bit atomic on a 32-bit architecture, you generally needs 
> some additional state somewhere -- typically some sort of mutex or spinlock.  
> That has to live somewhere.  (You *might* be able to have a compiler builtin 
> that provides this along with a compiler runtime which provides an instance 
> of the spinlock somewhere in the program's data section.  I think this sort 
> of "builtin" (which isn't really builtin at all) generally can't be used in 
> operating system kernels -- e.g. with --freestanding.)
>> On 10/9/2019 11:26:40 AM, Richard Elling  
>> wrote:
>> 
>> If it is possible to specify a compiler version, it might be easier to use 
>> the compiler
>> builtin atomics. Just sayin'
>> -- richard
>> 
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-M0e1652fa334ab713cd3c6235
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] ZFS atomics, 64-bit on 32-bit platforms

2019-10-09 Thread Richard Elling
If it is possible to specify a compiler version, it might be easier to use the 
compiler
builtin atomics.  Just sayin'
 -- richard


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T3ee8a81d5f09f2ec-Mabd6346845b79e16d16f57c2
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] ZFS atomics, 64-bit on 32-bit platforms

2019-10-03 Thread Richard Elling
isn't this addressed by the compiler builtins for atomics?
 -- richard


> On Oct 3, 2019, at 10:38 AM, Matthew Ahrens  wrote:
> 
> On Wed, Oct 2, 2019 at 7:51 AM Andriy Gapon  > wrote:
> 
> This is a work in progress report for my work on safer use of atomics in ZFS
> that was discussed in the August developer meeting.
> 
> I took the approach suggested at the meeting of creating a new type that is to
> be used for all objects that should be modified atomically.
> The type is trivial:
> typedef struct zatomic64 {
> volatile uint64_t val;
> } zatomic64_t;
> 
> I also made a decision upfront to use a wrapper type only for 64-bit objects 
> as
> 32-bit objects should be already safe on all supported platforms where a 
> native
> integer size is either 32 or 64 bit.
> 
> Then, as suggested, I wrote trivial wrappers for all 64-bit atomic operations
> that are used in ZFS code.  For example:
>   static inline void
>   zatomic_add_64(zatomic64_t *a, uint64_t val)
>   {
>   atomic_add_64(&a->val, val);
>   }
> 
> A side note on naming.  I simply prepended original function names with 'z'.
> Although names like zatomic64_add() could be prettier and more aligned with 
> the
> type name.
> 
> After that I added three new functions:
> - zatomic_load_64 -- safe (with respect to torn values) read of a 64-bit 
> atomic
> - zatomic_store_64 -- safe write of a 64-bit atomic
> - zatomic_init_64 -- unsafe write of a 64-bit atomic
> 
> The last function just sets zatomic64_t.val and it is supposed to be used when
> it is known that there are no concurrent accesses.  It's pretty redundant the
> field can be set directly or the whole object can be initialized with x = { 0 
> },
> but I decided to add it for symmetry.
> 
> The first two functions are implemented like this:
>   static inline uint64_t
>   zatomic_load_64(zatomic64_t *a)
>   {
>   #ifdef __LP64__
>   return (a->val);
>   #else
>   return (zatomic_add_64_nv(a, 0));
>   #endif
>   }
> 
>   static inline void
>   zatomic_store_64(zatomic64_t *a, uint64_t val)
>   {
>   #ifdef __LP64__
>   a->val = val;
>   #else
>   (void)zatomic_swap_64(a, val);
>   #endif
>   }
> 
> I am not sure if this was a right way to do it.
> Maybe there should be a standard implementation only for 64-bit platforms and
> each 32-bit platform should do its own thing?
> For example, there is more than one way to get atomic 64-bit loads on x86,
> according to this:
> https://stackoverflow.com/questions/48046591/how-do-i-atomically-move-a-64bit-value-in-x86-asm
>  
> 
> 
> 
> What you did above seems fine to me.
>  
> Anyway, I started converting ZFS (/FreeBSD) code to use the new type.
> Immediately, I got some problems and some questions.
> 
> First, I am quite hesitant about what to do with kstat-s.
> I wanted to confine the change to ZFS, but kstat is an external thing (at 
> least
> on illumos).  For now I decided to leave the kstat-s alone.  Especially as I 
> was
> not going to change any code that reads the kstat-s.
> But this also means that some things like arc_c and arc_p, which are aliases 
> to
> kstat value, remain potentially unsafe with respect to torn reads.
> 
> Yeah, we would have to convert these kstats to have callbacks that do the 
> atomic reads.
>  
> 
> I have not converted atomics in the aggsum code yet.
> I am actually a little bit confused about why as_lower_bound and 
> as_upper_bound
> are manipulated with atomics.  All manipulations seems to be done under 
> as_lock.
> Maybe I overlooked something...
> 
> The comment in aggsum_flush_bucket() has the (perhaps incorrect) 
> justification:
> 
>   /*
>* We use atomic instructions for this because we read the upper and
>* lower bounds without the lock, so we need stores to be atomic.
>*/
>   atomic_add_64((volatile uint64_t *)&asb->as_lower_bound, asb->asc_delta);
> 
> So the assumption is that as long as we do an atomic store, we can read with 
> a normal read.  The lock doesn't help because the readers don't hold the lock.
>  
> 
> I converted refcount.h.
> That had a bit of unexpected effect on rrwlock_t.
> zfs_refcount_t is used for rr_anon_rcount and rr_linked_rcount fields.
> The fields are always accessed rr_lock, so their atomicity is not actually
> required.  I guess that zfs_refcount_t is used for the support of references
> with ownership. 
> 
> That's right.
>  
> So, some bits of the rrwlock code access those fields through
> the refcount interface, but other (conditional) pieces directly access 
> rc_count
> as a simple integer.
> So, there are a couple of options here.  The direct accesses can be replaced
> with refcount accesses, but then we get unnecessary atomic operations in the
> fast paths. 
> 
> That would be true only on 32-bit platforms, right?  I don't think a small 
> performance hit to 32-bit kernels is 

Re: [developer] Pathway to better DDT, and value-for-effort assessment of mitigations in the meantime.

2019-07-09 Thread Richard Elling


> On Jul 7, 2019, at 11:09 AM, Stilez  wrote:
> 
> I feel that, while that's true and valid, it also kind of misses the point?
> 
> What I'm wondering is, are there simple enhancements that would be beneficial 
> in that area, or provide useful internal data. 
> 
> It seems plausible that if a configurable in space map block size helps, 
> perhaps a configurable DDT block size could as well, and that if DDT contents 
> are needed on every file load/save for a deduped pool, then a way to preload 
> them could be beneficial in the same.way as a way to preload spacemap data.

Both spacemaps and DDT are AVL trees. But there is one DDT vs hundreds (or 
more) of spacemaps.
But spacemaps are only needed for writes, so if we aren't allocating space from 
a metaslab, the
spacemap for that metaslab can be evicted from RAM. Or, to look at it another 
way, spacemaps are
constrained to space (LBA range) but DDT covers the whole pool.

> 
> Clearly allocation devices will help, but gains are always layered. "Faster 
> storage will fix it" isnt really an answer, any more than its an answer for 
> any greatly bottlenecked critical pathway in  a file system - and for deduped 
> pools, access to DDT records is *the* critical element, no user data can be 
> read from disk or put to txgs without them. Fundamentally there are a few 
> useful configurables available for spacemaps that could be potential wins if 
> also available for DDTs.  Since analogs for these settings already exist in 
> the code, perhaps no great work is involved in them, and perhaps they would 
> give cheap but significant IO gains for pools with dedup enabled.

In the past, some folks have proposed a different structure for DDT, such as 
using bloom filters or
other fast-lookup techniques. But as Garrett points out, the real wins are hard 
to come by. Meanwhile,
PRs are welcome.
 -- richard

> 
> Hence I think the question is valid, and remains valid both before allocation 
> classes and after them, and might be worth considering deeper.
> 
> 
> 
> On 7 July 2019 16:03:59 Richard Elling  
> wrote:
> 
>> Yes, from the ZoL zpool man page:
>> A device dedicated solely for deduplication tables.
>> 
>>   -- richard
>> 
>> 
>> 
>> On Jul 7, 2019, at 5:41 AM, Stilez > <mailto:stil...@gmail.com>> wrote:
>> 
>>> "Dedup special class"?
>>> 
>>> On 6 July 2019 16:24:27 Richard Elling >> <mailto:richard.ell...@richardelling.com>> wrote:
>>> 
>>>> 
>>>> On Jul 5, 2019, at 9:11 PM, Stilez >>> <mailto:stil...@gmail.com>> wrote:
>>>> 
>>>>> I'm one of many end-users with highly dedupable pools held back by DDT 
>>>>> and spacemap RW inefficiencies. There's been discussion and presentations 
>>>>> - Matt Ahrens' talk at BSDCan 2016 ("Dedup doesn't have to suck") was 
>>>>> especially useful, and allocation classes from the ZoL/ZoF work will 
>>>>> allow metadata-specific offload to SSD. But broad discussion of this 
>>>>> general area is not on the roadmap atm, probably bc so much else is a 
>>>>> priority and seems nobody's stepped up.
>>>> 
>>>> In part because dedup will always be slower than non-dedup while the cost 
>>>> of storage continues to plummet (flash SSDs down 40% in the past year and 
>>>> there is currently an oversupply of NAND). A good starting point for 
>>>> experiments is to use the dedup special class and report back to the 
>>>> community how well it works for you.
>>>> 
>>>>   -- richard
>>>> 
> 
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Td9c7189186fd24f2-M3d9e7f78fbe5012e53b176a8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Pathway to better DDT, and value-for-effort assessment of mitigations in the meantime.

2019-07-07 Thread Richard Elling
Yes, from the ZoL zpool man page:
A device dedicated solely for deduplication tables.

  -- richard



> On Jul 7, 2019, at 5:41 AM, Stilez  wrote:
> 
> "Dedup special class"?
> 
>> On 6 July 2019 16:24:27 Richard Elling  
>> wrote:
>> 
>> 
>>> On Jul 5, 2019, at 9:11 PM, Stilez  wrote:
>>> 
>>> I'm one of many end-users with highly dedupable pools held back by DDT and 
>>> spacemap RW inefficiencies. There's been discussion and presentations - 
>>> Matt Ahrens' talk at BSDCan 2016 ("Dedup doesn't have to suck") was 
>>> especially useful, and allocation classes from the ZoL/ZoF work will allow 
>>> metadata-specific offload to SSD. But broad discussion of this general area 
>>> is not on the roadmap atm, probably bc so much else is a priority and seems 
>>> nobody's stepped up.
>> 
>> In part because dedup will always be slower than non-dedup while the cost of 
>> storage continues to plummet (flash SSDs down 40% in the past year and there 
>> is currently an oversupply of NAND). A good starting point for experiments 
>> is to use the dedup special class and report back to the community how well 
>> it works for you.
>> 
>>   -- richard
>> 
> 
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Td9c7189186fd24f2-Mf90f2c92931dd028277934cd
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Pathway to better DDT, and value-for-effort assessment of mitigations in the meantime.

2019-07-06 Thread Richard Elling
> On Jul 5, 2019, at 9:11 PM, Stilez  wrote:
> 
> I'm one of many end-users with highly dedupable pools held back by DDT and 
> spacemap RW inefficiencies. There's been discussion and presentations - Matt 
> Ahrens' talk at BSDCan 2016 ("Dedup doesn't have to suck") was especially 
> useful, and allocation classes from the ZoL/ZoF work will allow 
> metadata-specific offload to SSD. But broad discussion of this general area 
> is not on the roadmap atm, probably bc so much else is a priority and seems 
> nobody's stepped up.

In part because dedup will always be slower than non-dedup while the cost of 
storage continues to plummet (flash SSDs down 40% in the past year and there is 
currently an oversupply of NAND). A good starting point for experiments is to 
use the dedup special class and report back to the community how well it works 
for you.

  -- richard


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Td9c7189186fd24f2-Mab4823ef7eaa494c26bfa3d8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Re: Using many zpools

2019-07-03 Thread Richard Elling


> On Jul 2, 2019, at 5:48 AM, nagy.att...@gmail.com wrote:
> 
> Glad to hear that! :)
> I'll try to be more verbose then.
> For example I have a machine with 44*4T SATA disks. Each of these disks have 
> a zpool on them, so I have 44 zpools on the machine (with one zfs on each 
> zpool).
> I put files onto these zfs/zpools into hashed directories.
> On file numbers/sizes: one of the zpools currently have:
> 75,504,450 files and df says it has 2.2TiB used, so here the average file 
> size is 31k.
> File serving is done on HTTP.
> Nothing special happens here on the ZFS side, the only difference is that I 
> have single disk zpools.
> 
> And that's why I have some questions on this, I couldn't really find 
> literature on this topic.
> 
> My biggest issue now is just the import of these zpools consume 50+ GiB 
> kernel memory (out of 64 on this machine), before anything could touch the 
> disks (so it's not ARC). And these seem to be something which is/could not 
> shrunk by the kernel (like the ARC, which can dynamically change its size).

This is strange. What does free(1) say about the distribution of memory?
 -- richard

> Therefore if ARC or other kernel (memory) users consume more memory, it 
> quickly turns into a deadlock, the kernel starts to kill userspace processes 
> to the point where it becomes unusable.
> 
> And here come the questions in the original post, from which I'm trying to 
> understand what happens here, why importing a 4T zpool (with the above fill 
> ratios) takes 1-1.5GiB kernel space.
> And is it because of the files on it (so if I would have a 44x that size 
> zpool it would be the same, which I couldn't observe on other machines which 
> have only one zpool) or some kind of per zpool overhead (and if so, what 
> affects that, what could I do to lower that need)?
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T10533b84f9e1cfc5-M3364c2e4eb10be9b68bc1580
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Richard Elling


> On Jun 7, 2019, at 12:15 PM, Mike Gerdts  wrote:
> 
> On Fri, Jun 7, 2019 at 12:03 PM Matthew Ahrens  > wrote:
> On Thu, Jun 6, 2019 at 10:56 PM Mike Gerdts  > wrote:
> I'm motivated to make zfs set refreservation=auto do the right thing in the 
> face of raidz and 4k physical blocks, but have data points that provide 
> inconsistent data.  Experimentation shows raidz2 parity overhead that matches 
> my expectations for raidz1.
> 
> Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12.
> 
> In the spreadsheet 
> 
>  from Matt's How I Learned to Stop Worrying and Love RAIDZ 
> 
>  blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the 
> parity and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k 
> or 8k should both end up taking up 30 gig of space.
> 
> That makes sense to me as well.
>  
> 
> Experimentation tells me that they each use just a little bit more than 
> double the amount that was calculated by refreservation=auto.  In each of 
> these cases, compression=off and I've overwritten them with `dd if=/dev/zero 
> ...`
> 
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk0
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk0  used   21.4G  -
> zones/mg/disk0  referenced 21.4G  -
> zones/mg/disk0  logicalused10.0G  -
> zones/mg/disk0  logicalreferenced  10.0G  -
> zones/mg/disk0  volblocksize   8K default
> zones/mg/disk0  refreservation 10.3G  local
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk1
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk1  used   21.4G  -
> zones/mg/disk1  referenced 21.4G  -
> zones/mg/disk1  logicalused10.0G  -
> zones/mg/disk1  logicalreferenced  10.0G  -
> zones/mg/disk1  volblocksize   4K -
> zones/mg/disk1  refreservation 10.6G  local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
> 
> NAME   STATE READ WRITE CKSUM
> zones  ONLINE   0 0 0
>   raidz2-0 ONLINE   0 0 0
> c0t55CD2E404C314E1Ed0  ONLINE   0 0 0
> c0t55CD2E404C314E85d0  ONLINE   0 0 0
> c0t55CD2E404C315450d0  ONLINE   0 0 0
> c0t55CD2E404C31554Ad0  ONLINE   0 0 0
> c0t55CD2E404C315BB6d0  ONLINE   0 0 0
> c0t55CD2E404C315BCDd0  ONLINE   0 0 0
> c0t55CD2E404C315BFDd0  ONLINE   0 0 0
> c0t55CD2E404C317724d0  ONLINE   0 0 0
> # echo ::spa -c | mdb -k | grep ashift | sort -u
> ashift=000c
> 
> Overwriting from /dev/urandom didn't change the above numbers in any 
> significant way.
> 
> My understanding is that each volblocksize block has data and parity spread 
> across a minimum of 3 devices so that any two could be lost and still 
> recover.  Considering the simple case of volblocksize=4k and ashift=12, 200% 
> overhead for parity (+ no pad) seems spot-on. 
> 
> That's right.  And in the case of volblocksize=8K, you have 2 data + 2 parity 
> + 2 pad = 6 sectors = 24K allocated.
>  
> I seem to be only seeing 100% overhead for parity plus a little for metadata 
> and its parity.
> 
> What fundamental concept am I missing?
> 
> The spreadsheet shows how much space will be allocated, which is reflected in 
> the zpool `allocated` property.  However, you are looking at the zfs `used` 
> and `referenced` properties.  These properties (as well as `available` and 
> all other zfs (not zpool) accounting values) take into account the expected 
> RAIDZ overhead, which is calculated assuming 128K logical size blocks.  This 
> means that zfs accounting hides the parity (and padding) overhead when the 
> block size is around 128K.  Other block sizes may see (typically only 
> slightly) more or less space consumed than expected (e.g. if the `recordsize` 
> property has been changed, a 1GB file may have zfs `used` of 0.9G, or 1.1G).
> 
> As indicated in cell F23, the expected overhead for 4K-sector 8-wide RAIDZ2 
> is 41% (which is around what the RAID5 overhead would be, 2/6 = 33%).  This 
> is taken into account in the "RAID-Z deflation ratio" (`vdev_deflate_ratio`). 
>  In other words, `used = allocated / 1.41`.  If we undo that, we get `21.4G * 
> 1.41 = 30.2G`, which is around what we expected.
> 
> Thanks for that - it should give me 

Re: [developer] raidz overhead with ashift=12

2019-06-07 Thread Richard Elling
> On Jun 6, 2019, at 10:54 PM, Mike Gerdts  wrote:
> 
> I'm motivated to make zfs set refreservation=auto do the right thing in the 
> face of raidz and 4k physical blocks, but have data points that provide 
> inconsistent data.  Experimentation shows raidz2 parity overhead that matches 
> my expectations for raidz1.
> 
> Let's consider the case of a pool with 8 disks in one raidz2 vdev, ashift=12.
> 
> In the spreadsheet 
> 
>  from Matt's How I Learned to Stop Worrying and Love RAIDZ 
> 
>  blog entry, the "RAIDZ2 parity cost" sheet cells F4 and F5 suggest the 
> parity and padding cost is 200%.  That is, a 10 gig zvol with volblocksize=4k 
> or 8k should both end up taking up 30 gig of space.
> 
> Experimentation tells me that they each use just a little bit more than 
> double the amount that was calculated by refreservation=auto.  In each of 
> these cases, compression=off and I've overwritten them with `dd if=/dev/zero 
> ...`

IIRC, the skip blocks are accounted in the pool's "alloc", but not in the 
dataset's
"used"
 -- richard
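
For anyone who wants to reproduce the numbers in this thread, here is a small
standalone sketch of the parity-plus-padding arithmetic for an 8-wide raidz2
with ashift=12 (illustrative C only, not OpenZFS source; the names are made up).
Under these assumptions an 8K volblock allocates 6 sectors (24K), and a 128K
block allocates 45 sectors for 32 sectors of data, which is where the ~1.41
deflation ratio quoted earlier comes from:

/* Illustrative only; not OpenZFS source, all names made up. */
#include <stdio.h>
#include <stdint.h>

static uint64_t
raidz_sectors(uint64_t ndisks, uint64_t nparity, uint64_t ashift, uint64_t psize)
{
	uint64_t sector = 1ULL << ashift;
	uint64_t dcols = ndisks - nparity;
	uint64_t dsect = (psize + sector - 1) / sector;                   /* data sectors */
	uint64_t total = dsect + nparity * ((dsect + dcols - 1) / dcols); /* + parity */
	return (((total + nparity) / (nparity + 1)) * (nparity + 1));     /* + padding */
}

int
main(void)
{
	/* 8-wide raidz2, ashift=12 */
	printf("8K block  : %llu sectors\n",
	    (unsigned long long)raidz_sectors(8, 2, 12, 8192));      /* 6 -> 24K */
	printf("128K block: %llu sectors (%.2fx the 32 data sectors)\n",
	    (unsigned long long)raidz_sectors(8, 2, 12, 131072),
	    raidz_sectors(8, 2, 12, 131072) / 32.0);                  /* 45 -> ~1.41 */
	return (0);
}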

> 
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk0
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk0  used   21.4G  -
> zones/mg/disk0  referenced 21.4G  -
> zones/mg/disk0  logicalused10.0G  -
> zones/mg/disk0  logicalreferenced  10.0G  -
> zones/mg/disk0  volblocksize   8K default
> zones/mg/disk0  refreservation 10.3G  local
> $ zfs get 
> used,referenced,logicalused,logicalreferenced,volblocksize,refreservation 
> zones/mg/disk1
> NAMEPROPERTY   VALUE  SOURCE
> zones/mg/disk1  used   21.4G  -
> zones/mg/disk1  referenced 21.4G  -
> zones/mg/disk1  logicalused10.0G  -
> zones/mg/disk1  logicalreferenced  10.0G  -
> zones/mg/disk1  volblocksize   4K -
> zones/mg/disk1  refreservation 10.6G  local
> $ zpool status zones
>   pool: zones
>  state: ONLINE
>   scan: none requested
> config:
> 
> NAME   STATE READ WRITE CKSUM
> zones  ONLINE   0 0 0
>   raidz2-0 ONLINE   0 0 0
> c0t55CD2E404C314E1Ed0  ONLINE   0 0 0
> c0t55CD2E404C314E85d0  ONLINE   0 0 0
> c0t55CD2E404C315450d0  ONLINE   0 0 0
> c0t55CD2E404C31554Ad0  ONLINE   0 0 0
> c0t55CD2E404C315BB6d0  ONLINE   0 0 0
> c0t55CD2E404C315BCDd0  ONLINE   0 0 0
> c0t55CD2E404C315BFDd0  ONLINE   0 0 0
> c0t55CD2E404C317724d0  ONLINE   0 0 0
> # echo ::spa -c | mdb -k | grep ashift | sort -u
> ashift=000c
> 
> Overwriting from /dev/urandom didn't change the above numbers in any 
> significant way.
> 
> My understanding is that each volblocksize block has data and parity spread 
> across a minimum of 3 devices so that any two could be lost and still 
> recover.  Considering the simple case of volblocksize=4k and ashift=12, 200% 
> overhead for parity (+ no pad) seems spot-on.  I seem to be only seeing 100% 
> overhead for parity plus a little for metadata and its parity.
> 
> What fundamental concept am I missing?
> 
> TIA,
> Mike

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf89af487ee658da3-M96e6a1eff39c33b730555c19
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


[developer] draid workshop agenda: 3-May-2019

2019-04-26 Thread Richard Elling
Here is a link to the draid workshop agenda. Comments are welcome, especially
if you cannot attend. Pass along to interested folks.

https://docs.google.com/document/d/e/2PACX-1vT5zc4ovQcxQpvcXHvvCW_vHPQGcIWHj60RwviSeFfhtUpTpLBzIxJX7dOpgW6Kt-H1h3IqE7h1lTXX/pub
 


 -- richard


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T8a367fd2ad9bcd64-M1bc5a633a6f2c676862754db
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


[developer] draid workshop May 3, 2019

2019-04-18 Thread Richard Elling
We're organizing a workshop for ZFS developer and users who are specifically 
interested in the 
draid project. The goal is to update status and detail the work remaining to 
get draid ready for the
larger ZFS community.

Date: 3-May-2019, 9AM until ...
Location: Delphix office, San Francisco, CA, USA

If you don't know what draid is, please visit 
http://open-zfs.org/wiki/Roadmap#Parity_Declustered_RAIDZ_.28draid.29 


If you're interested in attending, please RSVP to 
richard.ell...@richardelling.com 

 -- richard


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tc8a4c4a4c80fd746-M59571d45c366dc54f4e80010
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Proposal: Platform-specific ZFS properties

2019-02-26 Thread Richard Elling


> On Feb 25, 2019, at 9:53 PM, Garrett D'Amore  wrote:
> 
> I suspect having the platform “resolve” which property to use is going to be 
> problematic at the very best.  The semantics are platform specific, and to 
> make matters worse, I suspect that even on some platforms, you will find 
> distributions that have imparted their own meanings to these properties.  For 
> example, I know of several different distributions built on illumos, and they 
> don’t necessarily handle the properties in a perfectly compatible way.  I 
> suspect the situation may actually be worse on Linux, where some 
> distributions are likely to ship with *wildly* different userland and system 
> layers (e.g. systemd vs. legacy rc scripts), and may thus wind up having 
> quite different supported properties.

Yes. Also consider that there are multiple implementations of the shares. For 
example, 
shareiscsi on Linux could refer to no less than 4 completely different iscsi 
target implementations.


>  
> Probably, the best solution here is for distributions to use their own “user 
> property” for this, and that could apply to operating systems as well.  For 
> example, one can imagine “openindiana:sharenfs” as a property.  The upshot of 
> that is that these properties might not convey automatically when moving file 
> systems between distributions, but as the semantics (and possibly even 
> syntax) of the contents are assumed not be 100% compatible, probably this is 
> for the very best.  It is to be hoped that the NAS vendors building on top of 
> ZFS are already making decent use of user properties (and namespace prefixes) 
> to minimize problems.  Certainly that is the case for RackTop.

Oracle's change from sharesmb to share.smb seems to try to address some sort of 
namespace unification. But it still relies on the backend being unified 
(share(1m)) -- very 
unlikely to happen elsewhere. IMHO Oracle's naming change is not sufficiently 
different from
user properties to warrant a whole new dot-separated naming scheme.

>  
> Automatic conversion (or simply using a single shared value) only makes sense 
> when there is broad agreement about the property contents.  Absent that, 
> namespace separation, *without* naïve attempts to convert, is probably the 
> next best option.

Or redesign entirely. The ease of sharing at filesystem mount time is trivial 
to recreate using
sysevents. What is difficult to do with sysevents is unsharing prior to umount. 
A generic plugin
might be easier and more extensible for those services that are beyond NFS/SMB 
(eg BeeGFS)
1. run this after mount
2. run that prior to umount
 -- richard



--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T6b50d98b3cf62715-M73dbfd564ac4bec338afe810
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] not attachable and importable device

2019-01-26 Thread Richard Elling


> On Jan 26, 2019, at 1:23 PM, za.fus...@gmail.com wrote:
> 
> Hi,
> 
> this is more a question of interest and no loss of data is connected. Hence 
> skip if you don't have time or fun ;)
> 
> 
> I played around with zfs in virtualbox to learn. I used 4 virtual disks: 
> zfsD1, zfsD2, zfsD3 & zfsD4
> Main reason was how to learn  to make a mirrored pool t1 (zfsD1 and zfsD2), 
> transport one disk and recreate the pool on second PC . All worked fine and I 
> now like zfs. I mention it here, to make you aware that my disks contain old 
> zfs pool data.
> 
> Realizing /dev/sdX is mixing up when removing HDs in virtualbox. I wanted to 
> use labels (names?) that I can memorize rather than UUIDs. I used parted to 
> give the partitions the zfsDX names.
> 
> I were able to create a new pool u1 and attach the disk by label. Sorry I 
> used this confusing command sequence:
> #zpool create u1 /dev/sdb
> #zpool export u1
> #zpool import -d /dev/disk/by-partlabel -aN -f
> 
> #zpool status shows:
> NAME STATE 
> u1 ONLINE
>   zfsD1ONLINE
> 
> The interesting happened when I wanted to attach the second disk
> # zpool attach u1 /dev/disk/by-partlabel/zfsD1  /dev/disk/by-partlabel/zfsD2
> invalid vdev specification
> use '-f' to override the following errors:
> /dev/disk/by-partlabel/zfsD2 is part of exported pool 't1'

Read this again. Read it out loud, if that helps.

The disk came back and was identified as being exported from elsewhere.
Therefore ZFS knows that there is a pool on the disk and tries to prevent you
from accidentally destroying it.

> 
> Now I became curios and tried to reimport that disk. But failed.
> #zpool import -d /dev/disk/by-partlabel/zfsD2 t1
> cannot open '/dev/sdc1': Not a directory
> cannot import 't1': no such pool available

This is expected, the -d option takes a directory, not a file.
 -- richard

> 
> I also tried without luck:
> zpool import - didn' list anything
> zpool import -a   - didn't work
> zpool import -d /dev/disk/by-id -aN -f - t1 not shown at zpool status
> 
> 
> My guess is, if I manage to create a pool t1 and then attach the zfsD2 to it 
> as first (and only) disk, it might work.
> 
> Any comments ?
> 
> Cheers
> Zach
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tc7e62464ec389621-M2b6ec6c7ecee0d181f802360
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] OpenZFS feature flag at zpool create proposal

2018-12-07 Thread Richard Elling

People don't read man pages, so they won't read about this option or any of the 
other options.
For those people who do read man pages, we can simply document the appropriate 
flags
and their shortcut aliases.

But... people don't read man pages and once the pool is created, it is too late 
to go back and
read the man page.


The good news is, we can make man page changes now. Section "HOW TO CREATE  AN
OS-PORTABLE POOL"
 -- richard

> On Dec 7, 2018, at 11:15 AM, Matthew Ahrens  wrote:
> 
> On Fri, Dec 7, 2018 at 10:49 AM Jason King via openzfs-developer 
> mailto:developer@lists.open-zfs.org>> wrote:
> Are there any concerns this might become a bit unwieldy over time?  I’m 
> undecided on this myself, but thought it’d be worth raising the question.
> 
> Just as a data point, had this existed when ZFS was first publicly released 
> (Nov 2005 is the date I can find), that would mean 13 ‘portable-XXX’ values, 
> plus ‘portable’ (though a number of values might be equivalent) today.
> 
> I don't think so.  The fact that there are N yearly values doesn't place much 
> cognitive burden on users.  We'll want to make sure the implementation 
> gathers this info into one place but it can probably just be one giant table 
> that we add an entry to each year.  We can retroactively create entries 
> starting from 2013 or 2014 (feature flags were introduced May 2012).
> 
> --matt
>  
> 
> 
> 
> From: Matthew Ahrens  
> Reply: openzfs-developer  
> 
> Date: December 7, 2018 at 12:14:17 PM
> To: openzfs-developer  
> 
> Subject:  Re: [developer] OpenZFS feature flag at zpool create proposal 
> 
>> 
>> 
>> On Thu, Dec 6, 2018 at 11:17 AM Matthew Ahrens > > wrote:
>> 
>> 
>> On Thu, Dec 6, 2018 at 10:06 AM Josh Paetzel > > wrote:
>> 
>> 
>> 
>> On Thu, Dec 6, 2018, at 11:40 AM, Matthew Ahrens wrote:
>>> 
>>> 
>>> On Wed, Dec 5, 2018 at 9:21 PM Josh Paetzel >> > wrote:
>>> There was a proposal on this mailing list a week or so ago that sparked 
>>> some good discussion regarding setting feature flags at pool creation.  We 
>>> discussed this proposal on the last OpenZFS call and while we all agreed it 
>>> was a good idea, we felt the UI could use a little workshopping.
>>> 
>>> Thanks for leading this discussion, Josh!
>>>  
>>> 
>>> This proposal is simply a starting point for a discussion.
>>> 
>>> The proposal is to add an optional flag at zpool create time.  -o features=
>>> 
>>> which can be set to one of the following 4 settings:
>>> 
>>> all:  The default and the current behavior of zpool create.  This setting 
>>> enables all features supported by the in use OpenZFS version.
>>> 
>>> compat: Enable async_destroy, empty_bpobj, filesystem_limits, lz4_compress, 
>>> spacemap_histogram, extensible_dataset, bookmarks, enabled_txg, and 
>>> hole_birth. This set of features is supported by FreeBSD/FreeNAS 9.3 (July 
>>> of 2014), ZoL 0.6.4 ( 0.6.5.6 was used in Ubuntu 16.04 LTS), and Illumos in 
>>> the early to mid 2014 era.
>>> 
>>> I think the term "portable" is more specific than "compat[ible]", so "-o 
>>> features=portable" is probably a better name.  Thanks to whoever suggested 
>>> that at the meeting Tuesday.
>> 
>> +1
>> 
>>> I think that the definition of "-o features=portable" needs to change over 
>>> time, as features become more widely available.  Do you have thoughts on 
>>> how specifically we should do that?
>> 
>> One thought would be:  As a delivery vehicle drops out of support we can 
>> remove it from the list.  So say we have -o features=portable include Ubuntu 
>> 16.04 LTS.  When support for Ubuntu 16.04 LTS expires we remove it from the 
>> portable list.  Now, that directly contradicts the first pass at this.  For 
>> instance FreeBSD (and FreeNAS) 9.3-R have long been out of support by the 
>> "vendor"
>> 
>> I think that's a very conservative way to define it - essentially: you 
>> "portable means you can take this pool to any FreeBSD or Ubuntu that's 
>> supported by the vendor".  I would suggest we look at it from the other 
>> direction: "portable means that there's some version of FreeBSD, ZoL, and 
>> illumos that you can take your pool to".  More specifically we could say 
>> "it's in a released version of FreeBSD (e.g. FreeBSD 11.1), a released 
>> version of ZoL (e.g. Zol 0.7.2), and in the illumos master branch" (since 
>> there's no "releases" of illumos).  That said, I'm open to variations on 
>> this (e.g. should we say it's in a SmartOS release rather than illumos 
>> master?  Or it's in FreeBSD-STABLE vs a numbered release?)
>> 
>> Here's a more detailed proposal:
>> 
>> Add new options to "zpool create", "zpool upgrade", and "zpool set":
>>  - zpool create -o features=portable-20XX (yearly values)
>>  - zpool create -o features=portable
>>  - zpool upgrade -o features=portable-20XX
>>  - zpool upgrade 
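For context, here is a rough sketch of how this looks with today's tooling
versus under the proposal. The -d flag and the per-feature feature@ properties
are existing zpool syntax; the features=portable forms are only the proposal
above, and the pool, vdev, and year names are placeholders:

  # today: create with every feature disabled, then opt in to a portable subset
  zpool create -d \
      -o feature@async_destroy=enabled \
      -o feature@empty_bpobj=enabled \
      -o feature@lz4_compress=enabled \
      tank raidz2 da0 da1 da2 da3

  # proposed shorthand for the same idea
  zpool create -o features=portable-2018 tank raidz2 da0 da1 da2 da3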

Re: [developer] Userland ZFS?

2018-10-01 Thread Richard Elling
> On Oct 1, 2018, at 9:47 AM, Chuck Tuffli  wrote:
> 
> The OpenZFS Projects page lists some bullet points for a Userland ZFS 
> implementation. I have a ZFS related experiment that would benefit from 
> running in user-space and was wondering if this is done, a work-in-progress, 
> etc.. Are there any pointers to previous discussions of this and how it was 
> envisioned to work? TIA.

The current userland implementation I'm working with is cstor, a fork of 
OpenZFS on Linux.
I'm not aware of another current version, though periodically the FUSE folks 
mention ZFS on FUSE.
https://github.com/openebs/zfs
 -- richard


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tc0d568ae70dc41d7-M79943709e04641f8c89146f3
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Is L2ARC read process single-threaded?

2018-08-30 Thread Richard Elling


> On Aug 30, 2018, at 7:15 AM, w.kruzel via openzfs-developer 
>  wrote:
> 
> The flamegraphs are here:
> https://drive.google.com/open?id=1vM-5wy4s-QhV2D3hBVh5bPgaqPqEKsMa 
> 
> 
> There are 11 of them.
> Files out.svg to out4.svg are dtrace flamegraphs of reading when L2ARC has 
> been in use.
> out5.svg to out10.svg are dtrace flamegraphs of using the nvmecontrol command 
> in read mode, either with a single thread or with multiple threads.
> 
> So, what I noticed is that only when I used nvmecontrol with multiple 
> threads, i.e.:
> # nvmecontrol perftest -n 4 -o read -s 65536 -t 10 nvme0ns1
> can I then find this process "kernel`nvme_qpair_process_completions" - just 
> search for nvme in the graph.
> It's hard to select it in some of them.
> kernel`nvme_qpair_process_completions
> kernel`intr_event_execute_handlers
> kernel`nvme_qpair_submit_request
> kernel`nvme_qpair_complete_tracker
> kernel`nvme_ctrlr_submit_io_request
> 
> Is this the queueing system for access to the nvme disk? See out10, out6, and out7 
> for the nvme processes.
> All I know is that when arc_read runs, it does not talk to these nvme 
> processes.

flamegraphs sample stacks executing on CPUs. They are useless for the analysis 
you're looking for.
ZFS knows nothing about NVMe, SATA, SCSI, or any other low-level block 
protocol. Nor does it care.
To get to your answer, look at the block interface boundary.
 -- richard
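To make "look at the block interface boundary" concrete, a sketch of counting
block-level I/Os per device with the DTrace io provider (illumos-style field
names; FreeBSD's io provider is similar but its devinfo members can differ):

  # count reads and writes issued to each block device over a 10-second window
  dtrace -n 'io:::start
      { @[args[1]->dev_statname, args[0]->b_flags & B_READ ? "R" : "W"] = count(); }
      tick-10s { exit(0); }'

If the cache device shows up here while the workload runs, reads really are
reaching it through the block layer; if not, the limit is above that boundary.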

> 
> I haven't tried using the nvme as regular disk and then looking at the output 
> yet, as we are currently using the file server quite extensively.
> 
> I have also tried iostat -x, which is very useful except that it looks at 
> nvd and not nvme (a small detail), so it will not notice the nvmecontrol reads.
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf62628db027682f7-M19d34ae166b5cee602311d8b
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Potential bug recently introduced in arc_adjust() that leads to unintended pressure on MFU eventually leading to dramatic reduction in its size

2018-08-30 Thread Richard Elling
Hi Mark, 
yes, this is the change I've tested on ZoL. It is a trivial, low-risk change 
that is needed to restore the 
previous behaviour.
 -- richard
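For anyone wanting to observe the ARC behaviour discussed below (i.e. whether
MFU recovers once the patch is applied), the split is visible in the arcstats
kstats; a sketch using the FreeBSD sysctl names (ZoL exposes the same counters
in /proc/spl/kstat/zfs/arcstats):

  # sample the MFU and MRU resident sizes every 10 seconds
  while sleep 10; do
      sysctl kstat.zfs.misc.arcstats.mfu_size kstat.zfs.misc.arcstats.mru_size
  done

For the reuse-heavy workload described in this thread, a recovered ARC should
show mfu_size climbing back well above mru_size.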

> On Aug 30, 2018, at 7:40 AM, Mark Johnston  wrote:
> 
> On Thu, Aug 30, 2018 at 09:55:27AM +0300, Paul wrote:
>> 30 August 2018, 00:22:14, by "Mark Johnston" :
>> 
>>> On Wed, Aug 29, 2018 at 12:42:33PM +0300, Paul wrote:
 Hello team,
 
 
 It seems like a commit on Mar 23 introduced a bug: if, during execution of 
 arc_adjust(),
 the target is reached after MRU is evicted, the current code continues evicting 
 MFU. Before said
 commit, on the step prior to MFU eviction, the target value was recalculated 
 as:
 
  target = arc_size - arc_c;
 
 arc_size here is a global variable that was being updated accordingly 
 during MRU eviction,
 hence this expression resulted in a zero or negative target if MRU eviction 
 was enough
 to reach the original goal.
 
 The modern version uses a cached value of arc_size, called asize:
 
  target = asize - arc_c;
 
 Because asize stays constant during execution of the whole body of 
 arc_adjust(), it means that the
 above expression will always evaluate to a value > 0, causing MFU to be 
 evicted every 
 time, even if MRU eviction has already reached the goal. Because of the 
 difference in 
 nature of MFU and MRU, globally this leads to an eventual reduction of the amount 
 of MFU in the ARC 
 to dramatic numbers.
>>> 
>>> Hi Paul,
>>> 
>>> Your analysis does seem right to me.  I cc'ed the openzfs mailing list
>>> so that an actual ZFS expert can chime in; it looks like this behaviour
>>> is consistent between FreeBSD, illumos and ZoL.
>>> 
>>> Have you already tried the obvious "fix" of subtracting total_evicted
>>> from the MFU target?
>> 
>> We are going to apply the asize patch (plus the ameta, as suggested by 
>> Richard) and reboot 
>> one of our production servers this night or the following.
> 
> Just to be explicit, are you testing something equivalent to the patch
> at the end of this email?
> 
>> Then we have to wait a few days and observe the ARC behaviour.
> 
> Thanks!  Please let us know how it goes: we're preparing to release
> FreeBSD 12.0 shortly and I'd like to get this fixed in head/ as soon as
> possible.
> 
> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c 
> b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> index 1387925c4607..882c04dba50a 100644
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> @@ -4446,6 +4446,12 @@ arc_adjust(void)
>arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
>}
> 
> +   /*
> +* Re-sum ARC stats after the first round of evictions.
> +*/
> +   asize = aggsum_value(&arc_size);
> +   ameta = aggsum_value(&arc_meta_used);
> +
>/*
> * Adjust MFU size
> *
> 
> --
> openzfs: openzfs-developer
> Permalink: 
> https://openzfs.topicbox.com/groups/developer/T10a105c53bcce15c-M1c45cd09114d2ce2e8c9dd26
>  
> 
> Delivery options: https://openzfs.topicbox.com/groups/developer/subscription 
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T10a105c53bcce15c-Mb937b1ff0ccbad450028c211
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Re: Potential bug recently introduced in arc_adjust() that leads to unintended pressure on MFU eventually leading to dramatic reduction in its size

2018-08-29 Thread Richard Elling
Thanks for passing this along, Mark.
Comments embedded

> On Aug 29, 2018, at 2:22 PM, Mark Johnston  wrote:
> 
> On Wed, Aug 29, 2018 at 12:42:33PM +0300, Paul wrote:
>> Hello team,
>> 
>> 
>> It seems like a commit on Mar 23 introduced a bug: if, during execution of 
>> arc_adjust(),
>> the target is reached after MRU is evicted, the current code continues evicting MFU. 
>> Before said
>> commit, on the step prior to MFU eviction, the target value was recalculated as:

arc_size is hot, so it was broken up into per-cpu counters and asize is now a 
snapshot
of the sum of the counters...

>> 
>>  target = arc_size - arc_c;
>> 
>> arc_size here is a global variable that was being updated accordingly 
>> during MRU eviction,
>> hence this expression resulted in a zero or negative target if MRU eviction 
>> was enough
>> to reach the original goal.
>> 
>> The modern version uses a cached value of arc_size, called asize:
>> 
>>  target = asize - arc_c;
>> 
>> Because asize stays constant during execution of the whole body of arc_adjust(), 
>> it means that the
>> above expression will always evaluate to a value > 0, causing MFU to be 
>> evicted every 
>> time, even if MRU eviction has already reached the goal. Because of the 
>> difference in 
>> nature of MFU and MRU, globally this leads to an eventual reduction of the amount of 
>> MFU in the ARC 
>> to dramatic numbers.
> 
> Hi Paul,
> 
> Your analysis does seem right to me.  I cc'ed the openzfs mailing list
> so that an actual ZFS expert can chime in; it looks like this behaviour
> is consistent between FreeBSD, illumos and ZoL.

Agree. In the pre-aggsum code, arc_size would have changed after the MRU 
adjustment.
Now it does not. I have at least one correlation to this occurring in a 
repeatable test that
I can run on my ZoL test machine (when it is finished punishing some other 
code).

> 
> Have you already tried the obvious "fix" of subtracting total_evicted
> from the MFU target?

ameta also needs to be re-aggsummed after the MRU adjustments.
 -- richard

>> Servers that run the version of FreeBSD prior to the issue have this picture 
>> of ARC:
>> 
>>   ARC: 369G Total, 245G MFU, 97G MRU, 36M Anon, 3599M Header, 24G Other
>> 
>> As you can see, MFU dominates. This is the nature of our workload: we have a 
>> considerably 
>> small dataset that we use constantly and repeatedly, and a large dataset 
>> that we use
>> rarely.
>> 
>> But on the modern version of FreeBSD the picture is dramatically different: 
>> 
>>   ARC: 360G Total, 50G MFU, 272G MRU, 211M Anon, 7108M Header, 30G Other
>> 
>> This leads to a much heavier burden on the disk sub-system.
>> 
>> 
>> Commit that introduced a bug: 
>> https://github.com/freebsd/freebsd/commit/555f9563c9dc217341d4bb5129f5d233cf1f92b8

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T10a105c53bcce15c-M53cff3b459ecfa98e446241c
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Re: Is L2ARC read process single-threaded?

2018-08-29 Thread Richard Elling



> On Aug 29, 2018, at 11:49 AM, w.kruzel via openzfs-developer 
>  wrote:
> 
> I think yes, there is sufficient demand to have I/O at such a level. What do 
> you mean by a higher rate for the same workload? If you mean types of devices: I have 
> tested two Intel nvme disks, and one of them had a throughput limit on 1 
> thread at about 225MB/s while the other had an output of 285MB/s.
> I shall provide the flamegraphs tomorrow.
> But I see a pattern between single threaded reads from NVME and multi 
> threaded reads.

It is usually easy to see the I/Os to a device and see how many are queued 
there. 
I don't run FreeNAS or *BSD, but I'd be surprised if "iostat -x" doesn't show 
the queue
depth. Alternatively, iosnoop can show exactly when I/Os are scheduled and 
completed
so it is easy to observe overlaps.

Now measurements at the disk won't prove that the I/O is multithreaded, but it 
can
disprove that the I/O is single-threaded.
 -- richard
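As a rough illustration, the per-device queue column in iostat is enough to
rule the single-threaded theory in or out (column names vary: actv on illumos,
qlen on FreeBSD; the device name is a placeholder):

  # watch outstanding I/Os on the cache device at 1-second intervals
  iostat -x nvd0 1

A queue depth that sits above 1 while L2ARC is being read means multiple reads
are in flight at once, i.e. the read path is not single-threaded.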


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf62628db027682f7-Mca2a0302a1fc292e441d8265
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Re: Is L2ARC read process single-threaded?

2018-08-23 Thread Richard Elling
L2ARC uses the ZIO pipeline, just like everything else. Very parallel. But if 
your workload isn’t parallel, then...

  -- richard
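One way to see whether the limit is the workload rather than the cache device
is to drive the same files with one reader and then with many; a sketch using
fio, where the directory, size, and job counts are placeholders:

  # serial: one reader, one outstanding request -- roughly what a dd test measures
  fio --name=serial --directory=/tank/test --rw=randread --bs=128k --size=8g \
      --ioengine=psync --numjobs=1 --runtime=60 --time_based

  # parallel: many readers with queued I/O -- lets the ZIO pipeline overlap requests
  fio --name=parallel --directory=/tank/test --rw=randread --bs=128k --size=8g \
      --ioengine=posixaio --iodepth=16 --numjobs=8 --group_reporting \
      --runtime=60 --time_based

If the parallel run approaches the device's rated throughput and the serial run
does not, the ceiling is the workload's concurrency, not L2ARC.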



> On Aug 23, 2018, at 7:03 PM, Jason Matthews  wrote:
> 
> 
> In 1989 a 4mb stick of ram was like $800. RAM is cheap despite price fixing. 
> 
> Having recently maxed out the 64mb of RAM on my personal IPX in like 1994, I 
> remember telling Len Rose I can’t imagine having a gigabyte of RAM. He 
> laughed at me. 
> 
> Now I sit on racks of systems with 768gb of RAM. RAM is cheap. 
> 
> J. 
> 
> Sent from my iPhone
> 
>> On Aug 23, 2018, at 6:08 PM, Rich via openzfs-developer 
>>  wrote:
>> 
>> Not every board can take ass-tons of RAM, and DDR4 RAM prices have
>> gone markedly up in the last 2 years, not down. (There's even a fun
>> price-fixing lawsuit or two in the works.)
>> 
>> You'd also need to buy a decently high-end chip to exceed 128 GB of
>> RAM on your server.
>> 
>> So while I agree the need for L2ARC in a number of situations has gone
>> down, it's hardly limited to the "highly budget oriented plays."
>> 
>> - Rich
>> 
>>> On Thu, Aug 23, 2018 at 8:42 PM, Jason Matthews  wrote:
>>> 
>>> Maybe I am doing it wrong but I am using NVMe for primary storage and ass
>>> tons of ram for arc.
>>> 
>>> I think L2ARC is relegated to the highly budget-oriented plays these days as
>>> RAM is so cheap. Buy some more ram and fooor-get-about it.
>>> 
>>> J.
>>> 
>>> Sent from my iPhone
>>> 
>>> On Aug 23, 2018, at 1:07 PM, Sanjay Nadkarni 
>>> wrote:
>>> 
>>> Found this link for FreeBSD too
>>> http://www.brendangregg.com/blog/2015-03-10/freebsd-flame-graphs.html
>>> 
>>> -Sanjay
>>> 
>>> 
>>> 
>>> On 8/23/18 1:02 PM, Sanjay Nadkarni wrote:
>>> 
>>> Would be useful if you could get flamegraphs when you run into this. See
>>> https://github.com/brendangregg/FlameGraph
>>> 
>>> Once we have that, the we can have a better understanding of what's going on
>>> and we can dtrace it further to figure it out.
>>> 
>>> -Sanjay
>>> 
>>> 
>>> 
>>> On 8/23/18 5:47 AM, w.kruzel via openzfs-developer wrote:
>>> 
>>> It's interesting what you said, as I have two examples (both with different
>>> Intel nvme disks) that show otherwise.
>>> 
>>> Being nvme, I was expecting read performance from L2ARC at 2GB/s+ levels,
>>> yet I only get ~200MB/s read speeds when I certainly know it is being read
>>> from L2ARC.
>>> 
>>> When tested (details in my earlier post), I can get the same speed with
>>> nvmecontrol perftest that is single threaded.
>>> Multi threaded perftest gives 2GB/s + output.
>>> 
>>> I also have this raised with FreeNAS, where one of their devs tested it on
>>> another make of L2ARC device he had installed and confirmed that it looks like a
>>> single-threaded process.
>>> 
>>> Can someone look into this please?
>>> It is a major performance hit for L2ARC, making it not really fit for
>>> purpose :(
>>> 
>>> Kind regards,
>>> Wojciech Kruzel
>>> 
>>> 
>>> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf62628db027682f7-M658226abcd72e10176704f0d
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] Re: Is L2ARC read process single-threaded?

2018-08-22 Thread Richard Elling


> On Aug 22, 2018, at 11:06 AM, w.kruzel via openzfs-developer 
>  wrote:
> 
> I would really like to know if the L2ARC read process single-threaded.

It is not single threaded.
 -- richard

> Also how can we make it multi threaded and is it possible?
> 
> Thanks,
> Wojciech

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/Tf62628db027682f7-M6057a6efe78ca1ab5f261085
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] re-adding a drive to a pool causes resilver to start over?

2018-07-09 Thread Richard Elling



> On Jul 9, 2018, at 2:39 PM, Ken Merry  wrote:
> 
> Hi ZFS folks,
> 
> We (Spectra Logic) have seen some odd behavior with resilvers in RAIDZ3 pools.
> 
> The codebase in question is FreeBSD stable/11 from July 2017, at 
> approximately FreeBSD SVN version 321310.
> 
> We have customer systems with (sometimes) hundreds of SMR drives in RAIDZ3 
> vdevs in a large pool.  (A typical arrangement is a 23-drive RAIDZ3, and some 
> customers will put everything in one giant pool made up of a number of 
> 23-drive RAIDZ3 arrays.)
> 
> The SMR drives in question have a bug that sometimes causes them to go off 
> the SAS bus for up to two minutes.  (They’re usually gone a lot less than 
> that, up to 10 seconds.)  Once they come back online, zfsd puts the drive 
> back in the pool and makes it online.

ouch

> 
> If a resilver is active on a different drive, once the drive that temporarily 
> left comes back, the resilver apparently starts over from the beginning.
> 
> This leads to resilvers that take forever to complete, especially on systems 
> with high load.
> 
> Is this expected behavior?

Scans/resilvers are at the DSL layer. The scan thread goes through each dataset 
and starts at 
the txg needed (a full scan effectively starts at txg 0).

> 
> It seems that only one scan can be active on a pool at any given time.  Is 
> that correct?  If so, is that true for an entire top level pool, or just a 
> given redundancy group?  (In this case, it would be the RAIDZ3 vdev.)

There is one scan thread.

> 
> Is there anything we can do to make sure the resilvers complete in a 
> reasonable period of time or otherwise improve the behavior?  (Short of 
> putting in different drives…I have already suggested that.)

There are ways to change or tune the ZIO scheduler, but that won't make SMR 
drives any faster.
 -- richard
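For reference, the knobs usually meant by "tune the ZIO scheduler" are the
per-vdev active-I/O limits and the scan time slice; on a FreeBSD stable/11
system they look roughly like the sysctls below. Names and defaults vary by
release, so treat these as illustrative only:

  # allow more concurrent scrub/resilver I/Os per vdev
  sysctl vfs.zfs.vdev.scrub_max_active=3
  # let resilver work consume a larger slice of each txg
  sysctl vfs.zfs.resilver_min_time_ms=5000

As noted, none of this makes the SMR drives themselves any faster.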

> 
> Thanks,
> 
> Ken
>  —
> Ken Merry
> k...@freebsd.org
> 

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2a7340f4c0c48fa9-M1557c2e89aed98caf806a17a
Delivery options: https://openzfs.topicbox.com/groups


Re: [developer] zfs tests design/issues

2018-06-26 Thread Richard Elling


> On Jun 26, 2018, at 3:49 AM, Josh Paetzel  wrote:
> On Jun 26, 2018, at 4:03 AM, Igor Kozhukhov wrote:
> 
>> Personally, to me, KSH is wrong with this result - because a 64-bit application 
>> should use 64-bit integers like other applications do.
>> BASH and ZSH agree with me and produce the correct result.
>> I didn't check other shells.
>> How old/updated is KSH?
>> Based on the Debian changelog file - 93u+20120801-3.1 - it is a code drop from 
>> the year 2012.
>> 
>> but BASH is:
>> https://packages.debian.org/stretch/bash 
>> 
>> 4.4 latest version
>> http://git.savannah.gnu.org/cgit/bash.git?h=devel 
>> 
>> 
>> My primary goal was and is to use more universal tools, the same as other 
>> platforms: Linux, FreeBSD, OSX.
>> And I think BASH would be much better as the default shell for the zfs tests,
>> because other platforms - like Linux, FreeBSD, OSX - do not have KSH installed 
>> by default and would have to install it only for the zfs tests.
>> 
>> I’d like to see comments about this idea from OpenZFS community.
>> 
>> also, about others ideas, described in first email with this subject.
>> 
>> -Igor
>> 
> 
> My $.02 is:
> 
> By the time something is complicated enough that you're contemplating porting it 
> to a different shell, just use python.

python2 or python3?  :-/

> 
> I’ll be the first to admit I’m anti-shell. It’s a result of 25 years of 
> experience with multi-platform shell scripts.

To pile on this, anyone who has ever tried to write cross-platform shell scripts knows 
that the builtin math
is not consistent. Back in the 1990's many people were also transitioning from 
32 to 64 bit and got bit.
Hence, you'll often see calls to bc for real math and portability (of sorts).

Next you'll find that different OSes deliver different versions for various 
non-technical reasons. So just
because you're running "bash" it doesn't mean you have all of the current 
features. Apple, I'm looking at you...
 -- richard
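A minimal illustration of the builtin-math inconsistency, plus the bc escape
hatch; exact results depend on the shell version and build, so treat the
comments as typical rather than guaranteed:

  ksh -c 'echo $((2 ** 63))'    # some ksh93 builds switch to floating point here
  bash -c 'echo $((2 ** 63))'   # bash wraps around in signed 64-bit arithmetic
  echo '2 ^ 63' | bc            # bc is arbitrary precision: 9223372036854775808

which is why portable test code tends to push anything near the 64-bit edge out
to bc.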


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T42f8147492666f65-Maf6912692830bfc594bd8d27
Delivery options: https://openzfs.topicbox.com/groups


[developer] Re: [openzfs/openzfs] 9466 add JSON output support to channel programs (#619)

2018-05-16 Thread Richard Elling
richardelling commented on this pull request.



> @@ -0,0 +1,30 @@
+#!/bin/ksh -p
+#
+# CDDL HEADER START
+#
+# The contents of this file are subject to the terms of the
+# Common Development and Distribution License (the "License").
+# You may not use this file except in compliance with the License.
+#
+# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+# or http://www.opensolaris.org/os/licensing.

nit: but while we're copying... opensolaris.org is long gone. This is a better 
URL:
https://opensource.org/licenses/CDDL-1.0


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/619#pullrequestreview-120817324
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/discussions/T0465226805877059-M89cef064143e9a6809261ec0
Delivery options: https://openzfs.topicbox.com/groups


[developer] Re: [openzfs/openzfs] 9337 zfs get all is slow due to uncached metadata (#599)

2018-03-25 Thread Richard Elling
In ZoL there is a dbuf_stats kstat that is the appropriate place for these. It 
will be natural to extend for the ZoL port.


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/599#issuecomment-375989049
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/discussions/T644dae5d5a17704c-M5ccf1db6c1bc5c2d18dbb8ea
Delivery options: https://openzfs.topicbox.com/groups


Re: [developer] Input on refreservation=auto

2018-03-19 Thread Richard Elling
> On Mar 19, 2018, at 7:40 AM, Mike Gerdts <mike.ger...@joyent.com> wrote:
> 
> 
> On Fri, Mar 16, 2018 at 5:38 PM, Richard Elling 
> <richard.ell...@richardelling.com <mailto:richard.ell...@richardelling.com>> 
> wrote:
>> On Mar 15, 2018, at 9:48 PM, Mike Gerdts <mike.ger...@joyent.com 
>> <mailto:mike.ger...@joyent.com>> wrote:
>> I had started down that route, then convinced myself that it may be there 
>> for good reason.  Can you explain how a refreservation greater than the size 
>> calculated by zvol_volsize_to_reservation() is useful?  I couldn't come up 
>> with a way (aside from a potential bug in that copies is not always 
>> accounted for).
> 
> Sure, raidz skip blocks are not accounted for. In part this is logically due 
> to skip blocks being assigned
> at the SPA layer and reservations are at the DSL layer. The pathological 
> example is raidz2 on 4kn disks
> with volblocksize=8k (default). The predicted reservation is 8k per block 
> (logical) plus 8k parity = 16k, but
> the actual allocated space is 24k. The DSL "free" space assumes 16k so it 
> overestimates the usable space.
> Thus you can run out of allocated space in the pool before hitting 
> refreservation -- a bad thing.
> One way to inoculate is to increase refreservation to a value greater than 
> volsize. 
> 
> 
> Thanks for that lesson.  It seems as though it would be better to fix 
> zvol_volsize_to_reservation() to account for this rather than to leave it to 
> system operators to know about this and then each independently come up with 
> the appropriate algorithm to reserve the right amount of space.  Is the 
> algorithm for performing the proper calculation written down somewhere that 
> is publicly accessible?

Sure, the seminal blog is: 
https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
 
<https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz>

Each vdev can have a different allocation ratio and raid config, but if you 
pick the worst case for the pool at 
import/create and store it in spa_t, then it is almost handy.

Human users often do not realize the defaults at play here. Also, 
there are far too many people 
propagating the "always set ashift=12 virus" so it could be handy to print a 
warning in zfs(1m).
 -- richard



--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/discussions/Te3d593ba00521b6d-Ma355379a37d594b1204a7ecf
Delivery options: https://openzfs.topicbox.com/groups


Re: [developer] Input on refreservation=auto

2018-03-16 Thread Richard Elling


> On Mar 15, 2018, at 9:48 PM, Mike Gerdts <mike.ger...@joyent.com> wrote:
> 
> On Thu, Mar 15, 2018 at 9:10 PM, Richard Elling 
> <richard.ell...@richardelling.com <mailto:richard.ell...@richardelling.com>> 
> wrote:
> 
> 
>> On Mar 15, 2018, at 1:30 PM, Mike Gerdts <mike.ger...@joyent.com 
>> <mailto:mike.ger...@joyent.com>> wrote:
>> 
>> On Thu, Mar 15, 2018 at 3:00 PM, Matthew Ahrens <mahr...@delphix.com 
>> <mailto:mahr...@delphix.com>> wrote:
>> Yes, I agree and that all sounds great (including "zfs set 
>> refreservation=auto" to get back to the originally-computed refreservation). 
>>  A shame that we didn't catch this when implementing "zfs clone" back in the 
>> day.
>> 
>> I assume that refreservation will continue to be a non-inheritable property, 
>> and that "refreservation=auto" is just a shortcut for "refreservation=123GB" 
>> (or whatever the right number is).  So if you set it to "auto", "zfs get" 
>> will show "123GB".  And changing the volsize will do whatever it does today.
>> 
>> Pretty much, but currently you can't set refreservation to a value greater 
>> than volsize.  The largest explicit value that is allowed is still volsize.
> 
> That is a simple bug to fix and I thought we already had a fix, but perhaps 
> only in ZoL? In any case, 
> the fix needs to be in openzfs. IMHO, we really don't need to check the 
> requested refreservation 
> against volsize at all.
> 
> I had started down that route, then convinced myself that it may be there for 
> good reason.  Can you explain how a refreservation greater than the size 
> calculated by zvol_volsize_to_reservation() is useful?  I couldn't come up 
> with a way (aside from a potential bug in that copies is not always accounted 
> for).

Sure, raidz skip blocks are not accounted for. In part this is logically due to 
skip blocks being assigned
at the SPA layer and reservations are at the DSL layer. The pathological 
example is raidz2 on 4kn disks
with volblocksize=8k (default). The predicted reservation is 8k per block 
(logical) plus 8k parity = 16k, but
the actual allocated space is 24k. The DSL "free" space assumes 16k so it 
overestimates the usable space.
Thus you can run out of allocated space in the pool before hitting 
refreservation -- a bad thing.
One way to inoculate is to increase refreservation to a value greater than 
volsize. 

Note: there are several other remedies to this pathological example, but they 
aren't pertinent to this discussion.
 -- richard
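Spelling out the arithmetic in that pathological example, using the allocation
rule from the raidz stripe-width blog cited in this thread (raidz allocations
are rounded up to a multiple of nparity + 1 sectors):

  volblocksize 8k on 4k-sector disks  = 2 data sectors
  raidz2 parity                       = 2 parity sectors
  what the DSL predicts               = 4 sectors = 16k
  rounded up to a multiple of 3       = 6 sectors = 24k actually allocated

so a third of each allocation (the 2 skip sectors) is invisible to the
refreservation estimate in this configuration.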

> 
> Overwriting a zvol, taking a snapshot, then overwriting it again doesn't seem 
> to reduce usedbyrefreservation.
> 
> # zfs create -V 100m zones/t/100m
> # dd if=/dev/zero of=/dev/zvol/rdsk/zones/t/100m bs=1024k
> write: I/O error
> 101+0 records in
> 101+0 records out
> 104857600 bytes transferred in 0.280976 secs (373191031 bytes/sec)
> 
> # zfs get -p  space,refreservation,volsize zones/t/100m
> NAME  PROPERTY  VALUE  SOURCE
> zones/t/100m  name  zones/t/100m   -
> zones/t/100m  available 270941458432   -
> zones/t/100m  used  110362624  -
> zones/t/100m  usedbysnapshots   0  -
> zones/t/100m  usedbydataset 105127936  -
> zones/t/100m  usedbyrefreservation  5234688-
> zones/t/100m  usedbychildren0  -
> zones/t/100m  refreservation110362624  local
> zones/t/100m  volsize   104857600  local
> 
> # zfs snapshot zones/t/100m@1
> # dd if=/dev/zero of=/dev/zvol/rdsk/zones/t/100m bs=1024k
> write: I/O error
> 101+0 records in
> 101+0 records out
> 104857600 bytes transferred in 0.421021 secs (249055635 bytes/sec)
> 
> # zfs get -p  space,refreservation,volsize zones/t/100m
> NAME  PROPERTY  VALUE  SOURCE
> zones/t/100m  name  zones/t/100m   -
> zones/t/100m  available 270835998720   -
> zones/t/100m  used  215490560  -
> zones/t/100m  usedbysnapshots   105127936  -
> zones/t/100m  usedbydataset 105127936  -
> zones/t/100m  usedbyrefreservation  5234688-
> zones/t/100m  usedbychildren0  -
> zones/t/100m  refreservation110362624  local
> zones/t/100m  volsize   104857600  local
> 
> 
> Frankly, I have some concerns about these numbers from before the snapshot 
> (same output as first set above, just trimmed)
> 
> # zfs get -p  space,refreservation,volsize zones/t/100m
> NAME  PROPERTY  VALUE  SOURCE
> zones/t/

[developer] Re: [openzfs/openzfs] 9194 mechanism to override ashift at pool creation time (#570)

2018-03-09 Thread Richard Elling
richardelling commented on this pull request.



> @@ -1431,6 +1433,7 @@ vdev_open(vdev_t *vd)
vd->vdev_asize = asize;
vd->vdev_max_asize = max_asize;
vd->vdev_ashift = MAX(ashift, vd->vdev_ashift);
+   vd->vdev_ashift = MAX(zfs_ashift_override, vd->vdev_ashift);

A better name is `zfs_ashift_min` which is a fine approach.

For the name `zfs_ashift_override` the expected code is something like:
`vd->vdev_ashift = zfs_ashift_override > 0 ? zfs_ashift_override : 
vd->vdev_ashift;`
which is also a fine approach.

Basically, I'd like to see it become less ambiguous and more consistent with 
other tunables.


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/570#discussion_r173607794
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/discussions/Te17b61abfaa32615-Mb4537cdc078097099d823227
Delivery options: https://openzfs.topicbox.com/groups


[developer] Re: [openzfs/openzfs] nuke spa_dbgmsg (#580)

2018-03-06 Thread Richard Elling
richardelling commented on this pull request.



> @@ -55,11 +55,10 @@ extern boolean_t zfs_free_leak_on_eio;
 #define ZFS_DEBUG_DNODE_VERIFY  (1 << 2)
 #define ZFS_DEBUG_SNAPNAMES     (1 << 3)
 #define ZFS_DEBUG_MODIFY        (1 << 4)
-#define ZFS_DEBUG_SPA           (1 << 5)

agree. It is more important that the existing, operable values remain at <<6, <<7, 
... than that <<5 be reused. One disadvantage of using bitmasks. This LGTM. Thanks

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/580#discussion_r172722215
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/discussions/T15bf5c1e0b672170-M2d75cb6b3003fe890895794c
Delivery options: https://openzfs.topicbox.com/groups


[developer] Re: [openzfs/openzfs] DLPX-49012 nuke spa_dbgmsg (#580)

2018-03-05 Thread Richard Elling
richardelling commented on this pull request.



> @@ -55,11 +55,10 @@ extern boolean_t zfs_free_leak_on_eio;
 #define ZFS_DEBUG_DNODE_VERIFY  (1 << 2)
 #define ZFS_DEBUG_SNAPNAMES     (1 << 3)
 #define ZFS_DEBUG_MODIFY        (1 << 4)
-#define ZFS_DEBUG_SPA           (1 << 5)

precisely

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/580#discussion_r172379152
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/discussions/Tc900752b49c91015-M7e485c8aef249426aa48f04a
Delivery options: https://openzfs.topicbox.com/groups


[developer] Re: [openzfs/openzfs] DLPX-49012 nuke spa_dbgmsg (#580)

2018-03-05 Thread Richard Elling
richardelling commented on this pull request.



> @@ -55,11 +55,10 @@ extern boolean_t zfs_free_leak_on_eio;
 #define ZFS_DEBUG_DNODE_VERIFY  (1 << 2)
 #define ZFS_DEBUG_SNAPNAMES     (1 << 3)
 #define ZFS_DEBUG_MODIFY        (1 << 4)
-#define ZFS_DEBUG_SPA           (1 << 5)

I'd rather not remove this for the selfishly simple case that existing debug 
scripts outside of the OpenZFS tree have documented these and exposed them to 
people. Rather than removal, can we just add a /* deprecated */ comment?


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/580#pullrequestreview-101312084
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/discussions/Tc900752b49c91015-M75ec336cca9764b0da7e65d3
Delivery options: https://openzfs.topicbox.com/groups


[developer] Re: [openzfs/openzfs] 9194 mechanism to override ashift at pool creation time (#570)

2018-02-24 Thread Richard Elling
richardelling commented on this pull request.



> @@ -1431,6 +1433,7 @@ vdev_open(vdev_t *vd)
vd->vdev_asize = asize;
vd->vdev_max_asize = max_asize;
vd->vdev_ashift = MAX(ashift, vd->vdev_ashift);
+   vd->vdev_ashift = MAX(zfs_ashift_override, vd->vdev_ashift);

It seems strange that an override really isn't an override; it is MAX() of the 
override and something else.


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/570#pullrequestreview-99123563
--
openzfs-developer
Archives: 
https://openzfs.topicbox.com/groups/developer/discussions/Te17b61abfaa32615-M36177da9ac3fd11b67cb841d
Powered by Topicbox: https://topicbox.com


Re: [developer] implement soft-hba property

2018-02-03 Thread Richard Elling
> On Feb 3, 2018, at 8:18 AM, Garrett D'Amore <garr...@damore.org> wrote:
> 
> sd.conf is ugly as hell though.

agree 100%

> 
> What we *really* need is a better tunable system — more like what we have 
> with dladm for NICs, but intended for storage.  I think we’d like to be able 
> to set tunables for both targets and HBAs.  I have some ideas here, but 
> precious little time to do anything about it.

yep. when we made the sd changes we also added kstats to count retries and the 
various
causes for timeouts. In the end, it begins to look more like network -- a good 
thing. Multipathing
further complicates things, but the heavy lifting is in sd.

FWIW, it is relatively easy to use mdb to change the pertinent values 
on-the-fly for testing. One
reason this is important is that, as-is, there is little visibility, and 
dtracing to get the internal state
of sd is tricky because the function boundaries aren't always where you need 
them to be. For AoE
in particular, you want to move towards faster timeouts and more retries before 
failing the I/O.
 -- richard
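As an illustration of the on-the-fly approach (sd_io_time is the long-standing
sd command-timeout tunable; sd_soft_io_time is the variable proposed in this
thread, so adjust the names to whatever your build actually exposes):

  # inspect the current value, then drop the timeout to 10 seconds for testing
  echo 'sd_io_time/D' | mdb -k
  echo 'sd_io_time/W 0t10' | mdb -kw

  # the same trick works for the proposed soft-drive variant
  echo 'sd_soft_io_time/W 0t120' | mdb -kw

Anything poked this way reverts at reboot, which is what you want while
experimenting before committing a value to /etc/system or sd.conf.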

> 
>  - Garrett
> 
> On Sat, Feb 3, 2018 at 8:11 AM Richard Elling 
> <richard.ell...@richardelling.com <mailto:richard.ell...@richardelling.com>> 
> wrote:
> 
> On Feb 3, 2018, at 7:57 AM, Igor Kozhukhov <i...@dilos.org 
> <mailto:i...@dilos.org>> wrote:
> 
>> Hi All,
>> 
>> I'd like to propose an implementation of a soft-hba property for the scsi layer with 
>> timing updates.
>> 
>> Problem: we have HW HBAs like LSI, where we are using a direct connection of 
>> drives to the HBA with low latency.
> 
> In the past, we’ve done this with an override property, similar to the other 
> properties for sd devices like physical block size, retries, etc. In other 
> words, adding a new global (aka /etc/system virus) is not as elegant as 
> adding an override in sd.conf
> 
>   -- richard
> 
>> 
>> We have iSCSI drives using the sd module, where we try to use timing related 
>> to the HW HBA,
>> but latency to network drives is different from directly connected local 
>> drives.
>> 
>> I have implemented my idea to split sd_soft_io_time for hw and soft drives:
>> https://bitbucket.org/dilos/dilos-illumos/commits/b82c668f9e837864afca983809c4d4a28f5051b0
>>  
>> <https://bitbucket.org/dilos/dilos-illumos/commits/b82c668f9e837864afca983809c4d4a28f5051b0>
>> https://bitbucket.org/dilos/dilos-illumos/commits/4d805e127dc4211671ad70c2e026014e6de4a991
>>  
>> <https://bitbucket.org/dilos/dilos-illumos/commits/4d805e127dc4211671ad70c2e026014e6de4a991>
>> 
>> and i’m using it with iscsi and aoe drives.
>> 
>> Right now we can use the /etc/system variable sd_soft_io_time
>> for soft drives, and it does not impact real hw drives.
>> 
>> This is an example of my idea, and if it is of interest to others I can try to 
>> contribute it, or you can try to port it to the illumos tree.
>> I have no time to set up and use the original build env and can only provide 
>> reports based on DilOS, but I hope the changes in the sd module, based on the smartos 
>> changes, can be usable/applicable for illumos.
>> 
>> Best regards,
>> -Igor
>> 
> 
> openzfs-developer | Archives 
> <https://openzfs.topicbox.com/groups/developer/discussions/Ta14496565d3bc26c-M4f82719b034d4d8e242e40bb>
>  | Powered by Topicbox <https://topicbox.com/>

--
openzfs-developer
Archives: 
https://openzfs.topicbox.com/groups/developer/discussions/Ta14496565d3bc26c-Mb87b51ac094daaeb8e8e8561
Powered by Topicbox: https://topicbox.com


Re: [developer] implement soft-hba property

2018-02-03 Thread Richard Elling
> On Feb 3, 2018, at 7:57 AM, Igor Kozhukhov  wrote:
> 
> Hi All,
> 
> I'd like to propose an implementation of a soft-hba property for the scsi layer with 
> timing updates.
> 
> Problem: we have HW HBAs like LSI, where we are using a direct connection of 
> drives to the HBA with low latency.

In the past, we’ve done this with an override property, similar to the other 
properties for sd devices like physical block size, retries, etc. In other 
words, adding a new global (aka /etc/system virus) is not as elegant as adding 
an override in sd.conf

  -- richard

> 
> We have iSCSI drives using the sd module, where we try to use timing related 
> to the HW HBA,
> but latency to network drives is different from directly connected local 
> drives.
> 
> I have implemented my idea to split sd_soft_io_time for hw and soft drives:
> https://bitbucket.org/dilos/dilos-illumos/commits/b82c668f9e837864afca983809c4d4a28f5051b0
> https://bitbucket.org/dilos/dilos-illumos/commits/4d805e127dc4211671ad70c2e026014e6de4a991
> 
> and i’m using it with iscsi and aoe drives.
> 
> Right now we can use the /etc/system variable sd_soft_io_time
> for soft drives, and it does not impact real hw drives.
> 
> This is an example of my idea, and if it is of interest to others I can try to 
> contribute it, or you can try to port it to the illumos tree.
> I have no time to set up and use the original build env and can only provide 
> reports based on DilOS, but I hope the changes in the sd module, based on the smartos 
> changes, can be usable/applicable for illumos.
> 
> Best regards,
> -Igor
> 

--
openzfs-developer
Archives: 
https://openzfs.topicbox.com/groups/developer/discussions/Ta14496565d3bc26c-M8f8a4e821a54690ec36d2f07
Powered by Topicbox: https://topicbox.com


[developer] Re: [openzfs/openzfs] 7938 disable LBA weighting on files and SSDs (#470)

2017-09-22 Thread Richard Elling
richardelling commented on this pull request.



> @@ -59,6 +59,11 @@ vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t 
> *max_psize,
int error;
 
/*
+* Rotational optimizations only make sense on block devices

I don't think we need a code change here; there is lots of existing baggage to 
deal with. For example, in illumos there are two flags in sd: 
un_f_is_rotational and un_f_is_solid_state.

I'd be happier with the comments taking the perspective that rotation is the 
exception that applies to block devices.


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/470#discussion_r140586759
--
openzfs-developer
Archives: 
https://openzfs.topicbox.com/groups/developer/discussions/T2365753bb8ec5214-M16fb4b23eac60ee4b53ac331
Powered by Topicbox: https://topicbox.com


Re: [developer] ZFS user quotas with compression

2016-11-07 Thread Richard Elling

> On Nov 7, 2016, at 2:45 PM, Jorgen Lundman <lund...@lundman.net> wrote:
> 
> Richard Elling wrote:
> 
>> 
>> No, this is a lie. If I upload 50GB of data that compresses or dedups to 
>> 25GB, then I want to pay for 25GB.
>>  — richard
> 
> Huh neat. So how far does it stretch though? If I have compression off, you 
> are happy. What if I use lz4, but gzip-9 would save more, should you get 
> money back? What if Super-lz4 comes out next year, should you be compensated? 
> What if I get more active and re-encode your video file with x265 and save 
> you even more money? I'm fiddling with your bits man! That's totally not cool 
> (but I'm saving you money!)
> 
> But the best one is the one that confused me. We have 10G of space (actually it is free 
> space), and when customers downloaded their 10G backup and found it took 13G of 
> local space, they called support to complain. Seriously. Customers are funny.
> 
> But light-heartedness aside, at the end of the day, I am not suggesting ZFS 
> change quota; I'm not even suggesting you change how you use your quota. But 
> let's talk about adding the additional feature for those who want it. I 
> certainly do not want lua in my kernel, but I'm not trying to stop the elders 
> from adding it :)
> 
> Just because all major "cloud" storage does it the non-ZFS way doesn't mean 
> this way is good, or "right". But it does create a de facto standard, an 
> industry standard.

Respectfully, you’re changing the subject. We’re talking about quota, not 
billing.
Logicalreferenced is already available for billing.
 — richard
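For anyone following along, both sides of that split are already visible per
dataset; a sketch with placeholder dataset and user names:

  # bill on what the user logically stored, and see what it costs on disk
  zfs get -p used,referenced,logicalreferenced,compressratio tank/home/user1

  # keep enforcing the limit on allocated space, as today
  zfs set userquota@user1=10G tank/home

That is, logicalreferenced is the billing number, while the existing quota and
userquota properties keep constraining real allocation.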



---
openzfs-developer
Archives: https://www.listbox.com/member/archive/274414/=now
RSS Feed: https://www.listbox.com/member/archive/rss/274414/28015062-cce53afa
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=28015062_secret=28015062-f966d51c
Powered by Listbox: http://www.listbox.com


Re: [developer] ZFS user quotas with compression

2016-11-07 Thread Richard Elling

> On Nov 7, 2016, at 1:21 PM, Paul B. Henson <hen...@acm.org> wrote:
> 
>> From: Richard Elling
>> Sent: Monday, November 07, 2016 9:44 AM
>> 
>> As a customer, I don’t like getting ripped off. I think you need to find a 
>> better
>> justification.
> 
> That seems a bit harsh :). If you pay for "50GB of data storage" and you 
> upload 50GB of data, how are you getting ripped off? Even if there's a magic 
> fairy in the background that makes your 50GB of data only actually take up 
> 40GB of service provider storage?

No, this is a lie. If I upload 50GB of data that compresses or dedups to 25GB, 
then I want to pay for 25GB.
 — richard



---
openzfs-developer
Archives: https://www.listbox.com/member/archive/274414/=now
RSS Feed: https://www.listbox.com/member/archive/rss/274414/28015062-cce53afa
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=28015062_secret=28015062-f966d51c
Powered by Listbox: http://www.listbox.com


Re: [developer] Re: ZFS user quotas with compression

2016-11-04 Thread Richard Elling

> On Nov 4, 2016, at 11:19 AM, Ben RUBSON <ben.rub...@gmail.com> wrote:
> 
> 
>> On 04 Nov 2016, at 19:13, Richard Elling <richard.ell...@richardelling.com 
>> <mailto:richard.ell...@richardelling.com>> wrote:
>> 
>> 
>>> On Nov 4, 2016, at 11:11 AM, MrRakeshsank . <rksh.s...@gmail.com 
>>> <mailto:rksh.s...@gmail.com>> wrote:
>>> 
>>> any one has any ideas on it? Thanks!
>> 
>> Why would you want a "logical size" quota as opposed to the currently 
>> implemented
>> "allocation size" quota?
> 
> I admit I would be interested in this feature too: a user needs 10GB,
> you give him 10GB through a "logical" quota,
> and as a storage admin you enable compression to better handle your storage,
> being able to address more users, thus reducing costs.
> Interesting :)

indeed :-)

But while this simple example might work for a compression-only environment, 
its core
concepts are utterly destroyed when snapshots, clones, and dedup are used.

A better idea is to bill for logical size, constrain with allocation size.
 — richard




---
openzfs-developer
Archives: https://www.listbox.com/member/archive/274414/=now
RSS Feed: https://www.listbox.com/member/archive/rss/274414/28015062-cce53afa
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=28015062_secret=28015062-f966d51c
Powered by Listbox: http://www.listbox.com


Re: [developer] Re: ZFS user quotas with compression

2016-11-04 Thread Richard Elling

> On Nov 4, 2016, at 11:11 AM, MrRakeshsank .  wrote:
> 
> any one has any ideas on it? Thanks!

Why would you want a "logical size" quota as opposed to the currently 
implemented
"allocation size" quota?
 — richard

> 
> 
> On Tue, Nov 1, 2016 at 1:33 PM, MrRakeshsank . wrote:
> how to handle the quotas better when compression is on?
> 
> I enabled a user quota of 10GB, but I could copy files up to 15 or 
> 16GB, probably due to compression.
> 
> How can I make sure, or force it, to use the absolute number even with 
> compression?
> 
> # zfs userspace liberate |grep -i user1
> 
> TYPENAME   USED  QUOTA
> POSIX User  user1  15.1G10G
> 
> 
> 
> # zfs get compression testpool 
> NAME  PROPERTY VALUE SOURCE
> testpool  compression  onlocal
> 
> 
> 



---
openzfs-developer
Archives: https://www.listbox.com/member/archive/274414/=now
RSS Feed: https://www.listbox.com/member/archive/rss/274414/28015062-cce53afa
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=28015062_secret=28015062-f966d51c
Powered by Listbox: http://www.listbox.com