Re: [Nfs-ganesha-devel] rpcping comparison nfs-server

2018-03-28 Thread William Allen Simpson

On 3/27/18 9:34 AM, William Allen Simpson wrote:

On 3/25/18 1:44 PM, William Allen Simpson wrote:

On 3/23/18 1:30 PM, William Allen Simpson wrote:

Ran some apples-to-apples comparisons today with V2.7-dev.5:


Without the client-side rbtrees, rpcping works a lot better:


Thought of a small tweak to the list adding routine, so it doesn't
kick the epoll timer unless the SVCXPRT was added to the end of its
timeout list (a much rarer occurrence, but it could happen).
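
For illustration, here is a minimal sketch of the shape of that check
(illustrative names and an assumed expiry-sorted TAILQ, not the actual
ntirpc list handling); the caller kicks the epoll timer only when the new
entry lands at the end of the list:

    #include <sys/queue.h>
    #include <stdbool.h>

    struct timeout_entry {
        TAILQ_ENTRY(timeout_entry) q;
        long expire_ms;
    };
    TAILQ_HEAD(timeout_list, timeout_entry);

    /* Insert in expiry order; return true only when the entry became the
     * last element, the one case where the epoll timer must be re-armed. */
    static bool timeout_insert(struct timeout_list *list,
                               struct timeout_entry *e)
    {
        struct timeout_entry *it;

        TAILQ_FOREACH(it, list, q) {
            if (e->expire_ms < it->expire_ms) {
                TAILQ_INSERT_BEFORE(it, e, q);
                return false;   /* a later entry still bounds the timer */
            }
        }
        TAILQ_INSERT_TAIL(list, e, q);
        return true;            /* new end of list: kick the epoll timer */
    }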

The numbers don't change much, so I ran more of them.  Not sorted
this time, but you get the gist.  Still seeing a huge improvement
around 1,000, with a rough plateau over 10,000 calls.


I've spent some time trying to figure out the plateau.  It turned out
to be a programming error: a break where there should have been a continue.
So all the later entries were timing out, but were being counted as responses.

Sadly, those higher throughput numbers can be disregarded. :(

In my latest rpcping code, I've added a counter for timeouts.
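
For the record, a hedged sketch of the difference (hypothetical names, not
the actual rpcping source): with break, the scan stopped at the first
timed-out entry and the remainder were miscounted; with continue, timeouts
and responses are tallied separately.

    #include <stdbool.h>

    struct ping_call {
        bool done;      /* reply received */
    };

    static void tally(const struct ping_call *calls, int count,
                      int *responses, int *timeouts)
    {
        for (int i = 0; i < count; i++) {
            if (!calls[i].done) {
                (*timeouts)++;
                continue;   /* was "break": later entries were miscounted */
            }
            (*responses)++;
        }
    }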

Happily, the profile shows we are spending our time in recv() and writev(),
exactly as we should be.  No more 48% of the time in rbtree_insert.

Still having a problem with rpcping running to completion.  Needs
more debugging.



Re: [Nfs-ganesha-devel] rpcping comparison nfs-server

2018-03-27 Thread William Allen Simpson

On 3/25/18 1:44 PM, William Allen Simpson wrote:

On 3/23/18 1:30 PM, William Allen Simpson wrote:

Ran some apples-to-apples comparisons today with V2.7-dev.5:


Without the client-side rbtrees, rpcping works a lot better:


Thought of a small tweak to the list adding routine, so it doesn't
kick the epoll timer unless the SVCXPRT was added to the end of its
timeout list (a much rarer occurrence, but it could happen).

The numbers don't change much, so I ran more of them.  Not sorted
this time, but you get the gist.  Still seeing a huge improvement
around 1,000, with a rough plateau over 10,000 calls.

But the raw data looks to me like Ganesha edges up past the kernel
around 1,000,000 calls.  Or maybe the extra Ganesha system call overhead
cancels out when distributed over a longer period of time?

Probably need to run hundreds of times to get a better distribution,
but that's more than I'm willing to do by hand.

Happy baking!


Ganesha (worst, best):

rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 33950.1556, total 33950.1556
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 43668.3435, total 43668.3435



rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 151800.6287, total 151800.6287
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 167828.8817, total 167828.8817


rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 144967.5809, total 144967.5809
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 219739.3627, total 219739.3627
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 218477.8040, total 218477.8040
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 126693.0146, total 126693.0146
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 131807.8768, total 131807.8768

rpcping tcp localhost count=10000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 265231.6362, total 265231.6362
rpcping tcp localhost count=10000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 281711.3287, total 281711.3287
rpcping tcp localhost count=10000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 258412.9101, total 258412.9101
rpcping tcp localhost count=10000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 244638.8736, total 244638.8736
rpcping tcp localhost count=10000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 264594.2726, total 264594.2726

rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 281988.8465, total 281988.8465
rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 282341.2245, total 282341.2245
rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 286837.9973, total 286837.9973
rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 277970.8432, total 277970.8432
rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 285086.8682, total 285086.8682

rpcping tcp localhost count=1000000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 292704.4142, total 292704.4142
rpcping tcp localhost count=1000000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 296892.2598, total 296892.2598
rpcping tcp localhost count=1000000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 287227.5968, total 287227.5968
rpcping tcp localhost count=1000000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 295969.2889, total 295969.2889
rpcping tcp localhost count=1000000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 294702.5526, total 294702.5526





Kernel (worst, best):

rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 46826.6383, total 46826.6383
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 52915.1652, total 52915.1652



rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 175773.3986, total 175773.3986
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 189168.4778, total 189168.4778


rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 

Re: [Nfs-ganesha-devel] rpcping profile

2018-03-25 Thread Matt Benjamin
With N=100000 and num_calls=1000000, on Lemon, test_rbt averages 2.8M
reqs/s.  That's about half the rate when N=10000, which I think is
expected.  If this is really an available rbt in-order
search-remove-insert retire rate when N is 100000, my intuition would
be that it's sufficiently fast not to be the bottleneck your result claims,
and I think it's necessary to understand why.

Matt

On Sun, Mar 25, 2018 at 6:17 PM, Matt Benjamin  wrote:
> 1 What is the peak outstanding size of outstanding calls
>
> 1.1 if e.g. > 100k is that correct: as last week, why would a sensible
> client issue more than e.g. 1000 calls without seeing replies?
>
> 1.3 if outstanding calls is <= 1, why can test_rbt retire millions of
> duty cycles / s in that scenario?
>
> 2 what does the search workload look like when replies are mixed with calls?
> Ie bidirectional rpc this is intended for?
>
> 2.2 Hint: xid dist is not generally sorted;  client defines only its own
> issue order, not reply order nor peer xids;  why is it safe to base reply
> matching around xids being in sorted order?
>
> Matt
>
> On Sun, Mar 25, 2018, 1:40 PM William Allen Simpson
>  wrote:
>>
>> On 3/24/18 7:50 AM, William Allen Simpson wrote:
>> > Noting that the top problem is exactly my prediction by knowledge of
>> > the code:
>> >clnt_req_callback() opr_rbtree_insert()
>> >
>> > The second is also exactly as expected:
>> >
>> >svc_rqst_expire_insert() opr_rbtree_insert() svc_rqst_expire_cmpf()
>> >
>> > These are both inserted in ascending order, sorted in ascending order,
>> > and removed in ascending order
>> >
>> > QED: rb_tree is a poor data structure for this purpose.
>>
>> I've replaced those 2 rbtrees with TAILQ, so that we are not
>> spending 49% of the time there anymore, and am now seeing:
>>
>> rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049
>> program=100003 version=3 procedure=0): mean 151800.6287, total 151800.6287
>> rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049
>> program=100003 version=3 procedure=0): mean 167828.8817, total 167828.8817
>>
>> This is probably good enough for now.  Time to move on to
>> more interesting things.
>>
>>



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] rpcping profile

2018-03-25 Thread Matt Benjamin
1 What is the peak number of outstanding calls?

1.1 if e.g. > 100k is that correct: as last week, why would a sensible
client issue more than e.g. 1000 calls without seeing replies?

1.3 if outstanding calls is <= 1, why can test_rbt retire millions of
duty cycles / s in that scenario?

2 what does the search workload look like when replies are mixed with
calls?  I.e., the bidirectional RPC this is intended for?

2.2 Hint: xid dist is not generally sorted;  client defines only its own
issue order, not reply order nor peer xids;  why is it safe to base reply
matching around xids being in sorted order?

Matt

On Sun, Mar 25, 2018, 1:40 PM William Allen Simpson <
william.allen.simp...@gmail.com> wrote:

> On 3/24/18 7:50 AM, William Allen Simpson wrote:
> > Noting that the top problem is exactly my prediction by knowledge of
> > the code:
> >clnt_req_callback() opr_rbtree_insert()
> >
> > The second is also exactly as expected:
> >
> >svc_rqst_expire_insert() opr_rbtree_insert() svc_rqst_expire_cmpf()
> >
> > These are both inserted in ascending order, sorted in ascending order,
> > and removed in ascending order
> >
> > QED: rb_tree is a poor data structure for this purpose.
>
> I've replaced those 2 rbtrees with TAILQ, so that we are not
> spending 49% of the time there anymore, and am now seeing:
>
> rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 151800.6287, total 151800.6287
> rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 167828.8817, total 167828.8817
>
> This is probably good enough for now.  Time to move on to
> more interesting things.
>
>


Re: [Nfs-ganesha-devel] rpcping comparison nfs-server

2018-03-25 Thread William Allen Simpson

On 3/23/18 1:30 PM, William Allen Simpson wrote:

Ran some apples-to-apples comparisons today with V2.7-dev.5:


Without the client-side rbtrees, rpcping works a lot better:



Ganesha (worst, best):

rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 33950.1556, total 33950.1556
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 43668.3435, total 43668.3435



rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 151800.6287, total 151800.6287
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 167828.8817, total 167828.8817



Kernel (worst, best):

rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 46826.6383, total 46826.6383
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 52915.1652, total 52915.1652



rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 175773.3986, total 175773.3986
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 189168.4778, total 189168.4778



Re: [Nfs-ganesha-devel] rpcping profile

2018-03-25 Thread William Allen Simpson

On 3/24/18 7:50 AM, William Allen Simpson wrote:

Noting that the top problem is exactly my prediction by knowledge of
the code:
   clnt_req_callback() opr_rbtree_insert()

The second is also exactly as expected:

   svc_rqst_expire_insert() opr_rbtree_insert() svc_rqst_expire_cmpf()

These are both inserted in ascending order, sorted in ascending order,
and removed in ascending order

QED: rb_tree is a poor data structure for this purpose.


I've replaced those 2 rbtrees with TAILQ, so that we are not
spending 49% of the time there anymore, and am now seeing:

rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 151800.6287, total 151800.6287
rpcping tcp localhost count=1000 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 167828.8817, total 167828.8817

This is probably good enough for now.  Time to move on to
more interesting things.



Re: [Nfs-ganesha-devel] rpcping

2018-03-15 Thread Daniel Gryniewicz
100k is a much more accurate measurement.  I haven't gotten any
crashes since the fixes from yesterday, but I can keep trying.


On Thu, Mar 15, 2018 at 12:10 PM, William Allen Simpson
 wrote:
> On 3/15/18 10:23 AM, Daniel Gryniewicz wrote:
>>
>> Can you try again with a larger count, like 100k?  500 is still quite
>> small for a loop benchmark like this.
>>
> In the code, I commented that 500 is minimal.  I've done a pile of
> 100, 200, 300, and they perform roughly the same as 500.
>
> rpcping tcp localhost count=100 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 46812.8194, total 46812.8194
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 41285.4267, total 41285.4267
>
> 100k is a lot less (when it works).
>
> tests/rpcping tcp localhost -c 100000
> rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 15901.7190, total 15901.7190
> tests/rpcping tcp localhost -c 100000
> rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 15894.9971, total 15894.9971
>
> tests/rpcping tcp localhost -c 100000 -t 2
> double free or corruption (out)
> Aborted (core dumped)
>
> tests/rpcping tcp localhost -c 100000 -t 2
> double free or corruption (out)
> corrupted double-linked list (not small)
> Aborted (core dumped)
>
> Looks like we have a nice dump test case! ;)



Re: [Nfs-ganesha-devel] rpcping

2018-03-15 Thread William Allen Simpson

On 3/15/18 10:23 AM, Daniel Gryniewicz wrote:

Can you try again with a larger count, like 100k?  500 is still quite
small for a loop benchmark like this.


In the code, I commented that 500 is minimal.  I've done a pile of
100, 200, 300, and they perform roughly the same as 500.

rpcping tcp localhost count=100 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 46812.8194, total 46812.8194
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 41285.4267, total 41285.4267

100k is a lot less (when it works).

tests/rpcping tcp localhost -c 100000
rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 15901.7190, total 15901.7190
tests/rpcping tcp localhost -c 100000
rpcping tcp localhost count=100000 threads=1 workers=5 (port=2049 
program=100003 version=3 procedure=0): mean 15894.9971, total 15894.9971

tests/rpcping tcp localhost -c 100000 -t 2
double free or corruption (out)
Aborted (core dumped)

tests/rpcping tcp localhost -c 100000 -t 2
double free or corruption (out)
corrupted double-linked list (not small)
Aborted (core dumped)

Looks like we have a nice dump test case! ;)



Re: [Nfs-ganesha-devel] rpcping

2018-03-15 Thread Daniel Gryniewicz
Can you try again with a larger count, like 100k?  500 is still quite
small for a loop benchmark like this.

Daniel

On Thu, Mar 15, 2018 at 9:02 AM, William Allen Simpson
 wrote:
> On 3/14/18 3:33 AM, William Allen Simpson wrote:
>>
>> rpcping tcp localhost threads=1 count=500 (port=2049 program=100003
>> version=3 procedure=0): mean 51285.7754, total 51285.7754
>
>
> DanG pushed the latest code onto ntirpc this morning, and I'll submit a
> pullup for Ganesha later today.
>
> I've changed the calculations to be in the final loop, holding onto
> the hope that the original design of averaging each thread result
> might introduce quantization errors.  But it didn't significantly
> change the results.
>
> I've improved the pretty print a bit, now including the worker pool.
> The default 5 worker threads are each handling the incoming replies
> concurrently, so they hopefully keep working without a thread switch.
>
> Another thing I've noted is that the best result is almost always the
> first result after an idle period.  That's opposite of my expectations.
>
> Could it be that the default Ganesha worker pool size of 200 (default)
> or 500 (configured) is much too large, thread scheduler thrashing?
>
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 50989.4139, total 50989.4139
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 32562.0173, total 32562.0173
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 34479.7577, total 34479.7577
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 34070.8189, total 34070.8189
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 33861.2689, total 33861.2689
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 35843.8433, total 35843.8433
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 35367.2721, total 35367.2721
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 31642.2972, total 31642.2972
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 34738.4166, total 34738.4166
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 33211.7319, total 33211.7319
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 35000.5520, total 35000.5520
> rpcping tcp localhost count=500 threads=1 workers=5 (port=2049
> program=100003 version=3 procedure=0): mean 36557.6578, total 36557.6578



Re: [Nfs-ganesha-devel] rpcping

2018-03-15 Thread William Allen Simpson

On 3/14/18 3:33 AM, William Allen Simpson wrote:

rpcping tcp localhost threads=1 count=500 (port=2049 program=100003 version=3 
procedure=0): mean 51285.7754, total 51285.7754


DanG pushed the latest code onto ntirpc this morning, and I'll submit a
pullup for Ganesha later today.

I've changed the calculations to be in the final loop, holding onto
the hope that the original design of averaging each thread result
might introduce quantization errors.  But it didn't significantly
change the results.

I've improved the pretty print a bit, now including the worker pool.
The default 5 worker threads are each handling the incoming replies
concurrently, so they hopefully keep working without a thread switch.

Another thing I've noted is that the best result is almost always the
first result after an idle period.  That's opposite of my expectations.

Could it be that the Ganesha worker pool size of 200 (default)
or 500 (configured) is much too large, causing thread scheduler thrashing?

rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 50989.4139, total 50989.4139
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 32562.0173, total 32562.0173
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 34479.7577, total 34479.7577
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 34070.8189, total 34070.8189
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 33861.2689, total 33861.2689
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 35843.8433, total 35843.8433
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 35367.2721, total 35367.2721
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 31642.2972, total 31642.2972
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 34738.4166, total 34738.4166
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 33211.7319, total 33211.7319
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 35000.5520, total 35000.5520
rpcping tcp localhost count=500 threads=1 workers=5 (port=2049 program=100003 
version=3 procedure=0): mean 36557.6578, total 36557.6578



Re: [Nfs-ganesha-devel] rpcping

2018-03-14 Thread Matt Benjamin
Hi Bill,

I was not (not intentionally, and, I think, not at all) imputing
sentiment to Daniel nor myself.  I was paraphrasing statements Daniel
made not only informally to me, but publicly in this week's
nfs-ganesha call, in which you were present.

I'm not denigrating your work, but I did dispute your conclusion on
the performance of rbtree in the client in a specific scenario, for
which I provided detailed measurement and a program for reproducing
them.  I did also challenge the unambiguous claim by you that
something about my use of rbtree in ntirpc and DRC was an
embarrassment to computing science, but did not use inappropriate
language or imputations (fighting words) in doing so.  THAT appears to
be a denigration of my work by you, not the other way around.  There
were other examples in that email, but it was the most glaring.  I
could go into that pattern further, but I don't think I can or should
on a public mailing list.  I won't post further on this or similar
threads.

Matt

On Wed, Mar 14, 2018 at 4:27 PM, William Allen Simpson
 wrote:
> On 3/14/18 7:27 AM, Matt Benjamin wrote:
>>
>> Daniel doesn't think you've measured much accurately yet, but at least
>> the effort (if not the discussion) aims to.
>>
> I'm sure Daniel can speak for himself.  At your time of writing,
> Daniel had not yet arrived in the office after my post this am.
>
> So I'm assuming you're speculating.  Or denigrating my work and
> attributing that sentiment to Daniel.  I'd appreciate you cease
> doing that.
>
> I've done my best with Tigran's code design that you held onto
> for 6 years without putting it into the tree or keeping it up-to-date.
>
> At this time, there's no indication any numbers are in error.
>
> If you have quantitative information, please provide it.



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] rpcping

2018-03-14 Thread William Allen Simpson

On 3/14/18 7:27 AM, Matt Benjamin wrote:

Daniel doesn't think you've measured much accurately yet, but at least
the effort (if not the discussion) aims to.


I'm sure Daniel can speak for himself.  At your time of writing,
Daniel had not yet arrived in the office after my post this am.

So I'm assuming you're speculating.  Or denigrating my work and
attributing that sentiment to Daniel.  I'd appreciate you cease
doing that.

I've done my best with Tigran's code design that you held onto
for 6 years without putting it into the tree or keeping it up-to-date.

At this time, there's no indication any numbers are in error.

If you have quantitative information, please provide it.



Re: [Nfs-ganesha-devel] rpcping

2018-03-14 Thread Matt Benjamin
Daniel doesn't think you've measured much accurately yet, but at least
the effort (if not the discussion) aims to.

On Wed, Mar 14, 2018 at 2:54 AM, William Allen Simpson
 wrote:

Matt

-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] rpcping

2018-03-14 Thread William Allen Simpson

On 3/13/18 1:58 PM, Daniel Gryniewicz wrote:

rpcping was not thread safe.  I have fixes for it incoming.


With DanG's significant help, we now have better timing results.

There was an implicit assumption in the ancient code that it was
calling single threaded tirpc, while ntirpc is multi-threaded.

The documentation on clock_gettime() says that we cannot obtain
correct timer results between threads.  The starting and stopping
timer calls must be on the same thread.
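
A minimal sketch of that constraint (illustrative only, not the rpcping
source; CLOCK_MONOTONIC is my assumption here): both timestamps are taken
on the measuring thread, bracketing only the call/reply phase.

    #include <time.h>

    static double elapsed_seconds(const struct timespec *start,
                                  const struct timespec *stop)
    {
        return (stop->tv_sec - start->tv_sec) +
               (stop->tv_nsec - start->tv_nsec) / 1e9;
    }

    /* On the measuring thread:
     *   struct timespec t0, t1;
     *   clock_gettime(CLOCK_MONOTONIC, &t0);
     *   ...issue the calls and wait for all replies...
     *   clock_gettime(CLOCK_MONOTONIC, &t1);
     *   rate = count / elapsed_seconds(&t0, &t1);
     */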

I've returned the code to the original design that records only
client replies, not the connection create and destroy.  As
expected, the reports have improved by that margin.

Same result.  More calls ::= slower times.

rpcping tcp localhost threads=1 count=500 (port=2049 program=100003 version=3 
procedure=0): mean 51285.7754, total 51285.7754
rpcping tcp localhost threads=1 count=1000 (port=2049 program=100003 version=3 
procedure=0): mean 44849.7587, total 44849.7587
rpcping tcp localhost threads=1 count=2000 (port=2049 program=100003 version=3 
procedure=0): mean 32418.8600, total 32418.8600
rpcping tcp localhost threads=1 count=3000 (port=2049 program=100003 version=3 
procedure=0): mean 22578.4432, total 22578.4432
rpcping tcp localhost threads=1 count=5000 (port=2049 program=100003 version=3 
procedure=0): mean 18748.8576, total 18748.8576
rpcping tcp localhost threads=1 count=7000 (port=2049 program=100003 version=3 
procedure=0): mean 18532.9326, total 18532.9326
rpcping tcp localhost threads=1 count=10000 (port=2049 program=100003 version=3 
procedure=0): mean 17750.2026, total 17750.2026

As before, multiple call threads are not helping:

rpcping tcp localhost threads=2 count=750 (port=2049 program=100003 version=3 
procedure=0): mean 14615.7612, total 29231.5224
rpcping tcp localhost threads=3 count=750 (port=2049 program=100003 version=3 
procedure=0): mean 8456.7597, total 25370.2792
rpcping tcp localhost threads=5 count=750 (port=2049 program=100003 version=3 
procedure=0): mean 3851.8920, total 19259.4602

We've tried limiting the number of reply threads (was one worker per
reply up to 500 with recycling above that), but the overhead of creating
threads is swamped by something else.  No consistent difference.



Re: [Nfs-ganesha-devel] rpcping

2018-03-14 Thread William Allen Simpson

On 3/13/18 8:27 AM, Matt Benjamin wrote:

On Tue, Mar 13, 2018 at 2:38 AM, William Allen Simpson
 wrote:

but if we assume xids retire in xid order also,


They do.  Should be no variance.  Eliminating the dupreq caching --
also using the rbtree -- significantly improved the timing.


It's certainly correct not to cache, but it's also a special case that
arises from...benchmarking with rpcping, not NFS.


Never-the-less, "significantly improved the timing".

Duplicates are rare.  The DRC needs to be able to get out of the way,
and shouldn't add significant overhead.



Same goes for retire order.  Who said, let's assume the rpcping
requests retire in order?  Oh yes, me above.  


Actually, me in an earlier part of the thread.



Do you think NFS
requests in general are required to retire in arrival order?  No, of
course not.  What workload is the general case for the DRC?  NFS.


The question is not whether (RPC CALL) NFS requests retire in arrival order.

The question in this thread is how far out of order RPC REPLYs retire,
and what the best data structure(s) for this workload would be.



Apparently picked the worst tree choice for this data, according to
computer science. If all you have is a hammer


What motivates you to write this stuff?


Correctness.



Here are two facts you may have overlooked:

1. The DRC has a constant insert-delete workload, and for this
application, IIRC, I put the last inserted entries directly into the
cache.  This both applies standard art on trees (rbtree vs avl
performance on insert/delete heavy workloads), and ostensibly avoids
searching the tree in the common case;  I measured hitrate informally,
looked to be working).


I have no idea why we are off on this tangent here.  The subject is
rpcping, not the DRC.

As to the DRC, we know that in fact the ntirpc "citihash" was of the
wrong data in GSS (the always changing ciphertext instead of the
plaintext), so in that case there was *no* hit rate at all.

In ntirpc v1.6, we now have a formal call to checksum, instead of an
ad hoc addition to the decode.  So we should be getting a better hit
rate.  I look forward to publication of your hit rate results.


2. the key in the DRC caches is hk,not xid.


That should improve the results for DRC RB-trees.

As I've mentioned before, I've never really examined the DRC code.
In person yesterday afternoon, you agreed that the repeated mallocs
in that code provide contention during concurrent thread processing
in the main path.

I've promised to take a look during my zero-copy efforts.

But this thread is about rpcping data structures.



What have you compared it to?  Need a gtest of avl and tailq with the
same data.  That's what the papers I looked at do


[...]

The rb tree either is, or isn't a major contributor to latency.  We'll
ditch it if it is.  Substituting a tailq (linear search) seems an
unlikely choice, but if you can prove your case with the numbers, no
one's going to object.


Thank you.  I'll probably try that in a week or so.

Right now, as mentioned on the conference call, I need some help
diagnosing why the rpcping code crashes.  Some assumptions about
threading seem to be wrong.  DanG is helping immensely!



Re: [Nfs-ganesha-devel] rpcping

2018-03-13 Thread Daniel Gryniewicz

rpcping was not thread safe.  I have fixes for it incoming.

Daniel

On 03/13/2018 12:13 PM, William Allen Simpson wrote:

On 3/13/18 2:38 AM, William Allen Simpson wrote:

In my measurements, using the new CLNT_CALL_BACK(), the client thread
starts sending a stream of pings.  In every case, it peaks at a
relatively stable rate.


DanG suggested that timing was dominated by the system time calls.

The previous numbers were switched to a finer grained timer than
the original code.  JeffL says that clock_gettime() should have had
negligible overhead.

But just to make sure, I've eliminated the per thread timers and
substituted one before and one after.  Unlike previously, this
will include the overhead of setting up the client, in addition to
completing all the callback returns.

Same result.  More calls ::= slower times.

rpcping tcp localhost threads=1 count=1000 (port=2049 program=100003 
version=3 procedure=0): average 36012.0254, total 36012.0254
rpcping tcp localhost threads=1 count=1500 (port=2049 program=100003 
version=3 procedure=0): average 33720.9125, total 33720.9125
rpcping tcp localhost threads=1 count=2000 (port=2049 program=100003 
version=3 procedure=0): average 25604.7542, total 25604.7542
rpcping tcp localhost threads=1 count=3000 (port=2049 program=100003 
version=3 procedure=0): average 21170.0836, total 21170.0836
rpcping tcp localhost threads=1 count=5000 (port=2049 program=100003 
version=3 procedure=0): average 18163.2451, total 18163.2451


Including the 3-way handshake time for setting up the clients does affect
the overall throughput numbers.

rpcping tcp localhost threads=2 count=1500 (port=2049 program=100003 
version=3 procedure=0): average 10379.3976, total 20758.7951
rpcping tcp localhost threads=2 count=1500 (port=2049 program=100003 
version=3 procedure=0): average 10746.9395, total 21493.8790


rpcping tcp localhost threads=3 count=1500 (port=2049 program=100003 
version=3 procedure=0): average 5473.3780, total 16420.1339
rpcping tcp localhost threads=3 count=1500 (port=2049 program=100003 
version=3 procedure=0): average 5886.5549, total 17659.6646


rpcping tcp localhost threads=5 count=1500 (port=2049 program=100003 
version=3 procedure=0): average 3396.9438, total 16984.7190
rpcping tcp localhost threads=5 count=1500 (port=2049 program=100003 
version=3 procedure=0): average 3455.3026, total 17276.5131





Re: [Nfs-ganesha-devel] rpcping

2018-03-13 Thread William Allen Simpson

On 3/13/18 2:38 AM, William Allen Simpson wrote:

In my measurements, using the new CLNT_CALL_BACK(), the client thread
starts sending a stream of pings.  In every case, it peaks at a
relatively stable rate.


DanG suggested that timing was dominated by the system time calls.

The previous numbers were switched to a finer grained timer than
the original code.  JeffL says that clock_gettime() should have had
negligible overhead.

But just to make sure, I've eliminated the per thread timers and
substituted one before and one after.  Unlike previously, this
will include the overhead of setting up the client, in addition to
completing all the callback returns.

Same result.  More calls ::= slower times.

rpcping tcp localhost threads=1 count=1000 (port=2049 program=100003 version=3 
procedure=0): average 36012.0254, total 36012.0254
rpcping tcp localhost threads=1 count=1500 (port=2049 program=100003 version=3 
procedure=0): average 33720.9125, total 33720.9125
rpcping tcp localhost threads=1 count=2000 (port=2049 program=100003 version=3 
procedure=0): average 25604.7542, total 25604.7542
rpcping tcp localhost threads=1 count=3000 (port=2049 program=100003 version=3 
procedure=0): average 21170.0836, total 21170.0836
rpcping tcp localhost threads=1 count=5000 (port=2049 program=100003 version=3 
procedure=0): average 18163.2451, total 18163.2451

Including the 3-way handshake time for setting up the clients does affect
the overall throughput numbers.

rpcping tcp localhost threads=2 count=1500 (port=2049 program=100003 version=3 
procedure=0): average 10379.3976, total 20758.7951
rpcping tcp localhost threads=2 count=1500 (port=2049 program=100003 version=3 
procedure=0): average 10746.9395, total 21493.8790

rpcping tcp localhost threads=3 count=1500 (port=2049 program=100003 version=3 
procedure=0): average 5473.3780, total 16420.1339
rpcping tcp localhost threads=3 count=1500 (port=2049 program=100003 version=3 
procedure=0): average 5886.5549, total 17659.6646

rpcping tcp localhost threads=5 count=1500 (port=2049 program=100003 version=3 
procedure=0): average 3396.9438, total 16984.7190
rpcping tcp localhost threads=5 count=1500 (port=2049 program=100003 version=3 
procedure=0): average 3455.3026, total 17276.5131



Re: [Nfs-ganesha-devel] rpcping

2018-03-13 Thread Matt Benjamin
On Tue, Mar 13, 2018 at 2:38 AM, William Allen Simpson
 wrote:
> On 3/12/18 6:25 PM, Matt Benjamin wrote:
>>
>> If I understand correctly, we always insert records in xid order, and
>> xid is monotonically increasing by 1.  I guess pings might come back
>> in any order,
>
>
> No, they always come back in order.  This is TCP.  I've gone to some
> lengths to fix the problem that operations were being executed in
> arbitrary order.  (As was reported in the past.)

We're aware of the issues with former req queuing.  It was one of my
top priorities to fix in napalm, and we did it.

>
> For UDP, there is always the possibility of loss or re-ordering of
> datagrams, one of the reasons for switching to TCP in NFSv3 (and
> eliminating UDP in NFSv4).
>
> Threads can still block in apparently random order, because of
> timing variances inside FSAL calls.  Should not be an issue here.
>
>
>> but if we assume xids retire in xid order also,
>
>
> They do.  Should be no variance.  Eliminating the dupreq caching --
> also using the rbtree -- significantly improved the timing.

It's certainly correct not to cache, but it's also a special case that
arises from...benchmarking with rpcping, not NFS.
Same goes for retire order.  Who said, let's assume the rpcping
requests retire in order?  Oh yes, me above.  Do you think NFS
requests in general are required to retire in arrival order?  No, of
course not.  What workload is the general case for the DRC?  NFS.

>
> Apparently picked the worst tree choice for this data, according to
> computer science. If all you have is a hammer

What motivates you to write this stuff?

Here are two facts you may have overlooked:

1. The DRC has a constant insert-delete workload, and for this
application, IIRC, I put the last inserted entries directly into the
cache.  This both applies standard art on trees (rbtree vs avl
performance on insert/delete heavy workloads), and ostensibly avoids
searching the tree in the common case;  I measured hitrate informally,
looked to be working).

2. the key in the DRC caches is hk,not xid.

>
>
>> and keep
>> a window of 10,000 records in-tree, that seems maybe like a reasonable
>> starting point for measuring this?
>> I've not tried 10,000 or 100,000 recently.  (The original code
>
> default sent 100,000.)
>
> I've not recorded how many remain in-tree during the run.
>
> In my measurements, using the new CLNT_CALL_BACK(), the client thread
> starts sending a stream of pings.  In every case, it peaks at a
> relatively stable rate.
>
> For 1,000, <4,000/s.  For 100, 40,000/s.  Fairly linear relationship.
>
> By running multiple threads, I showed that each individual thread ran
> roughly the same (on average).  But there is some variance per run.
>
> I only posted the 5 thread results, lowest and highest achieved.
>
> My original message had up to 200 threads and 4 results, but I decided
> such a long series was overkill, so removed them before sending.
>
> That 4,000 and 40,000 per client thread was stable across all runs.
>
>
>> I wrote a gtest program (gerrit) that I think does the above in a
>> single thread, no locks, for 1M cycles (search, remove, insert).  On
>> lemon, compiled at O2, the gtest profiling says the test finishes in
>> less than 150ms (I saw as low as 124).  That's over 6M cycles/s, I
>> think.
>>
> What have you compared it to?  Need a gtest of avl and tailq with the
> same data.  That's what the papers I looked at do

The point is, that is very low latency, a lot less than I expected.
It's probably minimized from CPU caching and so forth, but it tries to
address the more basic question, is expected or unexpected latency
from searching the rb tree a likely contributor to overall latency?
If we get 2M retires per sec (let alone 6-7), is that a likely
supposition?

The rb tree either is, or isn't a major contributor to latency.  We'll
ditch it if it is.  Substituting a tailq (linear search) seems an
unlikely choice, but if you can prove your case with the numbers, no
one's going to object.

Matt

-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] rpcping

2018-03-13 Thread William Allen Simpson

On 3/12/18 6:25 PM, Matt Benjamin wrote:

If I understand correctly, we always insert records in xid order, and
xid is monotonically increasing by 1.  I guess pings might come back
in any order, 


No, they always come back in order.  This is TCP.  I've gone to some
lengths to fix the problem that operations were being executed in
arbitrary order.  (As was reported in the past.)

For UDP, there is always the possibility of loss or re-ordering of
datagrams, one of the reasons for switching to TCP in NFSv3 (and
eliminating UDP in NFSv4).

Threads can still block in apparently random order, because of
timing variances inside FSAL calls.  Should not be an issue here.


but if we assume xids retire in xid order also, 


They do.  Should be no variance.  Eliminating the dupreq caching --
also using the rbtree -- significantly improved the timing.

Apparently picked the worst tree choice for this data, according to
computer science.  If all you have is a hammer



and keep
a window of 10,000 records in-tree, that seems maybe like a reasonable
starting point for measuring this?
I've not tried 10,000 or 100,000 recently.  (The original code

default sent 100,000.)

I've not recorded how many remain in-tree during the run.

In my measurements, using the new CLNT_CALL_BACK(), the client thread
starts sending a stream of pings.  In every case, it peaks at a
relatively stable rate.

For 1,000, <4,000/s.  For 100, 40,000/s.  Fairly linear relationship.

By running multiple threads, I showed that each individual thread ran
roughly the same (on average).  But there is some variance per run.

I only posted the 5 thread results, lowest and highest achieved.

My original message had up to 200 threads and 4 results, but I decided
such a long series was overkill, so removed them before sending.

That 4,000 and 40,000 per client thread was stable across all runs.



I wrote a gtest program (gerrit) that I think does the above in a
single thread, no locks, for 1M cycles (search, remove, insert).  On
lemon, compiled at O2, the gtest profiling says the test finishes in
less than 150ms (I saw as low as 124).  That's over 6M cycles/s, I
think.


What have you compared it to?  Need a gtest of avl and tailq with the
same data.  That's what the papers I looked at do



Re: [Nfs-ganesha-devel] rpcping

2018-03-12 Thread Matt Benjamin
That's certainly suggestive.

I found it hard to believe the red-black tree performance could be
that bad, at a loading of 10K items--even inserting, searching, and
removing in-order.  Then again, I never benchmarked the opr_rbtree
code.

If I understand correctly, we always insert records in xid order, and
xid is monotonically increasing by 1.  I guess pings might come back
in any order, but if we assume xids retire in xid order also, and keep
a window of 10,000 records in-tree, that seems maybe like a reasonable
starting point for measuring this?

I wrote a gtest program (gerrit) that I think does the above in a
single thread, no locks, for 1M cycles (search, remove, insert).  On
lemon, compiled at O2, the gtest profiling says the test finishes in
less than 150ms (I saw as low as 124).  That's over 6M cycles/s, I
think.
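
For anyone who wants to try something comparable without the gtest harness,
here is a rough standalone analogue; it uses the POSIX tsearch()/tfind()/
tdelete() tree rather than opr_rbtree, so it only approximates the gerrit
test (a window of 10,000 sequential keys, 1M search/remove/insert cycles):

    #include <search.h>
    #include <stdio.h>
    #include <time.h>

    #define WINDOW 10000
    #define CYCLES 1000000

    static long keys[WINDOW + CYCLES];

    static int cmp(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        void *root = NULL;
        struct timespec t0, t1;
        long i;

        for (i = 0; i < WINDOW + CYCLES; i++)
            keys[i] = i;
        for (i = 0; i < WINDOW; i++)                /* preload the window */
            tsearch(&keys[i], &root, cmp);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < CYCLES; i++) {
            tfind(&keys[i], &root, cmp);            /* search the oldest key */
            tdelete(&keys[i], &root, cmp);          /* retire it */
            tsearch(&keys[i + WINDOW], &root, cmp); /* insert the newest */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.0f cycles/s\n", CYCLES /
               ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9));
        return 0;
    }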

Matt

Matt

On Mon, Mar 12, 2018 at 4:06 PM, William Allen Simpson
 wrote:
> [These are with a Ganesha that doesn't dupreq cache the null operation.]
>
> Just how slow is this RB tree?
>
> Here's a comparison of 1000 entries versus 100 entries in ops per second:
>
> rpcping tcp localhost threads=5 count=1000 (port=2049 program=100003
> version=3 procedure=0): average 2963.2517, total 14816.2587
> rpcping tcp localhost threads=5 count=1000 (port=2049 program=100003
> version=3 procedure=0): average 3999.0897, total 19995.4486
>
> rpcping tcp localhost threads=5 count=100 (port=2049 program=100003
> version=3 procedure=0): average 39738.1842, total 198690.9208
> rpcping tcp localhost threads=5 count=100 (port=2049 program=100003
> version=3 procedure=0): average 39913.1032, total 199565.5161



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309



Re: [Nfs-ganesha-devel] rpcping

2018-03-12 Thread William Allen Simpson

[These are with a Ganesha that doesn't dupreq cache the null operation.]

Just how slow is this RB tree?

Here's a comparison of 1000 entries versus 100 entries in ops per second:

rpcping tcp localhost threads=5 count=1000 (port=2049 program=100003 version=3 
procedure=0): average 2963.2517, total 14816.2587
rpcping tcp localhost threads=5 count=1000 (port=2049 program=100003 version=3 
procedure=0): average 3999.0897, total 19995.4486

rpcping tcp localhost threads=5 count=100 (port=2049 program=100003 version=3 
procedure=0): average 39738.1842, total 198690.9208
rpcping tcp localhost threads=5 count=100 (port=2049 program=100003 version=3 
procedure=0): average 39913.1032, total 199565.5161



Re: [Nfs-ganesha-devel] rpcping

2018-03-12 Thread William Allen Simpson

One of the limiting factors in our Ganesha performance is that the
NULL operation is going through the dupreq code.  That can be
easily fixed with a check that jumps to nocache.
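
A hedged sketch of that check (all names hypothetical, not the Ganesha
dupreq code); the point is only the shape: recognize procedure 0 (NULL)
before any DRC lookup or insert and take the uncached path.

    enum drc_status { DRC_CACHED, DRC_NOCACHE };

    static enum drc_status drc_start(unsigned int procedure)
    {
        if (procedure == 0)         /* NFSPROC_NULL / NFSPROC4_NULL */
            return DRC_NOCACHE;     /* idempotent ping: skip the cache */
        /* ...the normal dupreq hash/tree lookup would go here... */
        return DRC_CACHED;
    }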

One of the limiting factors in our ntirpc performance seems to be the
call_replies tree that stores the xid of calls to match replies.

Currently, we are using an RB tree.  The XID advances sequentially.

BTW, we have the same problem with fd.  The fd advances sequentially.

Performing sequential inserts, the AVL algorithm is 37.5% faster!
  
https://refactoringlightly.wordpress.com/2017/10/29/performance-of-avl-red-black-trees-in-java/

There is one tree per connection.  We don't really need to worry much
about out of order replies.  So the best structure would be a simpler
tailq list.
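
A minimal sketch of the tailq alternative (illustrative names, not the
ntirpc client code): outstanding calls are appended in xid order, and since
TCP replies arrive in that same order, the match is normally the list head,
so the "search" is O(1) in the common case and only degrades to a short
linear scan if a reply ever arrives out of order.

    #include <sys/queue.h>
    #include <stdint.h>
    #include <stddef.h>

    struct pending_call {
        TAILQ_ENTRY(pending_call) link;
        uint32_t xid;
        /* ...completion callback, result buffer, etc... */
    };
    TAILQ_HEAD(pending_list, pending_call);

    /* CALL side: xids ascend, so new entries always go at the tail */
    static void call_insert(struct pending_list *l, struct pending_call *c)
    {
        TAILQ_INSERT_TAIL(l, c, link);
    }

    /* REPLY side: usually the head; linear otherwise */
    static struct pending_call *reply_match(struct pending_list *l,
                                            uint32_t xid)
    {
        struct pending_call *c;

        TAILQ_FOREACH(c, l, link) {
            if (c->xid == xid) {
                TAILQ_REMOVE(l, c, link);
                return c;
            }
        }
        return NULL;    /* unknown xid */
    }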

In the short term, we discussed hashing the XID before insertion.
But that still has the rapid insertion/deletion issue.

Apparently, insertions and deletions can be so slow that it takes a
count of about 40 before AVL outperforms a simple list.  In the ping
case, we'll never see more than 1.  Its replies will always be
sequential on a TCP connection.

Do we know of any NFS case where we expect any RPC call to return
behind more than 40 other calls on the same connection?

Do we know of any NFS case where we expect to make 40 concurrent
RPC calls on the same connection?

[Remember, we are not talking about client to server calls.  We
are talking about server to client back-channel.]



Re: [Nfs-ganesha-devel] rpcping

2018-03-12 Thread William Allen Simpson

[top post]
Matt produced a new tcp-only variant that skipped rpcbind.

I tried it, and immediately got crashes.  So I've pushed out a few
bug fixes.  With my fixes, here are the results on my desktop.

First and foremost, I compared with my prior results against rpcbind,
and they were comparable.

Next I tried the default nfsv3:

rpcping tcp localhost threads=1 count=1500 (port=2049 program=100003 version=3 
procedure=0): 50000.0000, total 50000.0000
rpcping tcp localhost threads=2 count=1500 (port=2049 program=100003 version=3 
procedure=0): 5000.0000, total 10000.0000
rpcping tcp localhost threads=3 count=1500 (port=2049 program=100003 version=3 
procedure=0): 4166.6667, total 12500.0000

rpcping tcp localhost threads=5 count=500 (port=2049 program=100003 version=3 
procedure=0): 4469.6970, total 22348.4848
rpcping tcp localhost threads=7 count=500 (port=2049 program=100003 version=3 
procedure=0): 3019.9580, total 21139.7059
rpcping tcp localhost threads=10 count=500 (port=2049 program=100003 version=3 
procedure=0): 1769.2308, total 17692.3077

Note we are almost entirely bound by Ganesha.  Results are progressively
worse than against rpcbind.

Finally, I tried nfsv4:

rpcping tcp localhost threads=1 count=1500 (port=2049 program=100003 version=4 
procedure=0): 25000.0000, total 25000.0000
rpcping tcp localhost threads=2 count=1500 (port=2049 program=100003 version=4 
procedure=0): 13068.1818, total 26136.3636
rpcping tcp localhost threads=3 count=1500 (port=2049 program=100003 version=4 
procedure=0): 4000.0000, total 12000.0000
rpcping tcp localhost threads=5 count=1500 (port=2049 program=100003 version=4 
procedure=0): 2743.1290, total 13715.6448

rpcping tcp localhost threads=7 count=500 (port=2049 program=100003 version=4 
procedure=0): 2521.0084, total 17647.0588
rpcping tcp localhost threads=10 count=500 (port=2049 program=100003 version=4 
procedure=0): 1731.3390, total 17313.3903
rpcping tcp localhost threads=15 count=500 (port=2049 program=100003 version=4 
procedure=0): 1142.3732, total 17135.5981


On 3/8/18 8:03 PM, William Allen Simpson wrote:

rpcping tcp localhost threads=3 count=100 (program=100000 version=4 
procedure=0): 6666.6667, total 20000.0000
rpcping tcp localhost threads=5 count=100 (program=100000 version=4 
procedure=0): 10000.0000, total 50000.0000
rpcping tcp localhost threads=7 count=100 (program=100000 version=4 
procedure=0): 8571.4286, total 60000.0000
rpcping tcp localhost threads=10 count=100 (program=100000 version=4 
procedure=0): 7000.0000, total 70000.0000
rpcping tcp localhost threads=15 count=100 (program=100000 version=4 
procedure=0): 5666.6667, total 85000.0000
rpcping tcp localhost threads=20 count=100 (program=100000 version=4 
procedure=0): 3750.0000, total 75000.0000
rpcping tcp localhost threads=25 count=100 (program=100000 version=4 
procedure=0): 2420.0000, total 60500.0000





Re: [Nfs-ganesha-devel] rpcping

2018-03-08 Thread William Allen Simpson

On 3/8/18 12:33 PM, William Allen Simpson wrote:

Still having no luck.  Instead of relying on RPC itself, checked
with Ganesha about what it registers, and tried some of those.


Without running Ganesha, rpcinfo reports portmapper services by default
on my machine.  Can talk to it via localhost (but not 127.0.0.1 loopback).

bill@simpson91:~/rdma/build_ganesha$ rpcinfo
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /run/rpcbind.sock      portmapper superuser
    100000    3    local     /run/rpcbind.sock      portmapper superuser

TCP works.  UDP with the same parameters hangs forever.

tests/rpcping tcp localhost 1 1000 100000 4
rpcping tcp localhost threads=1 count=1000 (program=100000 version=4 
procedure=0): 50000.0000, total 50000.0000
tests/rpcping tcp localhost 1 10000 100000 4
rpcping tcp localhost threads=1 count=10000 (program=100000 version=4 
procedure=0): 17543.8596, total 17543.8596
tests/rpcping tcp localhost 1 100000 100000 4
^C

What's interesting to me is that 1,000 async calls have much
better throughput (calls per second) than 10,000.  Hard to
say where the bottleneck is without profiling.

100,000 async calls bog down for so long that I gave up.  Same
with 2 threads and 10,000 -- or 3 threads down to 100.

tests/rpcping tcp localhost 2 1000 100000 4
rpcping tcp localhost threads=2 count=1000 (program=100000 version=4 
procedure=0): 8333.3333, total 16666.6667
tests/rpcping tcp localhost 2 10000 100000 4
^C

tests/rpcping tcp localhost 3 1000 100000 4
^C
tests/rpcping tcp localhost 3 500 100000 4
^C
tests/rpcping tcp localhost 3 100 100000 4
rpcping tcp localhost threads=3 count=100 (program=100000 version=4 
procedure=0): 6666.6667, total 20000.0000
tests/rpcping tcp localhost 5 100 100000 4
rpcping tcp localhost threads=5 count=100 (program=100000 version=4 
procedure=0): 10000.0000, total 50000.0000
tests/rpcping tcp localhost 7 100 100000 4
rpcping tcp localhost threads=7 count=100 (program=100000 version=4 
procedure=0): 8571.4286, total 60000.0000
tests/rpcping tcp localhost 10 100 100000 4
rpcping tcp localhost threads=10 count=100 (program=100000 version=4 
procedure=0): 7000.0000, total 70000.0000
tests/rpcping tcp localhost 15 100 100000 4
rpcping tcp localhost threads=15 count=100 (program=100000 version=4 
procedure=0): 5666.6667, total 85000.0000
tests/rpcping tcp localhost 20 100 100000 4
rpcping tcp localhost threads=20 count=100 (program=100000 version=4 
procedure=0): 3750.0000, total 75000.0000
tests/rpcping tcp localhost 25 100 100000 4
rpcping tcp localhost threads=25 count=100 (program=100000 version=4 
procedure=0): 2420.0000, total 60500.0000

Note that 5 threads and 100 catches up to 1 thread and 1,000?

So the bottleneck is probably in ntirpc.  That seems validated by 7 to
25 threads; portmapper will handle more requests (with diminishing
returns), but ntirpc cannot handle more results (on the same thread).

Oh well, against nfs-ganesha still doesn't work.

tests/rpcping tcp localhost 1 10 100003 4
clnt_ncreate failed: RPC: Unknown protocol
tests/rpcping tcp localhost 1 10 100003 3
clnt_ncreate failed: RPC: Unknown protocol

But it's in the rpcinfo:

   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /run/rpcbind.sock      portmapper superuser
    100000    3    local     /run/rpcbind.sock      portmapper superuser
    100003    3    udp       0.0.0.0.8.1            nfs        superuser
    100003    3    udp6      :::0.0.0.0.8.1