Re: [vpp-dev] [csit-dev] Faster PAPI

2021-04-09 Thread Paul Vinciguerra
It seems to me that the root of the problem lies in points 2/6. Can we address
the issue by adding support for an iterator/generator type to the API, passing
it across the wire, and having the API service construct the concrete
commands on the VPP side?

If I misunderstand and these are all separate problems with PAPI, then we
are better off addressing the deficiencies in a way that is consistent across
all clients rather than looking for side-channel alternatives.

Paul




Re: [vpp-dev] [csit-dev] Faster PAPI

2021-04-08 Thread Vratko Polak -X (vrpolak - PANTHEON TECHNOLOGIES at Cisco) via lists.fd.io
Now back to this branch of the conversation.

1a) would be:
>> That means using VAT for some operations (e.g. adding multiple routes [5]),
>> and creating "exec scripts" [6] for operations without VAT one-liner.

> 1c) Use a debug CLI script file and "exec"

Very similar to 1a), but replacing VAT1 with exec and CLI.
I think lately VAT1 is actually more stable than CLI,
and one-liners are faster than long scripts,
so there are no real pros for 1c) (unless VAT1 is going to be deprecated).

> 1b) Pre-generate a JSON file with all commands and load that into VPP with 
> VAT2.

After thinking about this, I think it is worth a try.
VAT2 uses simple logic to deserialize data and call the binary API,
making it fairly resistant to behavior changes
(as we have crcchecker to guard production APIs).
On the CSIT side the code would not look much different
from the 1a) case we support today.
I will probably create a proof of concept in Gerrit
to see what the performance is.

Unlike pure PAPI solutions,
many keywords in CSIT would need to know how to emit
their command for the JSON+VAT2 call (not just for PAPI+socket),
but that is not much trouble.
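
As a rough illustration of what the CSIT side of 1b) could look like,
here is a minimal sketch that writes such a command file. The exact JSON
layout VAT2 expects (the "_msgname" key, field names) is an assumption
and would have to be checked against the VAT2 documentation:

    import json

    def emit_route_add_commands(count, sw_if_index, path):
        """Write a JSON file of ip_route_add_del commands for VAT2 to replay.

        The "_msgname" key and the field layout are assumptions about the
        VAT2 input format; verify against the real schema before relying on it.
        """
        commands = []
        for i in range(count):
            prefix = f"2001:db8:{i >> 16:x}:{i & 0xffff:x}::/64"
            commands.append({
                "_msgname": "ip_route_add_del",
                "is_add": True,
                "route": {
                    "prefix": prefix,
                    "n_paths": 1,
                    "paths": [{"sw_if_index": sw_if_index}],
                },
            })
        with open(path, "w") as out:
            json.dump(commands, out)

    emit_route_add_commands(1_000_000, 1, "/tmp/routes.json")
    # The file is then fed to VAT2 on the VPP machine
    # (see the VAT2 help for the exact invocation).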

>> 2. Support "vector operations" in VPP via binary API.
>
> Could you elaborate on this idea?

I think your 6) below is the same thing.

>> 3. VPP PAPI improvements only.
>> No changes to VPP API, just changes to PAPI to allow better speed for socket 
>> interactions.

This is what https://gerrit.fd.io/r/c/vpp/+/31920
(mentioned in the other branch of this conversation)
is an example of.

>> CSIT would need a fast way to synthesize binary messages.
>
> What would this be?
> E.g. we could do serializations in C with a small Python wrapper.

There are multiple possibilities. The main thing is
that this would be done purely using CSIT code,
so there is no dependency on VPP PAPI (or any other) code.
(So it does not need to be decided in this conversation.)

https://gerrit.fd.io/r/c/csit/+/26019/140/resources/libraries/python/bytes_template.py
is an example. It is Python, but fast enough for CSIT purposes.
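
Roughly, the idea there is to let PAPI serialize one representative message,
remember where the fast-changing fields sit in the resulting bytes, and then
produce each further message by patching those bytes in place instead of
re-serializing the whole argument dict. A simplified sketch of that approach
(not the actual bytes_template.py code; the offsets in the usage comment are
made up for illustration):

    import struct

    class BytesTemplate:
        """Produce messages by patching fields into a pre-serialized template.

        'template' is one message already serialized by PAPI;
        'slots' maps a field name to (offset, struct format) inside it.
        """

        def __init__(self, template, slots):
            self.template = bytes(template)
            self.slots = slots

        def render(self, **fields):
            buf = bytearray(self.template)
            for name, value in fields.items():
                offset, fmt = self.slots[name]
                struct.pack_into(fmt, buf, offset, value)
            return bytes(buf)

    # Hypothetical usage, offsets invented for illustration only:
    # tmpl = BytesTemplate(first_serialized_msg,
    #                      {"context": (10, ">I"), "addr_low": (42, ">Q")})
    # msg = tmpl.render(context=7, addr_low=route_index)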

>> 4. CSIT hacks only (Gerrit 26019).
>
> This is the idea explained below, where you serialize the message once and
> replay it, right?
> In addition to tweaking the reader thread etc.?

> 5) Paul's proposal. I don't know if he has measured the performance impact of
> that.

Which one, vapi+swig?
I am not sure what the complete solution looks like.
I believe vapi needs to be on the VPP machine,
but the CSIT Python stuff (for swig) is on a separate machine.
I do not see swig having any transport capabilities,
so we would need to insert them (socket, file transfer, something else)
somewhere, and I do not see a good place.

> 6) The "add 4 million routes" procedure that CSIT uses generates routes according
> to a set of parameters.
>    We could add a small plugin on the VPP side that exposes an API of the sort
>    "create  routes from ".

Subtypes:
6a) Official plugin in VPP repo, needs a maintainer.
6b) Plugin outside VPP repo (perhaps in CSIT). Needs to be built and deployed.

> Is it the serialisation, or the message RTT or both?
> If serialisation, doing it in C should help.
> If it's the RTT, larger messages are one option.

There are three known bottlenecks: two for RTT, one for serialization.
I do not recall the real numbers,
but for the sake of discussion you can assume
the bottlenecks are removed in the following order,
and that each removed bottleneck halves the configuration time.

First bottleneck: Sync commands.
Solution: Send multiple commands before reading the replies.
No support from VPP is needed, except maybe reporting
how big the last sent message was (to avoid UDS buffers getting full).
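
A sketch of that pattern; send_one and read_reply stand for whatever the
client uses to write one request and read one reply back, the point is only
that replies are drained per batch rather than per request:

    WINDOW = 256  # requests to keep in flight before draining replies

    def pipelined_send(messages, send_one, read_reply):
        """Send requests in batches, reading replies only between batches."""
        outstanding = 0
        replies = []
        for msg in messages:
            send_one(msg)
            outstanding += 1
            if outstanding >= WINDOW:
                # Drain the whole window before sending more, so the
                # UDS buffers never fill up on either side.
                replies.extend(read_reply() for _ in range(outstanding))
                outstanding = 0
        replies.extend(read_reply() for _ in range(outstanding))
        return replies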

Second bottleneck: The background thread VPP PAPI code uses
to read replies asynchronously from the socket, deserialize them,
and put them into a queue for the user to read.
Solution: Hack vpp_transport_socket.VppTransport after connect
to stop message_thread and read from the socket directly.
Or use a different VppTransport implementation
that does not start the thread (nor the multiprocessing queues) in the first place.
26019 does the former, 31920 does the latter.
This bottleneck is the primary reason I started this conversation.
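
For reference, once the background thread is out of the way, replies can be
read straight from the transport socket. A sketch of such a reader; the
16-byte ">QII" header with the payload length in its second field is an
assumption about the socket framing and should be checked against
vpp_transport_socket.py:

    import struct

    # Assumed framing of the PAPI socket transport: a 16-byte big-endian
    # header whose middle field is the length of the payload that follows.
    HEADER = struct.Struct(">QII")

    def read_exact(sock, count):
        """Read exactly 'count' bytes from a blocking socket."""
        data = b""
        while len(data) < count:
            chunk = sock.recv(count - len(data))
            if not chunk:
                raise ConnectionError("socket closed while reading a reply")
            data += chunk
        return data

    def read_reply_bytes(sock):
        """Read one framed reply directly, bypassing the reader thread."""
        header = read_exact(sock, HEADER.size)
        _, msg_len, _ = HEADER.unpack(header)
        return read_exact(sock, msg_len)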

Third bottleneck: Serialization of many commands (and deserialization of 
responses).
I do not recall if the slower part is VPP PAPI code (vpp_serializer)
or CSIT preparing the arguments to serialize. Say it is both.
There are many solutions to this one; they are important from the CSIT reviewers'
point of view, but not that important for VPP developers (PAPI or otherwise).

Fourth bottleneck: SSH forwarding the volume of data between
socket endpoints on different machines.
Solution: Forward just a single command, and have a utility/plugin/whatever
on the VPP machine execute the implied bulk work quickly.
This sidesteps all the previous bottlenecks,
so it is the secondary reason for this conversation.

Personally, I do not like putting CSIT utilities on the VPP machine,
but sometimes it is the least evil (e.g. we do that for reading the stats segment).

1a) with VAT1 is a fourth bottleneck solution,

Re: [vpp-dev] [csit-dev] Faster PAPI

2021-02-03 Thread Ole Troan
> 1. Keep the status quo.
> That means using VAT for some operations (e.g. adding multiple routes [5]),
> and creating "exec scripts" [6] for operations without VAT one-liner.
> Pros: No work needed, good speed, old VPP versions are supported.
> Cons: Relying on VAT functionality (outside API compatibility rules).

1b) Pre-generate a JSON file with all commands and load that into VPP with VAT2.
1c) Use a debug CLI script file and "exec"

> 2. Support "vector operations" in VPP via binary API.
> This will probably need a new VPP plugin to host the implementations.
> Pros: Fast speed, small CSIT work, guarded by API compatibility rules.
> Cons: New VPP plugin of questionable usefulness outside CSIT,
> plugin maintainer needed, old VPP versions not supported.

Could you elaborate on this idea?

> 3. VPP PAPI improvements only.
> No changes to VPP API, just changes to PAPI to allow better speed for socket 
> interactions.
> CSIT would need a fast way to synthesize binary messages.
> Pros: Small VPP work, good speed, only "official" VPP API is used.
> Cons: Brittle CSIT message handling, old VPP versions not supported.

What would this be?
E.g. we could do serializations in C with a small Python wrapper.
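
If we went that way, the shape could be a small C helper compiled into a
shared library and driven through ctypes. The library name and function
below are entirely hypothetical; the point is only that the per-message
field packing happens in C while Python hands over bulk parameters once:

    import ctypes

    # Hypothetical shared library and entry point; nothing with these names
    # exists in VPP or CSIT today.
    lib = ctypes.CDLL("./libcsit_fastser.so")
    lib.serialize_route_adds.restype = ctypes.c_size_t
    # Assumed C prototype:
    #   size_t serialize_route_adds(uint8_t *buf, size_t buf_len,
    #                               uint32_t first_context, uint32_t count);

    def serialize_batch(first_context, count, max_msg_size=120):
        """Have the C helper fill one buffer with 'count' serialized messages."""
        buf = ctypes.create_string_buffer(count * max_msg_size)
        used = lib.serialize_route_adds(buf, len(buf), first_context, count)
        return buf.raw[:used]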

> 4. CSIT hacks only (Gerrit 26019).
> No changes to VPP API nor PAPI. CSIT code messes with PAPI internals.
> CSIT needs a fast way to synthesize binary messages.
> Pros: Code is ready, good speed, old VPP versions are supported.
> Cons: Brittle CSIT message handling, risky with respect to VPP PAPI changes.

This is the idea explained below, where you serialize the message once and replay
it, right?
In addition to tweaking the reader thread etc.?

5) Paul's proposal. I don't know if he has measured the performance impact of that.

6) The "add 4 million routes" procedure that CSIT uses generates routes according
to a set of parameters.
   We could add a small plugin on the VPP side that exposes an API of the sort
   "create  routes from ".

Is it the serialisation, or the message RTT or both?
If serialisation, doing it in C should help.
If it's the RTT, larger messages are one option.
Uploading all messages to VPP and then asking VPP to process them, receiving a
single reply, is also an option.
Is this your "vector" idea?

> The open questions:
> Do you see any other options?
> Did I miss some important pros or cons?
> Which option do you prefer?

The lowest hanging fruit is likely 6. But longer term I'd prefer a more generic 
solution.

Best regards,
Ole

> [2] https://lists.fd.io/g/vpp-dev/topic/78362835#18092
> [3] https://gerrit.fd.io/r/c/csit/+/26019/140
> [4] 
> https://gerrit.fd.io/r/c/csit/+/26019/140#message-314d168d8951b539e588e644a875624f5ca3fb77
> [5] 
> https://github.com/FDio/csit/blob/b5073afc4a941ea33ce874e016fe86384ae7a60d/resources/templates/vat/vpp_route_add.vat
> [6] 
> https://github.com/FDio/csit/blob/b5073afc4a941ea33ce874e016fe86384ae7a60d/resources/libraries/python/TestConfig.py#L121-L150
> 
> From: vpp-dev@lists.fd.io  On Behalf Of Vratko Polak -X 
> (vrpolak - PANTHEON TECHNOLOGIES at Cisco) via lists.fd.io
> Sent: Thursday, 2020-May-07 18:35
> To: vpp-dev@lists.fd.io
> Cc: csit-...@lists.fd.io
> Subject: [vpp-dev] Faster PAPI
> 
> Hello people interested in PAPI (VPP's Python API client library).
> 
> In CSIT, our tests are using PAPI to interact with VPP.
> We are using socket transport (instead of shared memory transport),
> as VPP is running on machines separate from machines running the tests.
> We use SSH to forward the socket between the machines.
> 
> Some of our scale tests need to send a high number of commands towards VPP.
> The largest test sends 4 million commands (ip_route_add_del with ip6 
> addresses).
> You can imagine that can take a while.
> Even using PAPI in asynchronous mode, it takes tens of minutes per million 
> commands.
> 
> I was able to speed that up considerably, just by changing code on the CSIT side.
> The exact code change is [0], but that may be hard to review.
> Gerrit does not even recognize the new PapiSocketExecutor.py
> as an edited copy of the old PapiExecutor.py file.
> 
> That code relies on the fact that Python is quite a permissive language,
> not really distinguishing private fields and methods from public ones.
> So the current code is vulnerable to refactors of VPP PAPI code.
> Also, pylint (the static code analysis tool CSIT uses) is complaining.
> 
> The proper fix is to change the VPP PAPI code,
> so that it exposes the inner parts the new CSIT code needs to access
> (or some abstractions of them).
> 
> For that I have created [1], which shows the changed VPP PAPI code.
> The commit message contains a simplified example of how the new features can
> be used.
> 
> The changed VPP code allows three performance improvements.
> 
> 1. Capturing raw bytes sent.
> For complicated commands, many CPU cycles are spent serializing
> command arguments (basically nested Python dicts) into bytes.
> If the user (CSIT code) has access to the message as serialized by PAPI (VPP
> code),
>