Re: [patch net-next v2 00/10] Add support for resource abstraction
Thu, Jan 25, 2018 at 04:38:41PM CET, d...@cumulusnetworks.com wrote: >On 1/25/18 8:24 AM, Jiri Pirko wrote: > Now I remember. You wrote it independently and but needed iproute2 be a > delivery vehicle. It uses none of the common infrastructure from > iproute2. Could we make this more difficult Feel free to rewrite it to use lib/libnetlink.c. Should not be that hard. Note that at the time I was pushing devlink userspace, tipc also used libmnl as a part of iproute2, so devlink was not the first one. That is why I decided not to rewrite. As of the rest of the "common infrastructure", what exactly do you have in mind? >>> >>> This is what I am getting at. Apparently, these resource patches for >>> devlink require a patched libmnl to work properly. It is wrong for >> >> What do you mean, "work properly". If there is no patched libmnl, only >> thing what happens is that the extack won't be processed and shown. >> That is the same behaviour as if you compile iproute2 package without >> libmnl - all works, used just don't see extack messages. >> >> I don't see any problem in that. Do you? > >I have libmnl installed. I build iproute2. I don't get extended ack >messages on devlink failures. That is a problem. That is building a >command that is known not to work as it should given the build dependencies. Known to work, but not for devlink. But nevermind.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/25/18 8:24 AM, Jiri Pirko wrote: Now I remember. You wrote it independently and but needed iproute2 be a delivery vehicle. It uses none of the common infrastructure from iproute2. Could we make this more difficult >>> >>> Feel free to rewrite it to use lib/libnetlink.c. Should not be that >>> hard. Note that at the time I was pushing devlink userspace, tipc also >>> used libmnl as a part of iproute2, so devlink was not the first one. >>> That is why I decided not to rewrite. >>> >>> As of the rest of the "common infrastructure", what exactly do you >>> have in mind? >>> >> >> This is what I am getting at. Apparently, these resource patches for >> devlink require a patched libmnl to work properly. It is wrong for > > What do you mean, "work properly". If there is no patched libmnl, only > thing what happens is that the extack won't be processed and shown. > That is the same behaviour as if you compile iproute2 package without > libmnl - all works, used just don't see extack messages. > > I don't see any problem in that. Do you? I have libmnl installed. I build iproute2. I don't get extended ack messages on devlink failures. That is a problem. That is building a command that is known not to work as it should given the build dependencies.
Re: [patch net-next v2 00/10] Add support for resource abstraction
Thu, Jan 04, 2018 at 05:58:51PM CET, d...@cumulusnetworks.com wrote: >On 1/3/18 11:36 AM, Jiri Pirko wrote: >> Wed, Jan 03, 2018 at 07:29:46PM CET, d...@cumulusnetworks.com wrote: >>> On 1/3/18 11:17 AM, Jiri Pirko wrote: Wed, Jan 03, 2018 at 07:14:16PM CET, d...@cumulusnetworks.com wrote: > On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: >> As I stated this is a user-space bug which I fixed, and updated my repo >> so please pull. Devlink uses mnl,and currently mnl does not support >> extended ack. I added support for this in my local ver of libmnl: >> >> https://github.com/arkadis/libmnl.git >> >> On branch master, so you can check it out. Besides this bugs, which were >> userspace, can please specify what are the pending problems from your >> point of view? Thanks! > > devlink is in iproute2 package and it has extack support. See 'git log > lib/libnetlink.c' Dave, devlink uses libmnl. >>> >>> Now I remember. You wrote it independently and but needed iproute2 be a >>> delivery vehicle. It uses none of the common infrastructure from >>> iproute2. Could we make this more difficult >> >> Feel free to rewrite it to use lib/libnetlink.c. Should not be that >> hard. Note that at the time I was pushing devlink userspace, tipc also >> used libmnl as a part of iproute2, so devlink was not the first one. >> That is why I decided not to rewrite. >> >> As of the rest of the "common infrastructure", what exactly do you >> have in mind? >> > >This is what I am getting at. Apparently, these resource patches for >devlink require a patched libmnl to work properly. It is wrong for What do you mean, "work properly". If there is no patched libmnl, only thing what happens is that the extack won't be processed and shown. That is the same behaviour as if you compile iproute2 package without libmnl - all works, used just don't see extack messages. I don't see any problem in that. Do you? >iproute2 to accept this patch and to build a devlink command that we >know does not work. That means iproute2 needs a dependency on a specific >version of libmnl - a version which does not yet exist because those >changes have not been accepted and libmnl version released. Adding that >dependency is going to inconvenience to all current users of the >iproute2 repo.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/4/18 9:13 AM, Arkadi Sharshevsky wrote: Also, it seems like the occ of 0 is wrong since we know from past responses that if I set linear to 0 all of networking breaks. Ok, this was a David bug. I was running ifreload after the devlink reload command, but all of my connections to the switch are through breakout ports and the ifreload was not running the devlink port split command. Doing that before the ifreload and all is fine.
Re: [patch net-next v2 00/10] Add support for resource abstraction
From: David AhernDate: Thu, 4 Jan 2018 09:58:51 -0700 > This is what I am getting at. Apparently, these resource patches for > devlink require a patched libmnl to work properly. It is wrong for > iproute2 to accept this patch and to build a devlink command that we > know does not work. That means iproute2 needs a dependency on a specific > version of libmnl - a version which does not yet exist because those > changes have not been accepted and libmnl version released. Adding that > dependency is going to inconvenience to all current users of the > iproute2 repo. +1
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/3/18 11:36 AM, Jiri Pirko wrote: > Wed, Jan 03, 2018 at 07:29:46PM CET, d...@cumulusnetworks.com wrote: >> On 1/3/18 11:17 AM, Jiri Pirko wrote: >>> Wed, Jan 03, 2018 at 07:14:16PM CET, d...@cumulusnetworks.com wrote: On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: > As I stated this is a user-space bug which I fixed, and updated my repo > so please pull. Devlink uses mnl,and currently mnl does not support > extended ack. I added support for this in my local ver of libmnl: > > https://github.com/arkadis/libmnl.git > > On branch master, so you can check it out. Besides this bugs, which were > userspace, can please specify what are the pending problems from your > point of view? Thanks! devlink is in iproute2 package and it has extack support. See 'git log lib/libnetlink.c' >>> >>> Dave, devlink uses libmnl. >>> >> >> Now I remember. You wrote it independently and but needed iproute2 be a >> delivery vehicle. It uses none of the common infrastructure from >> iproute2. Could we make this more difficult > > Feel free to rewrite it to use lib/libnetlink.c. Should not be that > hard. Note that at the time I was pushing devlink userspace, tipc also > used libmnl as a part of iproute2, so devlink was not the first one. > That is why I decided not to rewrite. > > As of the rest of the "common infrastructure", what exactly do you > have in mind? > This is what I am getting at. Apparently, these resource patches for devlink require a patched libmnl to work properly. It is wrong for iproute2 to accept this patch and to build a devlink command that we know does not work. That means iproute2 needs a dependency on a specific version of libmnl - a version which does not yet exist because those changes have not been accepted and libmnl version released. Adding that dependency is going to inconvenience to all current users of the iproute2 repo.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/4/18 9:13 AM, Arkadi Sharshevsky wrote: > > > On 01/04/2018 05:58 PM, David Ahern wrote: >> On 1/4/18 2:24 AM, Arkadi Sharshevsky wrote: Again, my comments all stem from user experience. Can you explain what "double_word" means for a unit? I would expect a units to be kB or count (or items or entries). >>> >>> Double word is 64 bit, dont understand why this is confusing. >> >> As Andrew pointed out, double word can have a range of sizes. >> >> To my point here, a 'unit' for a number should not be the number of bits >> it is represented by. I believe all of the kvd sizes are in entries >> (ie., a linear size of 98304 means I can have 98,304 entries in that >> resource). >> >>> $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 unit double_word size_valid true resources: name linear size 98304 occ 0 unit double_word name hash_double size 60416 unit double_word name hash_single size 87040 unit double_word While that is confusing here from the userspace command it goes hand in hand with patch 2 and: +enum devlink_resource_unit { + DEVLINK_RESOURCE_UNIT_DOUBLE_WORD, +}; Also, it seems like the occ of 0 is wrong since we know from past responses that if I set linear to 0 all of networking breaks. >>> >>> You are clearly misunderstanding what is occupancy of the resource >>> and what kvd linear is good for. As I mentioned in the last response >>> kvd linear is mapped to adjacency table. So in case its 0 no nexthop >>> routes could be configured, this information is provided by the >>> dpipe<-->resource. >>> >>> Occupancy means how much of the resource is used right now, why is >>> this wrong? and how its related to the size 0 exactly? >> >> The summary line above shows the current kvd/linear occupancy is '0'. >> That suggests my L3 only deployment is not using any kvd/linear which >> means I can set its allocation to 0 and devote more kvd resources to the >> hash regions. >> >> But, as I pointed out in previous responses I can not set linear to 0. >> If I do all of networking breaks. That suggests that kvd/linear is used >> by more networking entities than you are currently reporting. Hence, >> telling me the kvd/linear occupancy is 0 is wrong. >> >> I believe the stems from the how you are determining occupancy and the >> fact that not all tables have been implemented. Showing the current >> occupancy of a resource is very helpful, so it should be part of the API. >> >> However, until mlxsw shows a proper value (either by implementing all >> dpipe tables or changing how it is calculated) it should not be >> returning that attribute. As it stands you are giving a user invalid data. >> > > Wait, the occupancy is very precise it goes directly to the linear > allocator inside the driver and gives exact current usage. You assumed > it goes to the dpipe tables which is incorrect. I am trying *guess* based on this discussion as to why it is 0. In past responses you push that the tables expose how the resource is used. > > The linear kvd memory is used by: > 1. The adjacency table is responsible for the ECMP and nexthop >resolution. > 2. ACL flexible actions blocks (dpipe will contain table for this). > > I dont know what you configured but in case you got some remote > routes with a nexthop the occ should not be zero. > > If your router contains only local routes and dont use ACL you can > shrink it to zero, no problem. My current setup is an L3 only deployment, no ACLs and no bridges. We are meeting in 15 minutes to discuss this design; let's use some of the time for debugging this problem. > >>> How does a user learn the granularity of a resource: $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 5 Error: mlxsw_spectrum: resource set with wrong granularity. Try again with 51000 and then 52000 and ... Why not export the granularity read-only? I don't see it in the proposed UAPI. >>> >>> I would like more adding the granularity size to the extack string >>> instead of adding this to UAPI. The user will try to update once >>> and will get the required granularity in the error message. >> >> A user should not have to run a command to get an error message to know >> essential data to running a command with the right value. Further, do >> not assume 'user' is a person or the devlink command. >> >> Since granularity is essential to running a valid command, it should be >> an attribute for each resource. >> > > This is also your approach towards the min/max for resource sizes? > Am I correct? yes.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 01/04/2018 05:58 PM, David Ahern wrote: > On 1/4/18 2:24 AM, Arkadi Sharshevsky wrote: >>> Again, my comments all stem from user experience. >>> >>> Can you explain what "double_word" means for a unit? I would expect a >>> units to be kB or count (or items or entries). >>> >> >> Double word is 64 bit, dont understand why this is confusing. > > As Andrew pointed out, double word can have a range of sizes. > > To my point here, a 'unit' for a number should not be the number of bits > it is represented by. I believe all of the kvd sizes are in entries > (ie., a linear size of 98304 means I can have 98,304 entries in that > resource). > >> >>> $ devlink resource show pci/:03:00.0 >>> pci/:03:00.0: >>> name kvd size 245760 unit double_word size_valid true >>> resources: >>> name linear size 98304 occ 0 unit double_word >>> name hash_double size 60416 unit double_word >>> name hash_single size 87040 unit double_word >>> >>> While that is confusing here from the userspace command it goes hand in >>> hand with patch 2 and: >>> >>> +enum devlink_resource_unit { >>> + DEVLINK_RESOURCE_UNIT_DOUBLE_WORD, >>> +}; >>> >>> >>> Also, it seems like the occ of 0 is wrong since we know from past >>> responses that if I set linear to 0 all of networking breaks. >> >> You are clearly misunderstanding what is occupancy of the resource >> and what kvd linear is good for. As I mentioned in the last response >> kvd linear is mapped to adjacency table. So in case its 0 no nexthop >> routes could be configured, this information is provided by the >> dpipe<-->resource. >> >> Occupancy means how much of the resource is used right now, why is >> this wrong? and how its related to the size 0 exactly? > > The summary line above shows the current kvd/linear occupancy is '0'. > That suggests my L3 only deployment is not using any kvd/linear which > means I can set its allocation to 0 and devote more kvd resources to the > hash regions. > > But, as I pointed out in previous responses I can not set linear to 0. > If I do all of networking breaks. That suggests that kvd/linear is used > by more networking entities than you are currently reporting. Hence, > telling me the kvd/linear occupancy is 0 is wrong. > > I believe the stems from the how you are determining occupancy and the > fact that not all tables have been implemented. Showing the current > occupancy of a resource is very helpful, so it should be part of the API. > > However, until mlxsw shows a proper value (either by implementing all > dpipe tables or changing how it is calculated) it should not be > returning that attribute. As it stands you are giving a user invalid data. > Wait, the occupancy is very precise it goes directly to the linear allocator inside the driver and gives exact current usage. You assumed it goes to the dpipe tables which is incorrect. The linear kvd memory is used by: 1. The adjacency table is responsible for the ECMP and nexthop resolution. 2. ACL flexible actions blocks (dpipe will contain table for this). I dont know what you configured but in case you got some remote routes with a nexthop the occ should not be zero. If your router contains only local routes and dont use ACL you can shrink it to zero, no problem. >> >>> >>> >>> >>> How does a user learn the granularity of a resource: >>> >>> $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 5 >>> Error: mlxsw_spectrum: resource set with wrong granularity. >>> >>> Try again with 51000 and then 52000 and ... Why not export the >>> granularity read-only? I don't see it in the proposed UAPI. >>> >> >> I would like more adding the granularity size to the extack string >> instead of adding this to UAPI. The user will try to update once >> and will get the required granularity in the error message. > > A user should not have to run a command to get an error message to know > essential data to running a command with the right value. Further, do > not assume 'user' is a person or the devlink command. > > Since granularity is essential to running a valid command, it should be > an attribute for each resource. > This is also your approach towards the min/max for resource sizes? Am I correct? > >> >>> >>> And then on the reload: >>> >>> $ devlink reload pci/:03:00.0 >>> Error: devlink: resources size validation failed. >>> >>> Since the reload is not doing any resource sizing that error message is >>> confusing. Maybe something like "Sum of the resource components exceeds >>> total size." >>> >> >> No problem, sounds better. >> >>> >>> Finally, I still contend a 1-line description of each of the resources >>> goes a long way to improving the user friendliness of this set. >>> >> >> Strongly against it. >> >
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/4/18 2:24 AM, Arkadi Sharshevsky wrote: >> Again, my comments all stem from user experience. >> >> Can you explain what "double_word" means for a unit? I would expect a >> units to be kB or count (or items or entries). >> > > Double word is 64 bit, dont understand why this is confusing. As Andrew pointed out, double word can have a range of sizes. To my point here, a 'unit' for a number should not be the number of bits it is represented by. I believe all of the kvd sizes are in entries (ie., a linear size of 98304 means I can have 98,304 entries in that resource). > >> $ devlink resource show pci/:03:00.0 >> pci/:03:00.0: >> name kvd size 245760 unit double_word size_valid true >> resources: >> name linear size 98304 occ 0 unit double_word >> name hash_double size 60416 unit double_word >> name hash_single size 87040 unit double_word >> >> While that is confusing here from the userspace command it goes hand in >> hand with patch 2 and: >> >> +enum devlink_resource_unit { >> +DEVLINK_RESOURCE_UNIT_DOUBLE_WORD, >> +}; >> >> >> Also, it seems like the occ of 0 is wrong since we know from past >> responses that if I set linear to 0 all of networking breaks. > > You are clearly misunderstanding what is occupancy of the resource > and what kvd linear is good for. As I mentioned in the last response > kvd linear is mapped to adjacency table. So in case its 0 no nexthop > routes could be configured, this information is provided by the > dpipe<-->resource. > > Occupancy means how much of the resource is used right now, why is > this wrong? and how its related to the size 0 exactly? The summary line above shows the current kvd/linear occupancy is '0'. That suggests my L3 only deployment is not using any kvd/linear which means I can set its allocation to 0 and devote more kvd resources to the hash regions. But, as I pointed out in previous responses I can not set linear to 0. If I do all of networking breaks. That suggests that kvd/linear is used by more networking entities than you are currently reporting. Hence, telling me the kvd/linear occupancy is 0 is wrong. I believe the stems from the how you are determining occupancy and the fact that not all tables have been implemented. Showing the current occupancy of a resource is very helpful, so it should be part of the API. However, until mlxsw shows a proper value (either by implementing all dpipe tables or changing how it is calculated) it should not be returning that attribute. As it stands you are giving a user invalid data. > >> >> >> >> How does a user learn the granularity of a resource: >> >> $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 5 >> Error: mlxsw_spectrum: resource set with wrong granularity. >> >> Try again with 51000 and then 52000 and ... Why not export the >> granularity read-only? I don't see it in the proposed UAPI. >> > > I would like more adding the granularity size to the extack string > instead of adding this to UAPI. The user will try to update once > and will get the required granularity in the error message. A user should not have to run a command to get an error message to know essential data to running a command with the right value. Further, do not assume 'user' is a person or the devlink command. Since granularity is essential to running a valid command, it should be an attribute for each resource. > >> >> And then on the reload: >> >> $ devlink reload pci/:03:00.0 >> Error: devlink: resources size validation failed. >> >> Since the reload is not doing any resource sizing that error message is >> confusing. Maybe something like "Sum of the resource components exceeds >> total size." >> > > No problem, sounds better. > >> >> Finally, I still contend a 1-line description of each of the resources >> goes a long way to improving the user friendliness of this set. >> > > Strongly against it. >
Re: [patch net-next v2 00/10] Add support for resource abstraction
> Double word is 64 bit, dont understand why this is confusing. In an ASCI, the definition of a word can be quite flexible. I've seen designs using 14 bit words, since 14 bits was all that was needed to represent the data to be held. I've also seen a 16 bit word used to hold a signed value, with the binary point before the last nibble, so it could hold -2047.9375 to +2047.9375. I might have that wrong. It gave me a headache at the time, but the synthesizer had no such problems. Why not use u8, u16, u14, u32, u64, which we can all understand without confusion. Andrew
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 01/04/2018 04:28 AM, David Ahern wrote: > On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: >> >> >> On 01/02/2018 08:05 PM, David Ahern wrote: >>> On 1/1/18 7:58 AM, Arkadi Sharshevsky wrote: Just to summarize the current fixes required: 1. ERIF dpipe table size is reporting wrong size. More precisely the ERIF table does not take rifs, so it should not be linked to the rif bank resource (is not part of this patchset, future extension). 2. Extended ACK user-space bug. 3. ABI documentation- Not sure we agreed upon it, Jiri? If I missed something please respond. Nothing of the fixes mentioned above is relevant for this patchset actually. >>> >>> Can you fix the userspace command and then we come back to what else is >>> needed? Right now, it is hard to tell what is a user space bug and what >>> is a kernel space bug. >>> >>> For example: >>> $ devlink resource set pci/:03:00.0 path /kvd/linear size 1 >>> $ devlink resource show pci/:03:00.0 >>> pci/:03:00.0: >>> name kvd size 245760 size_valid true >>> resources: >>> name linear size 98304 occ 0 >>> name hash_double size 60416 >>> name hash_single size 87040 >>> >>> The set command did not fail, yet there is no size_new arg in the output >>> like there is for this change: >>> >>> $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 >>> $ devlink resource show pci/:03:00.0 >>> pci/:03:00.0: >>> name kvd size 245760 size_valid true >>> resources: >>> name linear size 98304 size_new 0 occ 0 >>> name hash_double size 60416 >>> name hash_single size 87040 >>> >> >> As I stated this is a user-space bug which I fixed, and updated my repo >> so please pull. Devlink uses mnl,and currently mnl does not support >> extended ack. I added support for this in my local ver of libmnl: >> >> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Farkadis%2Flibmnl.git=02%7C01%7Carkadis%40mellanox.com%7C9a369b54cdec48a5e1d208d5531adbdb%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636506297202356822=6KOkBz5PHqu6z8nlexSdggZj42LE4geiYVFA%2BgcgaPE%3D=0 >> >> On branch master, so you can check it out. Besides this bugs, which were >> userspace, can please specify what are the pending problems from your >> point of view? Thanks! >> > > Again, my comments all stem from user experience. > > Can you explain what "double_word" means for a unit? I would expect a > units to be kB or count (or items or entries). > Double word is 64 bit, dont understand why this is confusing. > $ devlink resource show pci/:03:00.0 > pci/:03:00.0: > name kvd size 245760 unit double_word size_valid true > resources: > name linear size 98304 occ 0 unit double_word > name hash_double size 60416 unit double_word > name hash_single size 87040 unit double_word > > While that is confusing here from the userspace command it goes hand in > hand with patch 2 and: > > +enum devlink_resource_unit { > + DEVLINK_RESOURCE_UNIT_DOUBLE_WORD, > +}; > > > Also, it seems like the occ of 0 is wrong since we know from past > responses that if I set linear to 0 all of networking breaks. You are clearly misunderstanding what is occupancy of the resource and what kvd linear is good for. As I mentioned in the last response kvd linear is mapped to adjacency table. So in case its 0 no nexthop routes could be configured, this information is provided by the dpipe<-->resource. Occupancy means how much of the resource is used right now, why is this wrong? and how its related to the size 0 exactly? > > > > How does a user learn the granularity of a resource: > > $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 5 > Error: mlxsw_spectrum: resource set with wrong granularity. > > Try again with 51000 and then 52000 and ... Why not export the > granularity read-only? I don't see it in the proposed UAPI. > I would like more adding the granularity size to the extack string instead of adding this to UAPI. The user will try to update once and will get the required granularity in the error message. > > And then on the reload: > > $ devlink reload pci/:03:00.0 > Error: devlink: resources size validation failed. > > Since the reload is not doing any resource sizing that error message is > confusing. Maybe something like "Sum of the resource components exceeds > total size." > No problem, sounds better. > > Finally, I still contend a 1-line description of each of the resources > goes a long way to improving the user friendliness of this set. > Strongly against it.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: > > > On 01/02/2018 08:05 PM, David Ahern wrote: >> On 1/1/18 7:58 AM, Arkadi Sharshevsky wrote: >>> >>> Just to summarize the current fixes required: >>> >>> 1. ERIF dpipe table size is reporting wrong size. More precisely the >>>ERIF table does not take rifs, so it should not be linked to the rif >>>bank resource (is not part of this patchset, future extension). >>> 2. Extended ACK user-space bug. >>> 3. ABI documentation- Not sure we agreed upon it, Jiri? >>> >>> If I missed something please respond. Nothing of the fixes mentioned >>> above is relevant for this patchset actually. >>> >> >> Can you fix the userspace command and then we come back to what else is >> needed? Right now, it is hard to tell what is a user space bug and what >> is a kernel space bug. >> >> For example: >> $ devlink resource set pci/:03:00.0 path /kvd/linear size 1 >> $ devlink resource show pci/:03:00.0 >> pci/:03:00.0: >> name kvd size 245760 size_valid true >> resources: >> name linear size 98304 occ 0 >> name hash_double size 60416 >> name hash_single size 87040 >> >> The set command did not fail, yet there is no size_new arg in the output >> like there is for this change: >> >> $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 >> $ devlink resource show pci/:03:00.0 >> pci/:03:00.0: >> name kvd size 245760 size_valid true >> resources: >> name linear size 98304 size_new 0 occ 0 >> name hash_double size 60416 >> name hash_single size 87040 >> > > As I stated this is a user-space bug which I fixed, and updated my repo > so please pull. Devlink uses mnl,and currently mnl does not support > extended ack. I added support for this in my local ver of libmnl: > > https://github.com/arkadis/libmnl.git > > On branch master, so you can check it out. Besides this bugs, which were > userspace, can please specify what are the pending problems from your > point of view? Thanks! > Again, my comments all stem from user experience. Can you explain what "double_word" means for a unit? I would expect a units to be kB or count (or items or entries). $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 unit double_word size_valid true resources: name linear size 98304 occ 0 unit double_word name hash_double size 60416 unit double_word name hash_single size 87040 unit double_word While that is confusing here from the userspace command it goes hand in hand with patch 2 and: +enum devlink_resource_unit { + DEVLINK_RESOURCE_UNIT_DOUBLE_WORD, +}; Also, it seems like the occ of 0 is wrong since we know from past responses that if I set linear to 0 all of networking breaks. How does a user learn the granularity of a resource: $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 5 Error: mlxsw_spectrum: resource set with wrong granularity. Try again with 51000 and then 52000 and ... Why not export the granularity read-only? I don't see it in the proposed UAPI. And then on the reload: $ devlink reload pci/:03:00.0 Error: devlink: resources size validation failed. Since the reload is not doing any resource sizing that error message is confusing. Maybe something like "Sum of the resource components exceeds total size." Finally, I still contend a 1-line description of each of the resources goes a long way to improving the user friendliness of this set.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 01/03/2018 08:29 PM, David Ahern wrote: > On 1/3/18 11:17 AM, Jiri Pirko wrote: >> Wed, Jan 03, 2018 at 07:14:16PM CET, d...@cumulusnetworks.com wrote: >>> On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: As I stated this is a user-space bug which I fixed, and updated my repo so please pull. Devlink uses mnl,and currently mnl does not support extended ack. I added support for this in my local ver of libmnl: https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Farkadis%2Flibmnl.git=02%7C01%7Carkadis%40mellanox.com%7C5c86b6240eb84459c6ae08d552d7f9a4%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636506009929977440=sgrNzMhPwe63BIVxexZTjl%2FXqW51kpuRiHVhTDNaa70%3D=0 On branch master, so you can check it out. Besides this bugs, which were userspace, can please specify what are the pending problems from your point of view? Thanks! >>> >>> devlink is in iproute2 package and it has extack support. See 'git log >>> lib/libnetlink.c' >> >> Dave, devlink uses libmnl. >> > > Now I remember. You wrote it independently and but needed iproute2 be a > delivery vehicle. It uses none of the common infrastructure from > iproute2. Could we make this more difficult > > Sometime in the next day I will jump through the hoops to get a proper > devlink command. > This actually was very confusing, I think the extack should be handled by libmnl and iproute should use mnl_cb_run() routines and not to implement its own. That way we could both benefit from that. You actually do use libmnl in libnetlink.c only for parsing the headers, and its a dependency for extack handling. I see this as a completely independent user space issue, which doesn't have to do anything with this patchset. Not to mention that everything is working right now.
Re: [patch net-next v2 00/10] Add support for resource abstraction
Wed, Jan 03, 2018 at 07:29:46PM CET, d...@cumulusnetworks.com wrote: >On 1/3/18 11:17 AM, Jiri Pirko wrote: >> Wed, Jan 03, 2018 at 07:14:16PM CET, d...@cumulusnetworks.com wrote: >>> On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: As I stated this is a user-space bug which I fixed, and updated my repo so please pull. Devlink uses mnl,and currently mnl does not support extended ack. I added support for this in my local ver of libmnl: https://github.com/arkadis/libmnl.git On branch master, so you can check it out. Besides this bugs, which were userspace, can please specify what are the pending problems from your point of view? Thanks! >>> >>> devlink is in iproute2 package and it has extack support. See 'git log >>> lib/libnetlink.c' >> >> Dave, devlink uses libmnl. >> > >Now I remember. You wrote it independently and but needed iproute2 be a >delivery vehicle. It uses none of the common infrastructure from >iproute2. Could we make this more difficult Feel free to rewrite it to use lib/libnetlink.c. Should not be that hard. Note that at the time I was pushing devlink userspace, tipc also used libmnl as a part of iproute2, so devlink was not the first one. That is why I decided not to rewrite. As of the rest of the "common infrastructure", what exactly do you have in mind? > >Sometime in the next day I will jump through the hoops to get a proper >devlink command.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/3/18 11:17 AM, Jiri Pirko wrote: > Wed, Jan 03, 2018 at 07:14:16PM CET, d...@cumulusnetworks.com wrote: >> On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: >>> As I stated this is a user-space bug which I fixed, and updated my repo >>> so please pull. Devlink uses mnl,and currently mnl does not support >>> extended ack. I added support for this in my local ver of libmnl: >>> >>> https://github.com/arkadis/libmnl.git >>> >>> On branch master, so you can check it out. Besides this bugs, which were >>> userspace, can please specify what are the pending problems from your >>> point of view? Thanks! >> >> devlink is in iproute2 package and it has extack support. See 'git log >> lib/libnetlink.c' > > Dave, devlink uses libmnl. > Now I remember. You wrote it independently and but needed iproute2 be a delivery vehicle. It uses none of the common infrastructure from iproute2. Could we make this more difficult Sometime in the next day I will jump through the hoops to get a proper devlink command.
Re: [patch net-next v2 00/10] Add support for resource abstraction
Wed, Jan 03, 2018 at 07:14:16PM CET, d...@cumulusnetworks.com wrote: >On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: >> As I stated this is a user-space bug which I fixed, and updated my repo >> so please pull. Devlink uses mnl,and currently mnl does not support >> extended ack. I added support for this in my local ver of libmnl: >> >> https://github.com/arkadis/libmnl.git >> >> On branch master, so you can check it out. Besides this bugs, which were >> userspace, can please specify what are the pending problems from your >> point of view? Thanks! > >devlink is in iproute2 package and it has extack support. See 'git log >lib/libnetlink.c' Dave, devlink uses libmnl.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote: > As I stated this is a user-space bug which I fixed, and updated my repo > so please pull. Devlink uses mnl,and currently mnl does not support > extended ack. I added support for this in my local ver of libmnl: > > https://github.com/arkadis/libmnl.git > > On branch master, so you can check it out. Besides this bugs, which were > userspace, can please specify what are the pending problems from your > point of view? Thanks! devlink is in iproute2 package and it has extack support. See 'git log lib/libnetlink.c' You do not need to modify libmnl to get extack output in devlink. Can you update the command accordingly? Once it works I will come back to this topic and take another look.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 01/02/2018 08:05 PM, David Ahern wrote: > On 1/1/18 7:58 AM, Arkadi Sharshevsky wrote: >> >> Just to summarize the current fixes required: >> >> 1. ERIF dpipe table size is reporting wrong size. More precisely the >>ERIF table does not take rifs, so it should not be linked to the rif >>bank resource (is not part of this patchset, future extension). >> 2. Extended ACK user-space bug. >> 3. ABI documentation- Not sure we agreed upon it, Jiri? >> >> If I missed something please respond. Nothing of the fixes mentioned >> above is relevant for this patchset actually. >> > > Can you fix the userspace command and then we come back to what else is > needed? Right now, it is hard to tell what is a user space bug and what > is a kernel space bug. > > For example: > $ devlink resource set pci/:03:00.0 path /kvd/linear size 1 > $ devlink resource show pci/:03:00.0 > pci/:03:00.0: > name kvd size 245760 size_valid true > resources: > name linear size 98304 occ 0 > name hash_double size 60416 > name hash_single size 87040 > > The set command did not fail, yet there is no size_new arg in the output > like there is for this change: > > $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 > $ devlink resource show pci/:03:00.0 > pci/:03:00.0: > name kvd size 245760 size_valid true > resources: > name linear size 98304 size_new 0 occ 0 > name hash_double size 60416 > name hash_single size 87040 > As I stated this is a user-space bug which I fixed, and updated my repo so please pull. Devlink uses mnl,and currently mnl does not support extended ack. I added support for this in my local ver of libmnl: https://github.com/arkadis/libmnl.git On branch master, so you can check it out. Besides this bugs, which were userspace, can please specify what are the pending problems from your point of view? Thanks!
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 1/1/18 7:58 AM, Arkadi Sharshevsky wrote: > > Just to summarize the current fixes required: > > 1. ERIF dpipe table size is reporting wrong size. More precisely the >ERIF table does not take rifs, so it should not be linked to the rif >bank resource (is not part of this patchset, future extension). > 2. Extended ACK user-space bug. > 3. ABI documentation- Not sure we agreed upon it, Jiri? > > If I missed something please respond. Nothing of the fixes mentioned > above is relevant for this patchset actually. > Can you fix the userspace command and then we come back to what else is needed? Right now, it is hard to tell what is a user space bug and what is a kernel space bug. For example: $ devlink resource set pci/:03:00.0 path /kvd/linear size 1 $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 size_valid true resources: name linear size 98304 occ 0 name hash_double size 60416 name hash_single size 87040 The set command did not fail, yet there is no size_new arg in the output like there is for this change: $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 size_valid true resources: name linear size 98304 size_new 0 occ 0 name hash_double size 60416 name hash_single size 87040
Re: [patch net-next v2 00/10] Add support for resource abstraction
Tue, Jan 02, 2018 at 02:41:13PM CET, and...@lunn.ch wrote: >> Question is where to put it. It is mlxsw-specific thing, moreover, >> Spectrum-specific thing, same as dpipe tables etc. Not sure. Perhaps >> Documentation/networking/mlxsw.txt ? > >Hi Jiri > >Documentation/ABI seems like the correct place. There is nothing in >the README which says it is limited to files. You could use an name >like devlink-driver-mlxsw. Okay. Sounds sane. Thanks.
Re: [patch net-next v2 00/10] Add support for resource abstraction
> Question is where to put it. It is mlxsw-specific thing, moreover, > Spectrum-specific thing, same as dpipe tables etc. Not sure. Perhaps > Documentation/networking/mlxsw.txt ? Hi Jiri Documentation/ABI seems like the correct place. There is nothing in the README which says it is limited to files. You could use an name like devlink-driver-mlxsw. Andrew
Re: [patch net-next v2 00/10] Add support for resource abstraction
Mon, Jan 01, 2018 at 03:58:33PM CET, arka...@mellanox.com wrote: > > >On 12/26/2017 01:23 PM, Jiri Pirko wrote: >> From: Jiri Pirko>> >> Many of the ASIC's internal resources are limited and are shared between >> several hardware procedures. For example, unified hash-based memory can >> be used for many lookup purposes, like FDB and LPM. In many cases the user >> can provide a partitioning scheme for such a resource in order to perform >> fine tuning for his application. In such cases performing driver reload is >> needed for the changes to take place, thus this patchset also adds support >> for hot reload. >> >> Such an abstraction can be coupled with devlink's dpipe interface, which >> models the ASIC's pipeline as a graph of match/action tables. By modeling >> the hardware resource object, and by coupling it to several dpipe tables, >> further visibility can be achieved in order to debug ASIC-wide issues. >> >> The proposed interface will provide the user the ability to understand the >> limitations of the hardware, and receive notification regarding its >> occupancy. >> Furthermore, monitoring the resource occupancy can be done in real-time and >> can be useful in many cases. >> --- >> Userspace part prototype can be found at >> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Farkadis%2Fiproute2%2F=02%7C01%7Carkadis%40mellanox.com%7C1ae3d8b4854a454e21e008d54c5329e3%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636498842440762657=7MC2BFQFxjnmHqy2sOOL9VEa4ZGq6e5Z2n2WvuNgyFk%3D=0 >> at resource_dev branch. >> >> v1->v2 >> - Add resource size attribute. >> - Fix split bug. >> > >Just to summarize the current fixes required: > >1. ERIF dpipe table size is reporting wrong size. More precisely the > ERIF table does not take rifs, so it should not be linked to the rif > bank resource (is not part of this patchset, future extension). >2. Extended ACK user-space bug. >3. ABI documentation- Not sure we agreed upon it, Jiri? Question is where to put it. It is mlxsw-specific thing, moreover, Spectrum-specific thing, same as dpipe tables etc. Not sure. Perhaps Documentation/networking/mlxsw.txt ? > >If I missed something please respond. Nothing of the fixes mentioned >above is relevant for this patchset actually. > >Couple of key-points: > >1. Constrains\trade off about setting the sizes - this can be obtained > trivially from the resource tree nested structure. >2. Dpipe provides the mapping of hardware processes to resources. >3. Units - each resource specifies his units, if dpipe table's size is > X and its related to some resource its size is normalized to that > resources basic unit. > >IMO this is the most hardware exact interaction, and this is the way it >should be exported from the kernel, if something is not presented in >'user' convenient way some utilities can be implemented in userspace >to easily do it. Furthermore, some examples will be provided for the >whole kvd tree partition for different cases (IPv6 heavy etc..). >Advanced user will be able to tweak it as they like. > >Regarding the 'switchdev' layer I think that kernel's software tables >like nexthops/neigh/routes should be mapped to dpipe tables and not >to resources directly: Sure. dpipe table -> resource mapping is the only one that makes sense. > >kernel_fdb--> dpipe_fdb -->/kvd/hash_single. > >> Arkadi Sharshevsky (10): >> devlink: Add per devlink instance lock >> devlink: Add support for resource abstraction >> devlink: Add support for reload >> devlink: Add relation between dpipe and resource >> mlxsw: pci: Add support for performing bus reset >> mlxsw: spectrum: Register KVD resources with devlink >> mlxsw: spectrum_dpipe: Connect dpipe tables to resources >> mlxsw: spectrum: Add support for getting kvdl occupancy >> mlxsw: pci: Add support for getting resource through devlink >> mlxsw: core: Add support for reload >> >> drivers/net/ethernet/mellanox/mlxsw/core.c | 85 ++- >> drivers/net/ethernet/mellanox/mlxsw/core.h | 16 +- >> drivers/net/ethernet/mellanox/mlxsw/i2c.c | 5 +- >> drivers/net/ethernet/mellanox/mlxsw/pci.c | 98 ++-- >> drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 205 >> drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 13 + >> .../net/ethernet/mellanox/mlxsw/spectrum_dpipe.c | 72 ++- >> .../net/ethernet/mellanox/mlxsw/spectrum_kvdl.c| 26 + >> include/net/devlink.h | 97 >> include/uapi/linux/devlink.h | 21 + >> net/core/devlink.c | 573 >> ++--- >> 11 files changed, 1079 insertions(+), 132 deletions(-) >>
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/26/2017 01:23 PM, Jiri Pirko wrote: > From: Jiri Pirko> > Many of the ASIC's internal resources are limited and are shared between > several hardware procedures. For example, unified hash-based memory can > be used for many lookup purposes, like FDB and LPM. In many cases the user > can provide a partitioning scheme for such a resource in order to perform > fine tuning for his application. In such cases performing driver reload is > needed for the changes to take place, thus this patchset also adds support > for hot reload. > > Such an abstraction can be coupled with devlink's dpipe interface, which > models the ASIC's pipeline as a graph of match/action tables. By modeling > the hardware resource object, and by coupling it to several dpipe tables, > further visibility can be achieved in order to debug ASIC-wide issues. > > The proposed interface will provide the user the ability to understand the > limitations of the hardware, and receive notification regarding its occupancy. > Furthermore, monitoring the resource occupancy can be done in real-time and > can be useful in many cases. > --- > Userspace part prototype can be found at > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Farkadis%2Fiproute2%2F=02%7C01%7Carkadis%40mellanox.com%7C1ae3d8b4854a454e21e008d54c5329e3%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636498842440762657=7MC2BFQFxjnmHqy2sOOL9VEa4ZGq6e5Z2n2WvuNgyFk%3D=0 > at resource_dev branch. > > v1->v2 > - Add resource size attribute. > - Fix split bug. > Just to summarize the current fixes required: 1. ERIF dpipe table size is reporting wrong size. More precisely the ERIF table does not take rifs, so it should not be linked to the rif bank resource (is not part of this patchset, future extension). 2. Extended ACK user-space bug. 3. ABI documentation- Not sure we agreed upon it, Jiri? If I missed something please respond. Nothing of the fixes mentioned above is relevant for this patchset actually. Couple of key-points: 1. Constrains\trade off about setting the sizes - this can be obtained trivially from the resource tree nested structure. 2. Dpipe provides the mapping of hardware processes to resources. 3. Units - each resource specifies his units, if dpipe table's size is X and its related to some resource its size is normalized to that resources basic unit. IMO this is the most hardware exact interaction, and this is the way it should be exported from the kernel, if something is not presented in 'user' convenient way some utilities can be implemented in userspace to easily do it. Furthermore, some examples will be provided for the whole kvd tree partition for different cases (IPv6 heavy etc..). Advanced user will be able to tweak it as they like. Regarding the 'switchdev' layer I think that kernel's software tables like nexthops/neigh/routes should be mapped to dpipe tables and not to resources directly: kernel_fdb--> dpipe_fdb -->/kvd/hash_single. > Arkadi Sharshevsky (10): > devlink: Add per devlink instance lock > devlink: Add support for resource abstraction > devlink: Add support for reload > devlink: Add relation between dpipe and resource > mlxsw: pci: Add support for performing bus reset > mlxsw: spectrum: Register KVD resources with devlink > mlxsw: spectrum_dpipe: Connect dpipe tables to resources > mlxsw: spectrum: Add support for getting kvdl occupancy > mlxsw: pci: Add support for getting resource through devlink > mlxsw: core: Add support for reload > > drivers/net/ethernet/mellanox/mlxsw/core.c | 85 ++- > drivers/net/ethernet/mellanox/mlxsw/core.h | 16 +- > drivers/net/ethernet/mellanox/mlxsw/i2c.c | 5 +- > drivers/net/ethernet/mellanox/mlxsw/pci.c | 98 ++-- > drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 205 > drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 13 + > .../net/ethernet/mellanox/mlxsw/spectrum_dpipe.c | 72 ++- > .../net/ethernet/mellanox/mlxsw/spectrum_kvdl.c| 26 + > include/net/devlink.h | 97 > include/uapi/linux/devlink.h | 21 + > net/core/devlink.c | 573 > ++--- > 11 files changed, 1079 insertions(+), 132 deletions(-) >
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/31/2017 05:46 PM, David Ahern wrote: > On 12/31/17 3:52 AM, Arkadi Sharshevsky wrote: >>> [1] This is allowed by the current patch set and perhaps it should not be: >>> >>> $ ip ro ls vrf vrf1101 >>> unreachable default metric 8192 >>> 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload >>> 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload >>> 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload >>> 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload >>> 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload >>> 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload >>> 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload >>> 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload >>> >>> $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 >> >> This line actually did nothing, because size zero is not acceptable >> see patch 6. This is pure userpsace problem that error is not shown. > > Then perhaps you have a kernel side bug. After the reload I get this: > > $ devlink resource show pci/:03:00.0 > pci/:03:00.0: > name kvd size 245760 size_valid true > resources: > name linear size 0 occ 0 > name hash_double size 60416 > name hash_single size 87040 > Actually no bug here, the linear can be zero. This implies no nexthop routes. The adj table uses it as you can see. > >> >> You can verify it by dumping the resources and see that there is no >> pending change (only size and not size_new). >> >>> $ devlink reload pci/:03:00.0 >>> $ ip ro ls vrf vrf1101 >>> unreachable default metric 8192 >>> >> >> So you just performed full reload of the driver which includes >> unregistration of all the netdevs and full init. KVD update requires >> full teardown of the driver. > > you are right, I forgot to do a networking reload. Because of the above > (0 was actually taken) all kinds of errors are spewed on 'ifreload -av' > and there is no change to the ro ls: > > $ ip ro ls vrf vrf1101 > unreachable default metric 8192 > >> >> The system will not get back to the same state after reloading, >> It's should be done on init. But it doesn't have to be like this >> this, each driver provides his own reload devlink op implementation >> so in our case full blown reset is required. >> >> >>> [2] Same exact result for setting hash_double to 0: >>> $ ip ro ls vrf vrf1101 >>> unreachable default metric 8192 >>> 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload >>> 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload >>> 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload >>> 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload >>> 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload >>> 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload >>> 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload >>> 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload >>> >>> $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 0 > > On this command you are correct, 0 is not taken: > $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 0 > $ devlink resource show pci/:03:00.0 > pci/:03:00.0: > name kvd size 245760 size_valid true > resources: > name linear size 0 occ 0 > name hash_double size 60416 > name hash_single size 87040 > > but the 'set' command did not fail with a proper extack based error > message, so consider this another a bug report. > Yeah, this is a bug, but a userspace one. I actually sniffed the nl messages with nlmon and saw the extended ack packet with the required string.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/31/17 3:52 AM, Arkadi Sharshevsky wrote: >> [1] This is allowed by the current patch set and perhaps it should not be: >> >> $ ip ro ls vrf vrf1101 >> unreachable default metric 8192 >> 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload >> 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload >> 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload >> 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload >> 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload >> 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload >> 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload >> 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload >> >> $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 > > This line actually did nothing, because size zero is not acceptable > see patch 6. This is pure userpsace problem that error is not shown. Then perhaps you have a kernel side bug. After the reload I get this: $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 size_valid true resources: name linear size 0 occ 0 name hash_double size 60416 name hash_single size 87040 > > You can verify it by dumping the resources and see that there is no > pending change (only size and not size_new). > >> $ devlink reload pci/:03:00.0 >> $ ip ro ls vrf vrf1101 >> unreachable default metric 8192 >> > > So you just performed full reload of the driver which includes > unregistration of all the netdevs and full init. KVD update requires > full teardown of the driver. you are right, I forgot to do a networking reload. Because of the above (0 was actually taken) all kinds of errors are spewed on 'ifreload -av' and there is no change to the ro ls: $ ip ro ls vrf vrf1101 unreachable default metric 8192 > > The system will not get back to the same state after reloading, > It's should be done on init. But it doesn't have to be like this > this, each driver provides his own reload devlink op implementation > so in our case full blown reset is required. > > >> [2] Same exact result for setting hash_double to 0: >> $ ip ro ls vrf vrf1101 >> unreachable default metric 8192 >> 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload >> 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload >> 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload >> 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload >> 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload >> 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload >> 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload >> 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload >> >> $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 0 On this command you are correct, 0 is not taken: $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 0 $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 size_valid true resources: name linear size 0 occ 0 name hash_double size 60416 name hash_single size 87040 but the 'set' command did not fail with a proper extack based error message, so consider this another a bug report.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/30/2017 11:15 PM, David Ahern wrote: > On 12/28/17 1:21 AM, Yuval Mintz wrote: >> I think it goes the other way around. The dpipe tables are the ones that >> can be translated to functionality; The resources are internal and >> HW-specific >> representing the possible internal division of resources - >> but a given resource sn't necessarily mapped to a single networking feature. >> [It might be in some cases, but not in the general case] > > This is what I am getting at -- a single resource /kvd/linear is used > for multiple networking features, and those networking features do map > to well known entities -- fdb entries, ACL entries, ipv4/v6 host > entries, LPM entries, etc. > > Nothing about the output from devlink helps the user in any way to > understand how to change the resource values. Saying that these The current patchset adds the following dpipe table <--> resource relation host4 -- hash single host6 -- hash double adj -- linear By dumping the resources via the 'resource show' you can the tree like structure, you can see that you have a tradeoff between those subparts. So for example if a user would like to increase the number of nexthops with the expense of neighbors, it is pretty clear. As more dpipe table will be introduced this relations will be more complete and the user will get the complete view of the ASIC. Just to summarize, the user gets the following info 1. Constrains\trade off about setting the sizes - this you get by the tree structure. 2. Each hardware process which use this resource is mapped to it By combining those two you can get the most accurate information about what your change will do. Partitioning of the KVD is very delicate process, because the hardware is complex. Many hardware processes are pointing to this memory and size changes effect the whole ASIC, as I mentioned as more of the pipeline will be exposed via dpipe the user will get a more precise vision of the hardware. We will provide some recommended and tested configuration of the whole mlxsw resource tree for different user scenarios. A more experienced user can do it for himself, if he got some very special scenario. > resources, what they mean and how they are used is MLX proprietary and > is known only to MLX employees and those with MLX agreements is not > acceptable. Likewise, requiring some network admin to deep dive into the > mlxsw driver to piece together how kvd/linear (for example) is used is > not acceptable. > > The cover letter touts "Many of the ASIC's internal resources are > limited and are shared between several hardware procedures. For example, > unified hash-based memory can be used for many lookup purposes, like FDB > and LPM. In many cases the user can provide a partitioning scheme for > such a resource in order to perform fine tuning for his application." > > Great, now give the user some indication of how to do that. Is setting > /kvd/linear to 0 acceptable? If not, why? What functionality is lost? > (Apparently, everything [1].) > > The dpipe tables list some correlation between the kvd resources and > tables but that is not a complete list and again there is nothing to > tell a user that it is only a partial list of how a kvd resource is This is work in progress, the LPM block will be exposed as the last L3 part. Then we will start the l2 part of the ASIC. > used. For example, it shows ipv4 host is in /kvd/hash_single and that is > all it shows. So if I have an ipv6 only deployment can I conclude that I > can set /kvd/hash_single to 0? Or the reverse, can I set hash_double to > 0 for an ipv4 only deployment? From the limited information given, it is > reasonable for a user to assume yes and has to learn through trial and > error what can be done. [2] > So you want to add min/max size attribute? I think this its not needed. > - > > [1] This is allowed by the current patch set and perhaps it should not be: > > $ ip ro ls vrf vrf1101 > unreachable default metric 8192 > 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload > 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload > 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload > 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload > 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload > 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload > 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload > 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload > > $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 This line actually did nothing, because size zero is not acceptable see patch 6. This is pure userpsace problem that error is not shown. You can verify it by dumping the resources and see that there is no pending change (only size and not size_new). > $ devlink reload pci/:03:00.0 > $ ip ro ls vrf vrf1101 > unreachable default
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/28/17 1:21 AM, Yuval Mintz wrote: > I think it goes the other way around. The dpipe tables are the ones that > can be translated to functionality; The resources are internal and HW-specific > representing the possible internal division of resources - > but a given resource sn't necessarily mapped to a single networking feature. > [It might be in some cases, but not in the general case] This is what I am getting at -- a single resource /kvd/linear is used for multiple networking features, and those networking features do map to well known entities -- fdb entries, ACL entries, ipv4/v6 host entries, LPM entries, etc. Nothing about the output from devlink helps the user in any way to understand how to change the resource values. Saying that these resources, what they mean and how they are used is MLX proprietary and is known only to MLX employees and those with MLX agreements is not acceptable. Likewise, requiring some network admin to deep dive into the mlxsw driver to piece together how kvd/linear (for example) is used is not acceptable. The cover letter touts "Many of the ASIC's internal resources are limited and are shared between several hardware procedures. For example, unified hash-based memory can be used for many lookup purposes, like FDB and LPM. In many cases the user can provide a partitioning scheme for such a resource in order to perform fine tuning for his application." Great, now give the user some indication of how to do that. Is setting /kvd/linear to 0 acceptable? If not, why? What functionality is lost? (Apparently, everything [1].) The dpipe tables list some correlation between the kvd resources and tables but that is not a complete list and again there is nothing to tell a user that it is only a partial list of how a kvd resource is used. For example, it shows ipv4 host is in /kvd/hash_single and that is all it shows. So if I have an ipv6 only deployment can I conclude that I can set /kvd/hash_single to 0? Or the reverse, can I set hash_double to 0 for an ipv4 only deployment? From the limited information given, it is reasonable for a user to assume yes and has to learn through trial and error what can be done. [2] - [1] This is allowed by the current patch set and perhaps it should not be: $ ip ro ls vrf vrf1101 unreachable default metric 8192 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload $ devlink resource set pci/:03:00.0 path /kvd/linear size 0 $ devlink reload pci/:03:00.0 $ ip ro ls vrf vrf1101 unreachable default metric 8192 [2] Same exact result for setting hash_double to 0: $ ip ro ls vrf vrf1101 unreachable default metric 8192 11.2.51.0/24 dev swp1s0.51 proto kernel scope link src 11.2.51.1 offload 11.3.51.0/24 dev swp1s1.51 proto kernel scope link src 11.3.51.1 offload 11.4.51.0/24 dev swp1s2.51 proto kernel scope link src 11.4.51.1 offload 11.5.51.0/24 dev swp1s3.51 proto kernel scope link src 11.5.51.1 offload 11.6.51.0/24 dev swp3s0.51 proto kernel scope link src 11.6.51.1 offload 11.7.51.0/24 dev swp3s1.51 proto kernel scope link src 11.7.51.1 offload 11.8.51.0/24 dev swp3s2.51 proto kernel scope link src 11.8.51.1 offload 11.9.51.0/24 dev swp3s3.51 proto kernel scope link src 11.9.51.1 offload $ devlink resource set pci/:03:00.0 path /kvd/hash_double size 0 $ devlink reload pci/:03:00.0 $ ip ro ls vrf vrf1101 unreachable default metric 8192
Re: [patch net-next v2 00/10] Add support for resource abstraction
> In my opinion it should not change. Unless there is a bug (like the one > DaveA found in mlxsw erif table). Existing tables and resources should > be only added. It is the driver's maintainer responsibility to not to > break user scripts. So we agree with is ABI. Great. Andrew
Re: [patch net-next v2 00/10] Add support for resource abstraction
Sat, Dec 30, 2017 at 11:18:50AM CET, and...@lunn.ch wrote: >> 1. Both dpipe and devlink resource are abstraction models for >> hardware entities, and as a result they true to provide generic objects. >> Each driver/ASIC should register his own and it absolutely proprietary >> implementation. There is absolutely NO industry standard here, the only >> thing that resembles a standard is that dpipe looks a bit like P4 only >> because its proved to be useful for describing packet forwarding >> pipelines. The host4 table is just a hardware process in the mellanox >> spectrum ASIC pipeline and it should not be part of ABI, sorry I clearly >> don't understand how this even came up. > >Why should it not be part of the ABI? Are you saying it will change >from version to version? Are you trying to warning people not to write >scripts using it, because its meaning will change and what works now >will break in the next kernel version? In my opinion it should not change. Unless there is a bug (like the one DaveA found in mlxsw erif table). Existing tables and resources should be only added. It is the driver's maintainer responsibility to not to break user scripts.
Re: [patch net-next v2 00/10] Add support for resource abstraction
> 1. Both dpipe and devlink resource are abstraction models for > hardware entities, and as a result they true to provide generic objects. > Each driver/ASIC should register his own and it absolutely proprietary > implementation. There is absolutely NO industry standard here, the only > thing that resembles a standard is that dpipe looks a bit like P4 only > because its proved to be useful for describing packet forwarding > pipelines. The host4 table is just a hardware process in the mellanox > spectrum ASIC pipeline and it should not be part of ABI, sorry I clearly > don't understand how this even came up. Why should it not be part of the ABI? Are you saying it will change from version to version? Are you trying to warning people not to write scripts using it, because its meaning will change and what works now will break in the next kernel version? Andrew
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/28/2017 06:33 PM, David Ahern wrote: > On 12/28/17 10:23 AM, Jiri Pirko wrote: >>> So there are 4 tables exported to userspace: >>> >>> 1. mlxsw_erif table which is not in any of the kvd regions (no >>> resource path is given) and it has a size of 1000. Does >>> mlxsw_erif mean a rif as in Router Interfaces? So the switch >>> supports up to 1000 router interfaces. >>> >>> 2. mlxsw_host4 in /kvd/hash_single with a size of 62. Based on >>> the >> Size tells you the actual size. It cannot give you max size. The >> reason is simple. The resources are shared among multiple tables. >> That is exactly what this resource patch makes visible. >> >> > > In the erif table, the 1000 is the max not current usage. I do not > have 1000 interfaces: > > $ ip -br li sh | wc -l 597 > > > $ devlink dpipe table dump pci/:03:00.0 name mlxsw_erif ... index > 503 match_value: type field_exact header mlxsw_meta field erif_port > mapping ifindex mapping_value 601 value 503 action_value: type > field_modify header mlxsw_meta field l3_forward value 1 > > > The host4 table it is current size with no maximum. > > The meaning of table size needs to be consistent across tables. > You are right the egress RIF table size is not correct, I will definitely fix it, but it is not what you think it should be. So in order to clarify this point, just a reminder: 1. Both dpipe and devlink resource are abstraction models for hardware entities, and as a result they true to provide generic objects. Each driver/ASIC should register his own and it absolutely proprietary implementation. There is absolutely NO industry standard here, the only thing that resembles a standard is that dpipe looks a bit like P4 only because its proved to be useful for describing packet forwarding pipelines. The host4 table is just a hardware process in the mellanox spectrum ASIC pipeline and it should not be part of ABI, sorry I clearly don't understand how this even came up. 2. Dpipe table is a single hardware process, most of the time it uses some resources (for example LPM algorithm uses hash memory). 3. ERIF table is a table that is located in the end of the L3 pipeline. The current dpipe description is not complete and that why it caused confusion. The table performs match on rif index and packet type (UC/MC/BC) and performs forward/drop decision. As you can see, for each rif the table can have several entries, which provide different statistics for different traffic types per rif, currently only the UC is exposed with forward. 4. ASICs use shared resource for many processes, this is exactly the behavior we want to expose! Again, the size of the ERIF table should NOT provide the number of rifs which are in use, simply because dpipe tables do not describe hardware resources. In the future the RIF bank will be exported as resource object with size of 1000, and in order to observe how much are in use you should check its occupancy. This is the whole reason of this interface.
Re: [patch net-next v2 00/10] Add support for resource abstraction
Thu, Dec 28, 2017 at 05:33:58PM CET, d...@cumulusnetworks.com wrote: >On 12/28/17 10:23 AM, Jiri Pirko wrote: >>> So there are 4 tables exported to userspace: >>> >>> 1. mlxsw_erif table which is not in any of the kvd regions (no resource >>> path is given) and it has a size of 1000. Does mlxsw_erif mean a rif as >>> in Router Interfaces? So the switch supports up to 1000 router interfaces. >>> >>> 2. mlxsw_host4 in /kvd/hash_single with a size of 62. Based on the >> Size tells you the actual size. It cannot give you max size. The reason >> is simple. The resources are shared among multiple tables. That is >> exactly what this resource patch makes visible. >> >> > >In the erif table, the 1000 is the max not current usage. I do not have I believe that is a bug in erif dpipe implementation (mlxsw_sp_dpipe_table_erif_size_get) We'll fix that. Thanks! >1000 interfaces: > >$ ip -br li sh | wc -l >597 > > >$ devlink dpipe table dump pci/:03:00.0 name mlxsw_erif >... > index 503 > match_value: >type field_exact header mlxsw_meta field erif_port mapping ifindex >mapping_value 601 value 503 > action_value: >type field_modify header mlxsw_meta field l3_forward value 1 > > >The host4 table it is current size with no maximum. > >The meaning of table size needs to be consistent across tables.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/28/17 10:23 AM, Jiri Pirko wrote: >> So there are 4 tables exported to userspace: >> >> 1. mlxsw_erif table which is not in any of the kvd regions (no resource >> path is given) and it has a size of 1000. Does mlxsw_erif mean a rif as >> in Router Interfaces? So the switch supports up to 1000 router interfaces. >> >> 2. mlxsw_host4 in /kvd/hash_single with a size of 62. Based on the > Size tells you the actual size. It cannot give you max size. The reason > is simple. The resources are shared among multiple tables. That is > exactly what this resource patch makes visible. > > In the erif table, the 1000 is the max not current usage. I do not have 1000 interfaces: $ ip -br li sh | wc -l 597 $ devlink dpipe table dump pci/:03:00.0 name mlxsw_erif ... index 503 match_value: type field_exact header mlxsw_meta field erif_port mapping ifindex mapping_value 601 value 503 action_value: type field_modify header mlxsw_meta field l3_forward value 1 The host4 table it is current size with no maximum. The meaning of table size needs to be consistent across tables.
Re: [patch net-next v2 00/10] Add support for resource abstraction
Thu, Dec 28, 2017 at 05:09:09PM CET, d...@cumulusnetworks.com wrote: >On 12/28/17 2:25 AM, Yuval Mintz wrote: > Again, I have no objections to kvd, linear, hash, etc terms as they do > relate to Mellanox products. But kvd/linear, for example, does correlate > to industry standard concepts in some way. My request is that the > resource listing guide the user in some way, stating what these > resources mean. So the showed relation to dpipe table would be enougn or you would still like to see some description? I don't like the description concept here as the relations to dpipe table should tell user exactly what he needs to know. >>> >>> I believe it is useful to have a 1-line, short description that gives >>> the user some memory jogger as to what the resource is used for. It does >>> not have to be an exhaustive list, but the user should not have to do >>> mental jumping jacks running a bunch of commands to understand the >>> resources for vendor specific asics. >> >> Perhaps we can simply have devlink utility output the dpipe >> table[s] associated with the resource when showing the resource? >> It would contain live information as well as prevent the need for >> 'mental jumping jacks'. >> > >My primary contention for this static partitioning is that the proposal >is not giving the user the information they need to make decisions. > >As I mentioned earlier, the resource show command gives this: >$ devlink resource show pci/:03:00.0 >pci/:03:00.0: > name kvd size 245760 size_valid true > resources: >name linear size 98304 occ 0 >name hash_double size 60416 >name hash_single size 87040 > >the paths /kvd/linear, /kvd/hash_single and /kvd/hash_double are >essentially random names (nothing related to industry standard names) Of course. There is no industry standard for internal ASIC implementations. This is the same as for dpipe. There is no industry standard for ASIC pipeline. dpipe visualizes it. This resource patch visualizes the internal ASIC resources and their mapping to the individual dpipe tables. >and the listed sizes are random numbers (no units)[1]. There is nothing >there to tell a user what they can adjust or why they would want to make >an adjustment. > > >Looking at 'dpipe table show': > >$ devlink dpipe table show pci/:03:00.0 >pci/:03:00.0: > name mlxsw_erif size 1000 counters_enabled false > match: >type field_exact header mlxsw_meta field erif_port mapping ifindex > action: >type field_modify header mlxsw_meta field l3_forward >type field_modify header mlxsw_meta field l3_drop > > resource_path /kvd/hash_single name mlxsw_host4 size 62 >counters_enabled false > match: >type field_exact header mlxsw_meta field erif_port mapping ifindex >type field_exact header ipv4 field destination ip > action: >type field_modify header ethernet field destination mac > > resource_path /kvd/hash_double name mlxsw_host6 size 0 >counters_enabled false > match: >type field_exact header mlxsw_meta field erif_port mapping ifindex >type field_exact header ipv6 field destination ip > action: >type field_modify header ethernet field destination mac > > resource_path /kvd/linear name mlxsw_adj size 0 counters_enabled false > match: >type field_exact header mlxsw_meta field adj_index >type field_exact header mlxsw_meta field adj_size >type field_exact header mlxsw_meta field adj_hash_index > action: >type field_modify header ethernet field destination mac >type field_modify header mlxsw_meta field erif_port mapping ifindex > > >So there are 4 tables exported to userspace: > >1. mlxsw_erif table which is not in any of the kvd regions (no resource >path is given) and it has a size of 1000. Does mlxsw_erif mean a rif as >in Router Interfaces? So the switch supports up to 1000 router interfaces. > >2. mlxsw_host4 in /kvd/hash_single with a size of 62. Based on the Size tells you the actual size. It cannot give you max size. The reason is simple. The resources are shared among multiple tables. That is exactly what this resource patch makes visible. >fields mlxsw_host4 means IPv4 host entries (neighbor entries). Why is >the size set at 62? Seems really low. > >3. mlxsw_host6 in /kvd/hash_double with a size of 0. Based on the fields >mlxsw_host6 means IPv6 host entries (neighbor entries). The size of 0 is >concerning. I guess the switch is not configured to do IPv6? > >4. mlxsw_adj in /kvd/linear with a size of 0. Based on the fields I am >going to guess it is an fdb entry
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/28/17 2:25 AM, Yuval Mintz wrote: Again, I have no objections to kvd, linear, hash, etc terms as they do relate to Mellanox products. But kvd/linear, for example, does correlate to industry standard concepts in some way. My request is that the resource listing guide the user in some way, stating what these resources mean. >>> >>> So the showed relation to dpipe table would be enougn or you would still >>> like to see some description? I don't like the description concept here >>> as the relations to dpipe table should tell user exactly what he needs >>> to know. >> >> I believe it is useful to have a 1-line, short description that gives >> the user some memory jogger as to what the resource is used for. It does >> not have to be an exhaustive list, but the user should not have to do >> mental jumping jacks running a bunch of commands to understand the >> resources for vendor specific asics. > > Perhaps we can simply have devlink utility output the dpipe > table[s] associated with the resource when showing the resource? > It would contain live information as well as prevent the need for > 'mental jumping jacks'. > My primary contention for this static partitioning is that the proposal is not giving the user the information they need to make decisions. As I mentioned earlier, the resource show command gives this: $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 size_valid true resources: name linear size 98304 occ 0 name hash_double size 60416 name hash_single size 87040 the paths /kvd/linear, /kvd/hash_single and /kvd/hash_double are essentially random names (nothing related to industry standard names) and the listed sizes are random numbers (no units)[1]. There is nothing there to tell a user what they can adjust or why they would want to make an adjustment. Looking at 'dpipe table show': $ devlink dpipe table show pci/:03:00.0 pci/:03:00.0: name mlxsw_erif size 1000 counters_enabled false match: type field_exact header mlxsw_meta field erif_port mapping ifindex action: type field_modify header mlxsw_meta field l3_forward type field_modify header mlxsw_meta field l3_drop resource_path /kvd/hash_single name mlxsw_host4 size 62 counters_enabled false match: type field_exact header mlxsw_meta field erif_port mapping ifindex type field_exact header ipv4 field destination ip action: type field_modify header ethernet field destination mac resource_path /kvd/hash_double name mlxsw_host6 size 0 counters_enabled false match: type field_exact header mlxsw_meta field erif_port mapping ifindex type field_exact header ipv6 field destination ip action: type field_modify header ethernet field destination mac resource_path /kvd/linear name mlxsw_adj size 0 counters_enabled false match: type field_exact header mlxsw_meta field adj_index type field_exact header mlxsw_meta field adj_size type field_exact header mlxsw_meta field adj_hash_index action: type field_modify header ethernet field destination mac type field_modify header mlxsw_meta field erif_port mapping ifindex So there are 4 tables exported to userspace: 1. mlxsw_erif table which is not in any of the kvd regions (no resource path is given) and it has a size of 1000. Does mlxsw_erif mean a rif as in Router Interfaces? So the switch supports up to 1000 router interfaces. 2. mlxsw_host4 in /kvd/hash_single with a size of 62. Based on the fields mlxsw_host4 means IPv4 host entries (neighbor entries). Why is the size set at 62? Seems really low. 3. mlxsw_host6 in /kvd/hash_double with a size of 0. Based on the fields mlxsw_host6 means IPv6 host entries (neighbor entries). The size of 0 is concerning. I guess the switch is not configured to do IPv6? 4. mlxsw_adj in /kvd/linear with a size of 0. Based on the fields I am going to guess it is an fdb entry So if I combine the output of "resource show" and "dpipe table show", I can conclude: 1. /kvd/linear is only used for fdb entries. So if I want an L3 only use case I can set /kvd/linear 0. Is that correct? I believe the answer is no, but based on the information given it seems that way. 2. /kvd/hash_double has a size of 60416, but it is only used for mlxsw_host6 table which has a size of 0. Now, I am confused. 3. /kvd/hash_single has a size of 87040, but it is only used for mlxsw_host4 table which has a size of 62. Again confused. What is a user to make of this output? How is it useful for making decisions on whether to increase or decrease some resource? [1] In a response, Jiri mentioned units are added by this patch set but all I see in the patches is this: @@ -245,4 +256,8 @@ enum devlink_dpipe_header_id { DEVLINK_DPIPE_HEADER_IPV6, }; +enum devlink_resource_unit { + DEVLINK_RESOURCE_UNIT_DOUBLE_WORD, +}; + #endif /* _UAPI_LINUX_DEVLINK_H_ */ DEVLINK_RESOURCE_UNIT_DOUBLE_WORD means what???
RE: [patch net-next v2 00/10] Add support for resource abstraction
> >>> Many of the ASIC's internal resources are limited and are shared > between > >>> several hardware procedures. For example, unified hash-based memory > can > >>> be used for many lookup purposes, like FDB and LPM. In many cases the > user > >>> can provide a partitioning scheme for such a resource in order to > perform > >>> fine tuning for his application. In such cases performing driver reload is > >>> needed for the changes to take place, thus this patchset also adds support > >>> for hot reload. > >>> > >>> Such an abstraction can be coupled with devlink's dpipe interface, which > >>> models the ASIC's pipeline as a graph of match/action tables. By > modeling > >>> the hardware resource object, and by coupling it to several dpipe tables, > >>> further visibility can be achieved in order to debug ASIC-wide issues. > >>> > >>> The proposed interface will provide the user the ability to understand the > >>> limitations of the hardware, and receive notification regarding its > occupancy. > >>> Furthermore, monitoring the resource occupancy can be done in real- > time and > >>> can be useful in many cases. > >> > >> In the last RFC (not v1, but RFC) I asked for some kind of description > >> for each resource, and you and Arkadi have pushed back. Let's walk > >> through an example to see what I mean: > >> > >> $ devlink resource show pci/:03:00.0 > >> pci/:03:00.0: > >> name kvd size 245760 size_valid true > >> resources: > >>name linear size 98304 occ 0 > >>name hash_double size 60416 > >>name hash_single size 87040 > >> > >> So this 2700 has 3 resources that can be managed -- some table or > >> resource or something named 'kvd' with linear, hash_double and > >> hash_single sub-resources. What are these names referring too? The > above > >> output gives no description, and 'kvd' is not an industry term. Further, > > > > This are internal resources specific to the ASIC. Would you like some > > description to each or something like that? > > devlink has some nice self-documenting capabilities. What's missing here > is a description of what the resource is used for in standard terms -- > ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a > short list versus an exhaustive list of everything it is used for. e.g., > Why would a user decrease linear and increase hash_single or vice versa? > > > > > > >> what are these sizes that a user can control? The output contains no > >> units, no description, nothing. In short, the above output provides > >> random numbers associated with random names. > > > > Units are now exposed from kernel, just this version of iproute2 patch > > does not display it. > > please provide an iproute2 patch that does so the full context if this > patch set can be reviewed from a user perspective. > > > > > > >> > >> I can see dpipe tables exported by this device: > >> > >> $ devlink dpipe header show pci/:03:00.0 > >> > >> pci/:03:00.0: > >> name mlxsw_meta > >> field: > >>name erif_port bitwidth 32 mapping_type ifindex > >>name l3_forward bitwidth 1 > >>name l3_drop bitwidth 1 > >>name adj_index bitwidth 32 > >>name adj_size bitwidth 32 > >>name adj_hash_index bitwidth 32 > >> > >> name ipv6 > >> field: > >>name destination ip bitwidth 128 > >> > >> name ipv4 > >> field: > >>name destination ip bitwidth 32 > >> > >> name ethernet > >> field: > >>name destination mac bitwidth 48 > >> > >> but none mention 'kvd' or 'linear' or 'hash" and none of the other > >> various devlink options: > >> > >> $ devlink > >> Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help } > >> where OBJECT := { dev | port | sb | monitor | dpipe } > >> > >> seem to related to resources. > >> > >> So how does a user know what they are controlling by this 'resource' > >> option? Is the user expected to have a PRM or user guide on hand for the > >> specific device model that is being configured? > > > > The relation of specific dpipe table to specific resource is exposed by > > the kernel as well. Probably the iproute2 patch just does not display > > it. > > please provide an iproute2 patch that does so the full context if this > patch set can be reviewed from a user perspective. > > > > > > >> > >> Again, I have no objections to kvd, linear, hash, etc terms as they do > >> relate to Mellanox products. But kvd/linear, for example, does correlate > >> to industry standard concepts in some way. My request is that the > >> resource listing guide the user in some way, stating what these > >> resources mean. > > > > So the showed relation to dpipe table would be enougn or you would still > > like to see some description? I don't like the description concept here > > as the relations to dpipe table should tell user exactly what he needs > > to know. > > I believe it is useful to have a 1-line, short description that gives > the user some memory jogger as to what the resource is used for. It does > not have to be an
RE: [patch net-next v2 00/10] Add support for resource abstraction
> Many of the ASIC's internal resources are limited and are shared > between > several hardware procedures. For example, unified hash-based memory > can > be used for many lookup purposes, like FDB and LPM. In many cases the > user > can provide a partitioning scheme for such a resource in order to > perform > fine tuning for his application. In such cases performing driver reload > is > needed for the changes to take place, thus this patchset also adds > support > for hot reload. > > Such an abstraction can be coupled with devlink's dpipe interface, which > models the ASIC's pipeline as a graph of match/action tables. By > modeling > the hardware resource object, and by coupling it to several dpipe tables, > further visibility can be achieved in order to debug ASIC-wide issues. > > The proposed interface will provide the user the ability to understand > the > limitations of the hardware, and receive notification regarding its > occupancy. > Furthermore, monitoring the resource occupancy can be done in real- > time and > can be useful in many cases. > >>> > >>> In the last RFC (not v1, but RFC) I asked for some kind of description > >>> for each resource, and you and Arkadi have pushed back. Let's walk > >>> through an example to see what I mean: > >>> > >>> $ devlink resource show pci/:03:00.0 > >>> pci/:03:00.0: > >>> name kvd size 245760 size_valid true > >>> resources: > >>>name linear size 98304 occ 0 > >>>name hash_double size 60416 > >>>name hash_single size 87040 > >>> > >>> So this 2700 has 3 resources that can be managed -- some table or > >>> resource or something named 'kvd' with linear, hash_double and > >>> hash_single sub-resources. What are these names referring too? The > above > >>> output gives no description, and 'kvd' is not an industry term. Further, > >> > >> This are internal resources specific to the ASIC. Would you like some > >> description to each or something like that? > > > > devlink has some nice self-documenting capabilities. What's missing here > > is a description of what the resource is used for in standard terms -- > > ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a > > short list versus an exhaustive list of everything it is used for. e.g., > > Why would a user decrease linear and increase hash_single or vice versa? > > > Arkadi, on what david says above, can the resource names and ids not > be driver specific, but moved up to the switchdev layer and just map > to fdb, host routes, nexthops table sizes etc ?. > Can these generic networking resources then in-turn be mapped to kvd > sizes by the driver ? I think it goes the other way around. The dpipe tables are the ones that can be translated to functionality; The resources are internal and HW-specific representing the possible internal division of resources - but a given resource sn't necessarily mapped to a single networking feature. [It might be in some cases, but not in the general case] You could always move to a structured approach where each resource in the hierarchy is further split to sub-resources until each leaf represents a single network concepts - but that would stop be an abstraction of the HW resources and become a SW implementation instead, as SW would have to be the one to maintain and enforce the resource distribution. And that's not what we're trying to achieve here.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/27/17 1:31 PM, Andrew Lunn wrote: >> Hmm. That documents mainly sysfs. No mention of Netlink at all. But >> maybe I missed it. Also, that defines the interface as is. However we >> are talking about the data exchanged over the interface, not the >> interface itself. I don't see how ASIC/HW specific thing, like for >> example KVD in our case could be part of kernel ABI. > > You need to be very careful here. As soon as somebody starts using it, > it might become an ABI. Or you need to clearly document it is not ABI, > there is no guarantee it will not disappear or change its meaning in > the next kernel, and it should be used with extreme caution. > +1 Once the names go in, people can write scripts that invoke devlink at boot to partition resources. With the proposed patch set, the name (e.g., kvd/linear) becomes part of the ABI.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/27/2017 06:34 PM, David Ahern wrote: > On 12/27/17 2:09 AM, Jiri Pirko wrote: >> Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote: >>> On 12/26/17 5:23 AM, Jiri Pirko wrote: From: Jiri PirkoMany of the ASIC's internal resources are limited and are shared between several hardware procedures. For example, unified hash-based memory can be used for many lookup purposes, like FDB and LPM. In many cases the user can provide a partitioning scheme for such a resource in order to perform fine tuning for his application. In such cases performing driver reload is needed for the changes to take place, thus this patchset also adds support for hot reload. Such an abstraction can be coupled with devlink's dpipe interface, which models the ASIC's pipeline as a graph of match/action tables. By modeling the hardware resource object, and by coupling it to several dpipe tables, further visibility can be achieved in order to debug ASIC-wide issues. The proposed interface will provide the user the ability to understand the limitations of the hardware, and receive notification regarding its occupancy. Furthermore, monitoring the resource occupancy can be done in real-time and can be useful in many cases. >>> >>> In the last RFC (not v1, but RFC) I asked for some kind of description >>> for each resource, and you and Arkadi have pushed back. Let's walk >>> through an example to see what I mean: >>> >>> $ devlink resource show pci/:03:00.0 >>> pci/:03:00.0: >>> name kvd size 245760 size_valid true >>> resources: >>>name linear size 98304 occ 0 >>>name hash_double size 60416 >>>name hash_single size 87040 >>> >>> So this 2700 has 3 resources that can be managed -- some table or >>> resource or something named 'kvd' with linear, hash_double and >>> hash_single sub-resources. What are these names referring too? The above >>> output gives no description, and 'kvd' is not an industry term. Further, >> >> This are internal resources specific to the ASIC. Would you like some >> description to each or something like that? > > devlink has some nice self-documenting capabilities. What's missing here > is a description of what the resource is used for in standard terms -- > ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a > short list versus an exhaustive list of everything it is used for. e.g., > Why would a user decrease linear and increase hash_single or vice versa? > >> >> >>> what are these sizes that a user can control? The output contains no >>> units, no description, nothing. In short, the above output provides >>> random numbers associated with random names. >> >> Units are now exposed from kernel, just this version of iproute2 patch >> does not display it. > > please provide an iproute2 patch that does so the full context if this > patch set can be reviewed from a user perspective. > >> >> >>> >>> I can see dpipe tables exported by this device: >>> >>> $ devlink dpipe header show pci/:03:00.0 >>> >>> pci/:03:00.0: >>> name mlxsw_meta >>> field: >>>name erif_port bitwidth 32 mapping_type ifindex >>>name l3_forward bitwidth 1 >>>name l3_drop bitwidth 1 >>>name adj_index bitwidth 32 >>>name adj_size bitwidth 32 >>>name adj_hash_index bitwidth 32 >>> >>> name ipv6 >>> field: >>>name destination ip bitwidth 128 >>> >>> name ipv4 >>> field: >>>name destination ip bitwidth 32 >>> >>> name ethernet >>> field: >>>name destination mac bitwidth 48 >>> >>> but none mention 'kvd' or 'linear' or 'hash" and none of the other >>> various devlink options: >>> >>> $ devlink >>> Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help } >>> where OBJECT := { dev | port | sb | monitor | dpipe } >>> >>> seem to related to resources. >>> >>> So how does a user know what they are controlling by this 'resource' >>> option? Is the user expected to have a PRM or user guide on hand for the >>> specific device model that is being configured? >> >> The relation of specific dpipe table to specific resource is exposed by >> the kernel as well. Probably the iproute2 patch just does not display >> it. > > please provide an iproute2 patch that does so the full context if this > patch set can be reviewed from a user perspective. > As Yuval stated you are using the wrong command here. You are printing the headers not the tables. On each dpipe table you can see the resource it is using (the resource path aka host table uses /kvd/hash_single for example). This is already working. Just try it. >> >> >>> >>> Again, I have no objections to kvd, linear, hash, etc terms as they do >>> relate to Mellanox products. But kvd/linear, for example, does correlate >>> to industry standard concepts in some way. My request is that the >>> resource listing guide the user in some way, stating what these >>> resources mean. >> >> So the
Re: [patch net-next v2 00/10] Add support for resource abstraction
On Wed, Dec 27, 2017 at 8:34 AM, David Ahernwrote: > On 12/27/17 2:09 AM, Jiri Pirko wrote: >> Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote: >>> On 12/26/17 5:23 AM, Jiri Pirko wrote: From: Jiri Pirko Many of the ASIC's internal resources are limited and are shared between several hardware procedures. For example, unified hash-based memory can be used for many lookup purposes, like FDB and LPM. In many cases the user can provide a partitioning scheme for such a resource in order to perform fine tuning for his application. In such cases performing driver reload is needed for the changes to take place, thus this patchset also adds support for hot reload. Such an abstraction can be coupled with devlink's dpipe interface, which models the ASIC's pipeline as a graph of match/action tables. By modeling the hardware resource object, and by coupling it to several dpipe tables, further visibility can be achieved in order to debug ASIC-wide issues. The proposed interface will provide the user the ability to understand the limitations of the hardware, and receive notification regarding its occupancy. Furthermore, monitoring the resource occupancy can be done in real-time and can be useful in many cases. >>> >>> In the last RFC (not v1, but RFC) I asked for some kind of description >>> for each resource, and you and Arkadi have pushed back. Let's walk >>> through an example to see what I mean: >>> >>> $ devlink resource show pci/:03:00.0 >>> pci/:03:00.0: >>> name kvd size 245760 size_valid true >>> resources: >>>name linear size 98304 occ 0 >>>name hash_double size 60416 >>>name hash_single size 87040 >>> >>> So this 2700 has 3 resources that can be managed -- some table or >>> resource or something named 'kvd' with linear, hash_double and >>> hash_single sub-resources. What are these names referring too? The above >>> output gives no description, and 'kvd' is not an industry term. Further, >> >> This are internal resources specific to the ASIC. Would you like some >> description to each or something like that? > > devlink has some nice self-documenting capabilities. What's missing here > is a description of what the resource is used for in standard terms -- > ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a > short list versus an exhaustive list of everything it is used for. e.g., > Why would a user decrease linear and increase hash_single or vice versa? Arkadi, on what david says above, can the resource names and ids not be driver specific, but moved up to the switchdev layer and just map to fdb, host routes, nexthops table sizes etc ?. Can these generic networking resources then in-turn be mapped to kvd sizes by the driver ?
Re: [patch net-next v2 00/10] Add support for resource abstraction
> Hmm. That documents mainly sysfs. No mention of Netlink at all. But > maybe I missed it. Also, that defines the interface as is. However we > are talking about the data exchanged over the interface, not the > interface itself. I don't see how ASIC/HW specific thing, like for > example KVD in our case could be part of kernel ABI. You need to be very careful here. As soon as somebody starts using it, it might become an ABI. Or you need to clearly document it is not ABI, there is no guarantee it will not disappear or change its meaning in the next kernel, and it should be used with extreme caution. Personally, if DSA drivers were to expose such settings, i would consider them ABI, which people can rely on to remain stable. Andrew
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/27/17 7:15 AM, Jiri Pirko wrote: > Wed, Dec 27, 2017 at 02:08:03PM CET, and...@lunn.ch wrote: >>> This is misunderstanding I believe. This is not about ABI. That is well >>> defined by the netlink attributes. This is about meaning of particular >>> ASIC-specific internal resources. >> >> I would agree that the netlink attributed are clearly defined. But the >> meta information, what this ASIC specific internal resource means when >> you combine these attributes, is unclear. This meta information is >> also part of the ABI, and documenting giving users a hit what it >> means, and why they should change it, would be good practice. >> >> Look at sysfs. open/read/write are clearly defined, which is the >> equivalent of the netlink attributes. The meta information we document >> in Documentation/ABI/, what a file name means, what a value means, >> what other values it can take, etc. > > Hmm. That documents mainly sysfs. No mention of Netlink at all. But > maybe I missed it. Also, that defines the interface as is. However we > are talking about the data exchanged over the interface, not the > interface itself. I don't see how ASIC/HW specific thing, like for > example KVD in our case could be part of kernel ABI. That makes 0 sense > to me, sorry. > IMO, kernel documentation is not much better than vendor docs. A general networking admin may not have access to a kernel tree or have vendor docs at their finger tips. As I mentioned in the previous response, devlink attributes have self-documenting capabilities making that information available inline. That to me is the right place for a short, but reasonably sized description. For in-depth details users can cross-reference vendor docs.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/27/17 2:09 AM, Jiri Pirko wrote: > Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote: >> On 12/26/17 5:23 AM, Jiri Pirko wrote: >>> From: Jiri Pirko>>> >>> Many of the ASIC's internal resources are limited and are shared between >>> several hardware procedures. For example, unified hash-based memory can >>> be used for many lookup purposes, like FDB and LPM. In many cases the user >>> can provide a partitioning scheme for such a resource in order to perform >>> fine tuning for his application. In such cases performing driver reload is >>> needed for the changes to take place, thus this patchset also adds support >>> for hot reload. >>> >>> Such an abstraction can be coupled with devlink's dpipe interface, which >>> models the ASIC's pipeline as a graph of match/action tables. By modeling >>> the hardware resource object, and by coupling it to several dpipe tables, >>> further visibility can be achieved in order to debug ASIC-wide issues. >>> >>> The proposed interface will provide the user the ability to understand the >>> limitations of the hardware, and receive notification regarding its >>> occupancy. >>> Furthermore, monitoring the resource occupancy can be done in real-time and >>> can be useful in many cases. >> >> In the last RFC (not v1, but RFC) I asked for some kind of description >> for each resource, and you and Arkadi have pushed back. Let's walk >> through an example to see what I mean: >> >> $ devlink resource show pci/:03:00.0 >> pci/:03:00.0: >> name kvd size 245760 size_valid true >> resources: >>name linear size 98304 occ 0 >>name hash_double size 60416 >>name hash_single size 87040 >> >> So this 2700 has 3 resources that can be managed -- some table or >> resource or something named 'kvd' with linear, hash_double and >> hash_single sub-resources. What are these names referring too? The above >> output gives no description, and 'kvd' is not an industry term. Further, > > This are internal resources specific to the ASIC. Would you like some > description to each or something like that? devlink has some nice self-documenting capabilities. What's missing here is a description of what the resource is used for in standard terms -- ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a short list versus an exhaustive list of everything it is used for. e.g., Why would a user decrease linear and increase hash_single or vice versa? > > >> what are these sizes that a user can control? The output contains no >> units, no description, nothing. In short, the above output provides >> random numbers associated with random names. > > Units are now exposed from kernel, just this version of iproute2 patch > does not display it. please provide an iproute2 patch that does so the full context if this patch set can be reviewed from a user perspective. > > >> >> I can see dpipe tables exported by this device: >> >> $ devlink dpipe header show pci/:03:00.0 >> >> pci/:03:00.0: >> name mlxsw_meta >> field: >>name erif_port bitwidth 32 mapping_type ifindex >>name l3_forward bitwidth 1 >>name l3_drop bitwidth 1 >>name adj_index bitwidth 32 >>name adj_size bitwidth 32 >>name adj_hash_index bitwidth 32 >> >> name ipv6 >> field: >>name destination ip bitwidth 128 >> >> name ipv4 >> field: >>name destination ip bitwidth 32 >> >> name ethernet >> field: >>name destination mac bitwidth 48 >> >> but none mention 'kvd' or 'linear' or 'hash" and none of the other >> various devlink options: >> >> $ devlink >> Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help } >> where OBJECT := { dev | port | sb | monitor | dpipe } >> >> seem to related to resources. >> >> So how does a user know what they are controlling by this 'resource' >> option? Is the user expected to have a PRM or user guide on hand for the >> specific device model that is being configured? > > The relation of specific dpipe table to specific resource is exposed by > the kernel as well. Probably the iproute2 patch just does not display > it. please provide an iproute2 patch that does so the full context if this patch set can be reviewed from a user perspective. > > >> >> Again, I have no objections to kvd, linear, hash, etc terms as they do >> relate to Mellanox products. But kvd/linear, for example, does correlate >> to industry standard concepts in some way. My request is that the >> resource listing guide the user in some way, stating what these >> resources mean. > > So the showed relation to dpipe table would be enougn or you would still > like to see some description? I don't like the description concept here > as the relations to dpipe table should tell user exactly what he needs > to know. I believe it is useful to have a 1-line, short description that gives the user some memory jogger as to what the resource is used for. It does not have to be an exhaustive list, but the user should not have to do
Re: [patch net-next v2 00/10] Add support for resource abstraction
Wed, Dec 27, 2017 at 02:08:03PM CET, and...@lunn.ch wrote: >> This is misunderstanding I believe. This is not about ABI. That is well >> defined by the netlink attributes. This is about meaning of particular >> ASIC-specific internal resources. > >I would agree that the netlink attributed are clearly defined. But the >meta information, what this ASIC specific internal resource means when >you combine these attributes, is unclear. This meta information is >also part of the ABI, and documenting giving users a hit what it >means, and why they should change it, would be good practice. > >Look at sysfs. open/read/write are clearly defined, which is the >equivalent of the netlink attributes. The meta information we document >in Documentation/ABI/, what a file name means, what a value means, >what other values it can take, etc. Hmm. That documents mainly sysfs. No mention of Netlink at all. But maybe I missed it. Also, that defines the interface as is. However we are talking about the data exchanged over the interface, not the interface itself. I don't see how ASIC/HW specific thing, like for example KVD in our case could be part of kernel ABI. That makes 0 sense to me, sorry.
Re: [patch net-next v2 00/10] Add support for resource abstraction
> This is misunderstanding I believe. This is not about ABI. That is well > defined by the netlink attributes. This is about meaning of particular > ASIC-specific internal resources. I would agree that the netlink attributed are clearly defined. But the meta information, what this ASIC specific internal resource means when you combine these attributes, is unclear. This meta information is also part of the ABI, and documenting giving users a hit what it means, and why they should change it, would be good practice. Look at sysfs. open/read/write are clearly defined, which is the equivalent of the netlink attributes. The meta information we document in Documentation/ABI/, what a file name means, what a value means, what other values it can take, etc. Andrew
Re: [patch net-next v2 00/10] Add support for resource abstraction
Wed, Dec 27, 2017 at 09:23:31AM CET, and...@lunn.ch wrote: >> >$ devlink resource show pci/:03:00.0 >> >pci/:03:00.0: >> > name kvd size 245760 size_valid true >> > resources: >> >name linear size 98304 occ 0 >> >name hash_double size 60416 >> >name hash_single size 87040 >> > >> >So this 2700 has 3 resources that can be managed -- some table or >> >resource or something named 'kvd' with linear, hash_double and >> >hash_single sub-resources. What are these names referring too? The above >> >output gives no description, and 'kvd' is not an industry term. Further, >> >> This are internal resources specific to the ASIC. Would you like some >> description to each or something like that? > >The fact you have decided to expose them means you expect people to >change them. So yes, they need to be documented. Maybe add something >to Documentation/ABI/stable/ > >> So the showed relation to dpipe table would be enougn or you would still >> like to see some description? I don't like the description concept here >> as the relations to dpipe table should tell user exactly what he needs >> to know. > >Documenting the ABI is good practice. This is misunderstanding I believe. This is not about ABI. That is well defined by the netlink attributes. This is about meaning of particular ASIC-specific internal resources. > > Andrew >
RE: [patch net-next v2 00/10] Add support for resource abstraction
> >> Many of the ASIC's internal resources are limited and are shared between > >> several hardware procedures. For example, unified hash-based memory > can > >> be used for many lookup purposes, like FDB and LPM. In many cases the > user > >> can provide a partitioning scheme for such a resource in order to perform > >> fine tuning for his application. In such cases performing driver reload is > >> needed for the changes to take place, thus this patchset also adds support > >> for hot reload. > >> > >> Such an abstraction can be coupled with devlink's dpipe interface, which > >> models the ASIC's pipeline as a graph of match/action tables. By modeling > >> the hardware resource object, and by coupling it to several dpipe tables, > >> further visibility can be achieved in order to debug ASIC-wide issues. > >> > >> The proposed interface will provide the user the ability to understand the > >> limitations of the hardware, and receive notification regarding its > occupancy. > >> Furthermore, monitoring the resource occupancy can be done in real-time > and > >> can be useful in many cases. > > > >In the last RFC (not v1, but RFC) I asked for some kind of description > >for each resource, and you and Arkadi have pushed back. Let's walk > >through an example to see what I mean: > > > >$ devlink resource show pci/:03:00.0 > >pci/:03:00.0: > > name kvd size 245760 size_valid true > > resources: > >name linear size 98304 occ 0 > >name hash_double size 60416 > >name hash_single size 87040 > > > >So this 2700 has 3 resources that can be managed -- some table or > >resource or something named 'kvd' with linear, hash_double and > >hash_single sub-resources. What are these names referring too? The above > >output gives no description, and 'kvd' is not an industry term. Further, > > This are internal resources specific to the ASIC. Would you like some > description to each or something like that? > > > >what are these sizes that a user can control? The output contains no > >units, no description, nothing. In short, the above output provides > >random numbers associated with random names. > > Units are now exposed from kernel, just this version of iproute2 patch > does not display it. > > > > > >I can see dpipe tables exported by this device: > > > >$ devlink dpipe header show pci/:03:00.0 > > > >pci/:03:00.0: > > name mlxsw_meta > > field: > >name erif_port bitwidth 32 mapping_type ifindex > >name l3_forward bitwidth 1 > >name l3_drop bitwidth 1 > >name adj_index bitwidth 32 > >name adj_size bitwidth 32 > >name adj_hash_index bitwidth 32 > > > > name ipv6 > > field: > >name destination ip bitwidth 128 > > > > name ipv4 > > field: > >name destination ip bitwidth 32 > > > > name ethernet > > field: > >name destination mac bitwidth 48 > > > >but none mention 'kvd' or 'linear' or 'hash" and none of the other > >various devlink options: > > > >$ devlink > >Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help } > >where OBJECT := { dev | port | sb | monitor | dpipe } > > > >seem to related to resources. > > > >So how does a user know what they are controlling by this 'resource' > >option? Is the user expected to have a PRM or user guide on hand for the > >specific device model that is being configured? > > The relation of specific dpipe table to specific resource is exposed by > the kernel as well. Probably the iproute2 patch just does not display > it. It does, just under 'table' and not under 'header'. E.g.: # ./devlink/devlink dpipe -j -p table show pci/:03:00.0 ... },{ "resource_path": "/kvd/hash_single", "name": "mlxsw_host4", "size": 0, "counters_enabled": "false", "match": [{ "type": "field_exact", "header": "mlxsw_meta", "field": "erif_port", "mapping": "ifindex" },{ "type": "field_exact", "header": "ipv4", "field": "destination ip" } ], "action": [{ "type": "field_modify", "header": "ethernet", "field": "destination mac" } ] },{ ... > > > > > >Again, I have no objections to kvd, linear, hash, etc terms as they do > >relate to Mellanox products. But kvd/linear, for example, does correlate > >to industry standard concepts in some way. My request is that the > >resource listing guide the user in some way, stating what these > >resources mean. > > So the showed relation to dpipe table would be enougn or you would still > like to see some description? I don't like the description concept here > as the relations to dpipe table should tell user exactly what he needs > to know. > > > > >
Re: [patch net-next v2 00/10] Add support for resource abstraction
> >$ devlink resource show pci/:03:00.0 > >pci/:03:00.0: > > name kvd size 245760 size_valid true > > resources: > >name linear size 98304 occ 0 > >name hash_double size 60416 > >name hash_single size 87040 > > > >So this 2700 has 3 resources that can be managed -- some table or > >resource or something named 'kvd' with linear, hash_double and > >hash_single sub-resources. What are these names referring too? The above > >output gives no description, and 'kvd' is not an industry term. Further, > > This are internal resources specific to the ASIC. Would you like some > description to each or something like that? The fact you have decided to expose them means you expect people to change them. So yes, they need to be documented. Maybe add something to Documentation/ABI/stable/ > So the showed relation to dpipe table would be enougn or you would still > like to see some description? I don't like the description concept here > as the relations to dpipe table should tell user exactly what he needs > to know. Documenting the ABI is good practice. Andrew
Re: [patch net-next v2 00/10] Add support for resource abstraction
Wed, Dec 27, 2017 at 05:05:09AM CET, d...@cumulusnetworks.com wrote: >On 12/26/17 5:23 AM, Jiri Pirko wrote: >> From: Jiri Pirko>> >> Many of the ASIC's internal resources are limited and are shared between >> several hardware procedures. For example, unified hash-based memory can >> be used for many lookup purposes, like FDB and LPM. In many cases the user >> can provide a partitioning scheme for such a resource in order to perform >> fine tuning for his application. In such cases performing driver reload is >> needed for the changes to take place, thus this patchset also adds support >> for hot reload. >> >> Such an abstraction can be coupled with devlink's dpipe interface, which >> models the ASIC's pipeline as a graph of match/action tables. By modeling >> the hardware resource object, and by coupling it to several dpipe tables, >> further visibility can be achieved in order to debug ASIC-wide issues. >> >> The proposed interface will provide the user the ability to understand the >> limitations of the hardware, and receive notification regarding its >> occupancy. >> Furthermore, monitoring the resource occupancy can be done in real-time and >> can be useful in many cases. > >In the last RFC (not v1, but RFC) I asked for some kind of description >for each resource, and you and Arkadi have pushed back. Let's walk >through an example to see what I mean: > >$ devlink resource show pci/:03:00.0 >pci/:03:00.0: > name kvd size 245760 size_valid true > resources: >name linear size 98304 occ 0 >name hash_double size 60416 >name hash_single size 87040 > >So this 2700 has 3 resources that can be managed -- some table or >resource or something named 'kvd' with linear, hash_double and >hash_single sub-resources. What are these names referring too? The above >output gives no description, and 'kvd' is not an industry term. Further, This are internal resources specific to the ASIC. Would you like some description to each or something like that? >what are these sizes that a user can control? The output contains no >units, no description, nothing. In short, the above output provides >random numbers associated with random names. Units are now exposed from kernel, just this version of iproute2 patch does not display it. > >I can see dpipe tables exported by this device: > >$ devlink dpipe header show pci/:03:00.0 > >pci/:03:00.0: > name mlxsw_meta > field: >name erif_port bitwidth 32 mapping_type ifindex >name l3_forward bitwidth 1 >name l3_drop bitwidth 1 >name adj_index bitwidth 32 >name adj_size bitwidth 32 >name adj_hash_index bitwidth 32 > > name ipv6 > field: >name destination ip bitwidth 128 > > name ipv4 > field: >name destination ip bitwidth 32 > > name ethernet > field: >name destination mac bitwidth 48 > >but none mention 'kvd' or 'linear' or 'hash" and none of the other >various devlink options: > >$ devlink >Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help } >where OBJECT := { dev | port | sb | monitor | dpipe } > >seem to related to resources. > >So how does a user know what they are controlling by this 'resource' >option? Is the user expected to have a PRM or user guide on hand for the >specific device model that is being configured? The relation of specific dpipe table to specific resource is exposed by the kernel as well. Probably the iproute2 patch just does not display it. > >Again, I have no objections to kvd, linear, hash, etc terms as they do >relate to Mellanox products. But kvd/linear, for example, does correlate >to industry standard concepts in some way. My request is that the >resource listing guide the user in some way, stating what these >resources mean. So the showed relation to dpipe table would be enougn or you would still like to see some description? I don't like the description concept here as the relations to dpipe table should tell user exactly what he needs to know. > >IMO the above output is not user friendly and having to keep a PRM on >hand for each device model is not a realistic solution.
Re: [patch net-next v2 00/10] Add support for resource abstraction
On 12/26/17 5:23 AM, Jiri Pirko wrote: > From: Jiri Pirko> > Many of the ASIC's internal resources are limited and are shared between > several hardware procedures. For example, unified hash-based memory can > be used for many lookup purposes, like FDB and LPM. In many cases the user > can provide a partitioning scheme for such a resource in order to perform > fine tuning for his application. In such cases performing driver reload is > needed for the changes to take place, thus this patchset also adds support > for hot reload. > > Such an abstraction can be coupled with devlink's dpipe interface, which > models the ASIC's pipeline as a graph of match/action tables. By modeling > the hardware resource object, and by coupling it to several dpipe tables, > further visibility can be achieved in order to debug ASIC-wide issues. > > The proposed interface will provide the user the ability to understand the > limitations of the hardware, and receive notification regarding its occupancy. > Furthermore, monitoring the resource occupancy can be done in real-time and > can be useful in many cases. In the last RFC (not v1, but RFC) I asked for some kind of description for each resource, and you and Arkadi have pushed back. Let's walk through an example to see what I mean: $ devlink resource show pci/:03:00.0 pci/:03:00.0: name kvd size 245760 size_valid true resources: name linear size 98304 occ 0 name hash_double size 60416 name hash_single size 87040 So this 2700 has 3 resources that can be managed -- some table or resource or something named 'kvd' with linear, hash_double and hash_single sub-resources. What are these names referring too? The above output gives no description, and 'kvd' is not an industry term. Further, what are these sizes that a user can control? The output contains no units, no description, nothing. In short, the above output provides random numbers associated with random names. I can see dpipe tables exported by this device: $ devlink dpipe header show pci/:03:00.0 pci/:03:00.0: name mlxsw_meta field: name erif_port bitwidth 32 mapping_type ifindex name l3_forward bitwidth 1 name l3_drop bitwidth 1 name adj_index bitwidth 32 name adj_size bitwidth 32 name adj_hash_index bitwidth 32 name ipv6 field: name destination ip bitwidth 128 name ipv4 field: name destination ip bitwidth 32 name ethernet field: name destination mac bitwidth 48 but none mention 'kvd' or 'linear' or 'hash" and none of the other various devlink options: $ devlink Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help } where OBJECT := { dev | port | sb | monitor | dpipe } seem to related to resources. So how does a user know what they are controlling by this 'resource' option? Is the user expected to have a PRM or user guide on hand for the specific device model that is being configured? Again, I have no objections to kvd, linear, hash, etc terms as they do relate to Mellanox products. But kvd/linear, for example, does correlate to industry standard concepts in some way. My request is that the resource listing guide the user in some way, stating what these resources mean. IMO the above output is not user friendly and having to keep a PRM on hand for each device model is not a realistic solution.
[patch net-next v2 00/10] Add support for resource abstraction
From: Jiri PirkoMany of the ASIC's internal resources are limited and are shared between several hardware procedures. For example, unified hash-based memory can be used for many lookup purposes, like FDB and LPM. In many cases the user can provide a partitioning scheme for such a resource in order to perform fine tuning for his application. In such cases performing driver reload is needed for the changes to take place, thus this patchset also adds support for hot reload. Such an abstraction can be coupled with devlink's dpipe interface, which models the ASIC's pipeline as a graph of match/action tables. By modeling the hardware resource object, and by coupling it to several dpipe tables, further visibility can be achieved in order to debug ASIC-wide issues. The proposed interface will provide the user the ability to understand the limitations of the hardware, and receive notification regarding its occupancy. Furthermore, monitoring the resource occupancy can be done in real-time and can be useful in many cases. --- Userspace part prototype can be found at https://github.com/arkadis/iproute2/ at resource_dev branch. v1->v2 - Add resource size attribute. - Fix split bug. Arkadi Sharshevsky (10): devlink: Add per devlink instance lock devlink: Add support for resource abstraction devlink: Add support for reload devlink: Add relation between dpipe and resource mlxsw: pci: Add support for performing bus reset mlxsw: spectrum: Register KVD resources with devlink mlxsw: spectrum_dpipe: Connect dpipe tables to resources mlxsw: spectrum: Add support for getting kvdl occupancy mlxsw: pci: Add support for getting resource through devlink mlxsw: core: Add support for reload drivers/net/ethernet/mellanox/mlxsw/core.c | 85 ++- drivers/net/ethernet/mellanox/mlxsw/core.h | 16 +- drivers/net/ethernet/mellanox/mlxsw/i2c.c | 5 +- drivers/net/ethernet/mellanox/mlxsw/pci.c | 98 ++-- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 205 drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 13 + .../net/ethernet/mellanox/mlxsw/spectrum_dpipe.c | 72 ++- .../net/ethernet/mellanox/mlxsw/spectrum_kvdl.c| 26 + include/net/devlink.h | 97 include/uapi/linux/devlink.h | 21 + net/core/devlink.c | 573 ++--- 11 files changed, 1079 insertions(+), 132 deletions(-) -- 2.9.5