Re: Driver profiles RFC
Fri, Aug 11, 2017 at 11:57:38PM CEST, kubak...@wp.pl wrote:
>On Tue, 8 Aug 2017 16:15:41 +0300, Arkadi Sharshevsky wrote:
>> Driver <--> Devlink API
>> =======================
>> Each driver will register its resources with default values at init, in a similar way to DPIPE table registration. In case those resources already exist, the default values are discarded. The user will be able to dump and update the resources. In order for the changes to take place, the user will need to re-initiate the driver via a specific devlink knob.
>
>What seems missing from the examples is the ability to dump the different states - the "pending" configuration and the currently applied one.

Agreed, I miss this too.

>> The above described procedure will require an extra reload of the driver. This can be improved as a future optimization.
>
>I'm a bit lost, this says driver reload is required...

The driver will provide a "commit" callback. What it does there is up to the driver. In the case of mlxsw, we have to re-instantiate the driver, because a FW reset is required.

>> UAPI
>> ====
>>
>> The user will be able to update the resources on a per-resource basis:
>>
>> $devlink dpipe resource set pci/:03:00.0 Mem_Linear 2M
>>
>> For some resources the size is fixed; for example, the size of the internal memory cannot be changed. It is provided merely in order to reflect the nested structure of the resources and to convey to the user that Mem = Linear + Hash, thus a set operation on it will fail.
>>
>> The user can dump the current resource configuration:
>>
>> #devlink dpipe resource dump tree pci/:03:00.0 Mem
>>
>> The user can specify 'tree' in order to show all the nested resources under the specified one. In case no 'resource name' is specified, the TOP hierarchy will be dumped.
>>
>> After a successful resource update the driver should be re-instantiated in order for the changes to take place:
>>
>> $devlink reload pci/:03:00.0
>
>... but this shows a devlink reload trigger, so no driver reload?
>Were you describing two possible solutions? One with a persistent kernel database of configs (persistent across driver reloads), and one with no persistence, where the driver manages re-init internally when triggered via devlink?

This is a misunderstanding. There is no driver reload (modprobe -r && modprobe).

>Another thing that comes to mind is - in case HW/FW reinit takes long, would it make sense to incorporate some form of pre-population of those defaults somehow? If the user knows exactly the config they want upon boot, it would seem cleaner if the reconfig did not have to happen and devices started out in the right mode.

We were discussing it. There are a couple of ways to achieve that, none of them very nice. So we decided to leave this for later, when/if it is needed.
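The pending/applied split and the driver "commit" callback discussed above can be sketched roughly as follows. This is a hypothetical Python model for illustration only; the class and callback names are invented and are not devlink APIs:

```python
# Sketch of the "pending vs. applied" resource model discussed above.
# All names here are hypothetical illustrations, not real devlink APIs.

class ResourceConfig:
    """Holds an applied size and an optional pending size per resource."""

    def __init__(self, defaults):
        self.applied = dict(defaults)   # what the device is running with
        self.pending = {}               # staged changes, visible via dump

    def set_size(self, name, size):
        if name not in self.applied:
            raise KeyError(f"unknown resource: {name}")
        self.pending[name] = size       # staged only; device untouched

    def dump(self):
        # Expose both states so the user can see what will change on reload.
        return {
            name: {"applied": self.applied[name],
                   "pending": self.pending.get(name)}
            for name in self.applied
        }

    def commit(self, driver_commit_cb):
        # The driver decides what "commit" means; for mlxsw this would be
        # a full re-instantiation because a FW reset is required.
        driver_commit_cb(dict(self.applied), dict(self.pending))
        self.applied.update(self.pending)
        self.pending.clear()


cfg = ResourceConfig({"Mem_Linear": "1M", "Mem_Hash": "2M"})
cfg.set_size("Mem_Linear", "2M")
print(cfg.dump()["Mem_Linear"])   # applied still 1M, pending 2M
cfg.commit(lambda applied, pending: None)
print(cfg.dump()["Mem_Linear"])   # applied now 2M, no pending
```

The point of the split is that a dump can always show both states, which is exactly what Jakub asks for above.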
Re: Driver profiles RFC
On Tue, 8 Aug 2017 16:15:41 +0300, Arkadi Sharshevsky wrote:
> Driver <--> Devlink API
> =======================
> Each driver will register its resources with default values at init, in a similar way to DPIPE table registration. In case those resources already exist, the default values are discarded. The user will be able to dump and update the resources. In order for the changes to take place, the user will need to re-initiate the driver via a specific devlink knob.

What seems missing from the examples is the ability to dump the different states - the "pending" configuration and the currently applied one.

> The above described procedure will require an extra reload of the driver. This can be improved as a future optimization.

I'm a bit lost, this says driver reload is required...

> UAPI
> ====
>
> The user will be able to update the resources on a per-resource basis:
>
> $devlink dpipe resource set pci/:03:00.0 Mem_Linear 2M
>
> For some resources the size is fixed; for example, the size of the internal memory cannot be changed. It is provided merely in order to reflect the nested structure of the resources and to convey to the user that Mem = Linear + Hash, thus a set operation on it will fail.
>
> The user can dump the current resource configuration:
>
> #devlink dpipe resource dump tree pci/:03:00.0 Mem
>
> The user can specify 'tree' in order to show all the nested resources under the specified one. In case no 'resource name' is specified, the TOP hierarchy will be dumped.
>
> After a successful resource update the driver should be re-instantiated in order for the changes to take place:
>
> $devlink reload pci/:03:00.0

... but this shows a devlink reload trigger, so no driver reload? Were you describing two possible solutions? One with a persistent kernel database of configs (persistent across driver reloads), and one with no persistence, where the driver manages re-init internally when triggered via devlink?
Another thing that comes to mind is - in case HW/FW reinit takes long, would it make sense to incorporate some form of pre-population of those defaults somehow? If the user knows exactly the config they want upon boot, it would seem cleaner if the reconfig did not have to happen and devices started out in the right mode.
Re: Driver profiles RFC
On Wed, Aug 9, 2017 at 4:43 AM, Arkadi Sharshevsky wrote:
>
> On 08/08/2017 07:08 PM, Roopa Prabhu wrote:
>> On Tue, Aug 8, 2017 at 6:15 AM, Arkadi Sharshevsky wrote:
>>> [snip]
>>> User Configuration
>>> ------------------
>>> Such a UAPI is very low level, and thus an average user may not know how to adjust these sizes according to his needs. The vendor can provide several tested configuration files that the user can choose from. Each config file will be measured in terms of: MAC addresses, L3 Neighbors (IPv4, IPv6), LPM entries (IPv4, IPv6) in order to provide approximate results. By this an average user will choose one of the provided ones. Furthermore, a more advanced user could play with the numbers for his personal benefit.
>>>
>>> Reference
>>> =========
>>> [1] https://netdevconf.org/2.1/papers/dpipe_netdev_2_1.odt
>>>
>>
>> Thanks for sending this out. There is very much a need for this, and agreed, a user-space app config can translate to the values they want while the kernel API stays a low-level API.
>>
>> But how about we align these resource limits with the kernel resource limits? For example, we usually map L3 HW neighbor limits to the kernel software gc_thresh values (which are configurable via sysctl). This is one way to give the user immediate feedback on resource-full errors. It would be nice if we could introduce limits for routes and MAC addresses. Defaults could be what they are today, but user configurable... like I said, the neighbor subsystem already allows this.
>>
>
> Hi Roopa, thanks for the feedback.
>
> Regarding aligning the hardware table sizes with the kernel software limits:
> The hardware resources (internal memory) are much more limited than the software ones. Please consider the following scenario:
>
> 1. The user adds a limit to the neighbor table (as you suggested), which uses the hash memory portion.
> 2. The user adds many routes; the routes potentially use the hash memory resource as well.
> 3. The kernel adds some neighbors dynamically; the neighbor offloading fails due to lack of this shared resource, and the user gets confused because it is lower than what he configured in 1).
>
> Thus, providing a max size on a specific table is not well defined, because the underlying shared resource is limited, and the feedback the user gets may not be very accurate. Furthermore, guessing the resource partitioning based only on the subset of tables which use it makes me a little bit uncomfortable.

yep, understood. I am aware of some of these problems as well.

> The proposed API aims at solving this issue by providing an abstraction for this HW behavior and the connection to the hardware tables, thus providing a more accurate and well-defined description of the system.
>
> I totally agree that this API should be enhanced in order to provide the occupancy of this 'resource'. For example, the user first observes the tables and sees the resource<->table mapping, then sees the resource occupancy:
>
> #devlink dpipe resource occupancy pci/:03:00.0 Mem
>
> By this the user can understand the offloading limitation, and maybe figure out that he should change the partitioning.

yes, sounds good.
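The scenario Arkadi describes — per-table limits becoming meaningless once several tables draw from one shared pool — can be sketched as a small model. This is an illustrative Python sketch under invented names, not the actual driver logic:

```python
# Sketch of why per-table limits are ill-defined when tables share one
# resource pool, and what an "occupancy" query would report.
# Hypothetical model, not actual devlink/mlxsw code.

class SharedResource:
    def __init__(self, name, size):
        self.name = name
        self.size = size
        self.used_by = {}               # table name -> entries consumed

    def consume(self, table, entries):
        if self.occupancy() + entries > self.size:
            # Offload fails even if 'table' is below any per-table limit,
            # because *other* tables already consumed the shared pool.
            return False
        self.used_by[table] = self.used_by.get(table, 0) + entries
        return True

    def occupancy(self):
        return sum(self.used_by.values())


hash_mem = SharedResource("Mem_Hash", size=1000)
hash_mem.consume("neigh", 400)       # user-configured neighbor entries
hash_mem.consume("routes", 550)      # routes use the same hash memory
ok = hash_mem.consume("neigh", 100)  # fails: only 50 entries remain
print(ok, hash_mem.occupancy())      # -> False 950
```

This is the confusion in step 3 above: the neighbor table is under its own configured limit, yet the offload still fails, and only an occupancy view of the shared resource explains why.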
Re: Driver profiles RFC
On 08/08/2017 07:08 PM, Roopa Prabhu wrote:
> On Tue, Aug 8, 2017 at 6:15 AM, Arkadi Sharshevsky wrote:
>> Drivers may require driver-specific information during the init stage. For example, a memory-based shared resource which should be segmented for different ASIC processes, such as FDB and LPM lookups.
>>
>> The current mlxsw implementation assumes some default values, which are const and cannot be changed due to lack of a UAPI for their configuration (module params is not an option). Those values can greatly impact the scale of the hardware processes, such as the maximum sizes of the FDB/LPM tables. Furthermore, those values should be consistent between driver reloads.
>>
>> The interface called DPIPE [1] was introduced in order to provide an abstraction of the hardware pipeline. This RFC letter suggests solving this problem by enhancing the DPIPE hardware abstraction model.
>>
>> DPIPE Resource
>> ==============
>>
>> In order to represent ASIC-wide resource space, a new object called "resource" should be introduced. It was originally suggested as a future extension in [1] in order to give the user visibility about the tables' limitation due to some shared resource. For example, FDB and LPM share a common hash-based memory. This abstraction can also be used for providing static configuration for such resources.
>>
>> Resource
>> --------
>> The resource object defines a generic hardware resource, like memory, a counter pool, etc., which can be described by name and size. The resource can be nested; for example, the internal ASIC's memory can be split into two parts, as can be seen in the following diagram:
>>
>>       +--------------+
>>       | Internal Mem |
>>       |              |
>>       |  Size: 3M*   |
>>       +--------------+
>>         /          \
>> +----------+  +----------+
>> |  Linear  |  |   Hash   |
>> |          |  |          |
>> | Size: 1M |  | Size: 2M |
>> +----------+  +----------+
>>
>> *The numbers are provided as an example and do not reflect real ASIC resource sizes
>>
>> Where the hash portion is used for FDB/LPM table lookups, and the linear one is used by the routing adjacency table. Each resource can be described by a name, size and list of children. Example for dumping the structure described above:
>>
>> #devlink dpipe resource dump tree pci/:03:00.0 Mem
>> {
>>     "resource": {
>>         "pci/:03:00.0": [{
>>             "name": "Mem",
>>             "size": "3M",
>>             "resource": [{
>>                 "name": "Mem_Linear",
>>                 "size": "1M"
>>             }, {
>>                 "name": "Mem_Hash",
>>                 "size": "2M"
>>             }]
>>         }]
>>     }
>> }
>>
>> Each DPIPE table can be connected to one resource.
>>
>> Driver <--> Devlink API
>> =======================
>> Each driver will register its resources with default values at init, in a similar way to DPIPE table registration. In case those resources already exist, the default values are discarded. The user will be able to dump and update the resources. In order for the changes to take place, the user will need to re-initiate the driver via a specific devlink knob.
>>
>> The above described procedure will require an extra reload of the driver. This can be improved as a future optimization.
>>
>> UAPI
>> ====
>>
>> The user will be able to update the resources on a per-resource basis:
>>
>> $devlink dpipe resource set pci/:03:00.0 Mem_Linear 2M
>>
>> For some resources the size is fixed; for example, the size of the internal memory cannot be changed. It is provided merely in order to reflect the nested structure of the resources and to convey to the user that Mem = Linear + Hash, thus a set operation on it will fail.
>>
>> The user can dump the current resource configuration:
>>
>> #devlink dpipe resource dump tree pci/:03:00.0 Mem
>>
>> The user can specify 'tree' in order to show all the nested resources under the specified one. In case no 'resource name' is specified, the TOP hierarchy will be dumped.
>>
>> After a successful resource update the driver should be re-instantiated in order for the changes to take place:
>>
>> $devlink reload pci/:03:00.0
>>
>> User Configuration
>> ------------------
>> Such a UAPI is very low level, and thus an average user may not know how to adjust these sizes according to his needs. The vendor can provide several tested configuration files that the user can choose from. Each config file will be
Re: Driver profiles RFC
On Tue, Aug 8, 2017 at 6:15 AM, Arkadi Sharshevsky wrote:
> Drivers may require driver-specific information during the init stage. For example, a memory-based shared resource which should be segmented for different ASIC processes, such as FDB and LPM lookups.
>
> The current mlxsw implementation assumes some default values, which are const and cannot be changed due to lack of a UAPI for their configuration (module params is not an option). Those values can greatly impact the scale of the hardware processes, such as the maximum sizes of the FDB/LPM tables. Furthermore, those values should be consistent between driver reloads.
>
> The interface called DPIPE [1] was introduced in order to provide an abstraction of the hardware pipeline. This RFC letter suggests solving this problem by enhancing the DPIPE hardware abstraction model.
>
> DPIPE Resource
> ==============
>
> In order to represent ASIC-wide resource space, a new object called "resource" should be introduced. It was originally suggested as a future extension in [1] in order to give the user visibility about the tables' limitation due to some shared resource. For example, FDB and LPM share a common hash-based memory. This abstraction can also be used for providing static configuration for such resources.
>
> Resource
> --------
> The resource object defines a generic hardware resource, like memory, a counter pool, etc., which can be described by name and size. The resource can be nested; for example, the internal ASIC's memory can be split into two parts, as can be seen in the following diagram:
>
>       +--------------+
>       | Internal Mem |
>       |              |
>       |  Size: 3M*   |
>       +--------------+
>         /          \
> +----------+  +----------+
> |  Linear  |  |   Hash   |
> |          |  |          |
> | Size: 1M |  | Size: 2M |
> +----------+  +----------+
>
> *The numbers are provided as an example and do not reflect real ASIC resource sizes
>
> Where the hash portion is used for FDB/LPM table lookups, and the linear one is used by the routing adjacency table. Each resource can be described by a name, size and list of children. Example for dumping the structure described above:
>
> #devlink dpipe resource dump tree pci/:03:00.0 Mem
> {
>     "resource": {
>         "pci/:03:00.0": [{
>             "name": "Mem",
>             "size": "3M",
>             "resource": [{
>                 "name": "Mem_Linear",
>                 "size": "1M"
>             }, {
>                 "name": "Mem_Hash",
>                 "size": "2M"
>             }]
>         }]
>     }
> }
>
> Each DPIPE table can be connected to one resource.
>
> Driver <--> Devlink API
> =======================
> Each driver will register its resources with default values at init, in a similar way to DPIPE table registration. In case those resources already exist, the default values are discarded. The user will be able to dump and update the resources. In order for the changes to take place, the user will need to re-initiate the driver via a specific devlink knob.
>
> The above described procedure will require an extra reload of the driver. This can be improved as a future optimization.
>
> UAPI
> ====
>
> The user will be able to update the resources on a per-resource basis:
>
> $devlink dpipe resource set pci/:03:00.0 Mem_Linear 2M
>
> For some resources the size is fixed; for example, the size of the internal memory cannot be changed. It is provided merely in order to reflect the nested structure of the resources and to convey to the user that Mem = Linear + Hash, thus a set operation on it will fail.
>
> The user can dump the current resource configuration:
>
> #devlink dpipe resource dump tree pci/:03:00.0 Mem
>
> The user can specify 'tree' in order to show all the nested resources under the specified one. In case no 'resource name' is specified, the TOP hierarchy will be dumped.
>
> After a successful resource update the driver should be re-instantiated in order for the changes to take place:
>
> $devlink reload pci/:03:00.0
>
> User Configuration
> ------------------
> Such a UAPI is very low level, and thus an average user may not know how to adjust these sizes according to his needs. The vendor can provide several tested configuration files that the user can choose from. Each config file will be measured in terms of: MAC addresses, L3 Neighbors (IPv4, IPv6), LPM entries (IPv4, IPv6) in order to provide approximate results. By this an average user will choose one
Re: Driver profiles RFC
On 08/08/2017 04:54 PM, Andrew Lunn wrote:
> On Tue, Aug 08, 2017 at 04:15:41PM +0300, Arkadi Sharshevsky wrote:
>> Drivers may require driver-specific information during the init stage. For example, a memory-based shared resource which should be segmented for different ASIC processes, such as FDB and LPM lookups.
>
> Hi Arkadi
>
> Have you looked around other subsystems to see if they have already solved this problem?
>

One obvious possible solution, which other subsystems use, is module params, which is not acceptable.

> How about GPUs? Do they have a similar requirement?

Seems they are using module params. Furthermore, I checked the DRM API and such a feature is not supported.

>
> This seems like a generic problem for 'smart' peripherals. How would you use dpipe with a GPU for example?
>
> Andrew
>

Thanks for the review.
Arkadi
Re: Driver profiles RFC
On Tue, Aug 08, 2017 at 04:15:41PM +0300, Arkadi Sharshevsky wrote:
> Drivers may require driver specific information during the init stage. For example, memory based shared resource which should be segmented for different ASIC processes, such as FDB and LPM lookups.

Hi Arkadi

Have you looked around other subsystems to see if they have already solved this problem?

How about GPUs? Do they have a similar requirement?

This seems like a generic problem for 'smart' peripherals. How would you use dpipe with a GPU for example?

Andrew
Re: Driver profiles RFC
Tue, Aug 08, 2017 at 03:15:41PM CEST, arka...@mellanox.com wrote:
>Drivers may require driver-specific information during the init stage. For example, a memory-based shared resource which should be segmented for different ASIC processes, such as FDB and LPM lookups.
>
>The current mlxsw implementation assumes some default values, which are const and cannot be changed due to lack of a UAPI for their configuration (module params is not an option). Those values can greatly impact the scale of the hardware processes, such as the maximum sizes of the FDB/LPM tables. Furthermore, those values should be consistent between driver reloads.
>
>The interface called DPIPE [1] was introduced in order to provide an abstraction of the hardware pipeline. This RFC letter suggests solving this problem by enhancing the DPIPE hardware abstraction model.
>
>DPIPE Resource
>==============
>
>In order to represent ASIC-wide resource space, a new object called "resource" should be introduced. It was originally suggested as a future extension in [1] in order to give the user visibility about the tables' limitation due to some shared resource. For example, FDB and LPM share a common hash-based memory. This abstraction can also be used for providing static configuration for such resources.
>
>Resource
>--------
>The resource object defines a generic hardware resource, like memory, a counter pool, etc., which can be described by name and size. The resource can be nested; for example, the internal ASIC's memory can be split into two parts, as can be seen in the following diagram:
>
>      +--------------+
>      | Internal Mem |
>      |              |
>      |  Size: 3M*   |
>      +--------------+
>        /          \
>+----------+  +----------+
>|  Linear  |  |   Hash   |
>|          |  |          |
>| Size: 1M |  | Size: 2M |
>+----------+  +----------+
>
>*The numbers are provided as an example and do not reflect real ASIC resource sizes
>
>Where the hash portion is used for FDB/LPM table lookups, and the linear one is used by the routing adjacency table. Each resource can be described by a name, size and list of children. Example for dumping the structure described above:
>
>#devlink dpipe resource dump tree pci/:03:00.0 Mem
>{
>    "resource": {
>        "pci/:03:00.0": [{
>            "name": "Mem",
>            "size": "3M",
>            "resource": [{
>                "name": "Mem_Linear",
>                "size": "1M"
>            }, {
>                "name": "Mem_Hash",
>                "size": "2M"
>            }]
>        }]
>    }
>}

This is dumped from the kernel either as a list or as a tree using nesting. I think that a list makes more sense, and userspace can assemble the tree according to references.

>Each DPIPE table can be connected to one resource.
>
>Driver <--> Devlink API
>=======================
>Each driver will register its resources with default values at init, in a similar way to DPIPE table registration. In case those resources already exist, the default values are discarded. The user will be able to dump and update the resources. In order for the changes to take place, the user will need to re-initiate the driver via a specific devlink knob.
>
>The above described procedure will require an extra reload of the driver. This can be improved as a future optimization.
>
>UAPI
>====
>
>The user will be able to update the resources on a per-resource basis:
>
>$devlink dpipe resource set pci/:03:00.0 Mem_Linear 2M
>
>For some resources the size is fixed; for example, the size of the internal memory cannot be changed. It is provided merely in order to reflect the nested structure of the resources and to convey to the user that Mem = Linear + Hash, thus a set operation on it will fail.
>
>The user can dump the current resource configuration:
>
>#devlink dpipe resource dump tree pci/:03:00.0 Mem
>
>The user can specify 'tree' in order to show all the nested resources under the specified one. In case no 'resource name' is specified, the TOP hierarchy will be dumped.
>
>After a successful resource update the driver should be re-instantiated in order for the changes to take place:
>
>$devlink reload pci/:03:00.0
>
>User Configuration
>------------------
>Such a UAPI is very low level, and thus an average user may not know how to adjust these sizes according to his needs. The vendor can provide several tested configuration files that the user can choose from. Each config file will be measured in terms of: MAC addresses, L3 Neighbors (IPv4, IPv6), LPM entries (IPv4, IPv6) in order to provide approximate
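Jiri's suggestion above — dump the resources from the kernel as a flat list and let userspace assemble the tree from parent references — can be sketched like this. The flat record format here is invented for illustration and does not reflect the actual devlink netlink layout:

```python
# Sketch: userspace reassembling a resource tree from a flat kernel dump
# where each record carries an optional reference to its parent.
# Hypothetical record format, not the real devlink wire format.

def assemble_tree(flat):
    """flat: list of dicts with 'name', 'size' and optional 'parent'."""
    nodes = {r["name"]: {"name": r["name"], "size": r["size"], "resource": []}
             for r in flat}
    roots = []
    for r in flat:
        node = nodes[r["name"]]
        parent = r.get("parent")
        if parent is None:
            roots.append(node)          # top-level resource
        else:
            nodes[parent]["resource"].append(node)
    return roots


flat = [
    {"name": "Mem", "size": "3M"},
    {"name": "Mem_Linear", "size": "1M", "parent": "Mem"},
    {"name": "Mem_Hash", "size": "2M", "parent": "Mem"},
]
tree = assemble_tree(flat)
print(tree[0]["name"], [c["name"] for c in tree[0]["resource"]])
# -> Mem ['Mem_Linear', 'Mem_Hash']
```

Keeping the kernel-side dump flat keeps the netlink handling simple, and nesting becomes purely a presentation concern of the `devlink` tool.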
Driver profiles RFC
Drivers may require driver-specific information during the init stage. For example, a memory-based shared resource which should be segmented for different ASIC processes, such as FDB and LPM lookups.

The current mlxsw implementation assumes some default values, which are const and cannot be changed due to lack of a UAPI for their configuration (module params is not an option). Those values can greatly impact the scale of the hardware processes, such as the maximum sizes of the FDB/LPM tables. Furthermore, those values should be consistent between driver reloads.

The interface called DPIPE [1] was introduced in order to provide an abstraction of the hardware pipeline. This RFC letter suggests solving this problem by enhancing the DPIPE hardware abstraction model.

DPIPE Resource
==============

In order to represent ASIC-wide resource space, a new object called "resource" should be introduced. It was originally suggested as a future extension in [1] in order to give the user visibility about the tables' limitation due to some shared resource. For example, FDB and LPM share a common hash-based memory. This abstraction can also be used for providing static configuration for such resources.

Resource
--------
The resource object defines a generic hardware resource, like memory, a counter pool, etc., which can be described by name and size. The resource can be nested; for example, the internal ASIC's memory can be split into two parts, as can be seen in the following diagram:

      +--------------+
      | Internal Mem |
      |              |
      |  Size: 3M*   |
      +--------------+
        /          \
+----------+  +----------+
|  Linear  |  |   Hash   |
|          |  |          |
| Size: 1M |  | Size: 2M |
+----------+  +----------+

*The numbers are provided as an example and do not reflect real ASIC resource sizes

Where the hash portion is used for FDB/LPM table lookups, and the linear one is used by the routing adjacency table. Each resource can be described by a name, size and list of children. Example for dumping the structure described above:

#devlink dpipe resource dump tree pci/:03:00.0 Mem
{
    "resource": {
        "pci/:03:00.0": [{
            "name": "Mem",
            "size": "3M",
            "resource": [{
                "name": "Mem_Linear",
                "size": "1M"
            }, {
                "name": "Mem_Hash",
                "size": "2M"
            }]
        }]
    }
}

Each DPIPE table can be connected to one resource.

Driver <--> Devlink API
=======================
Each driver will register its resources with default values at init, in a similar way to DPIPE table registration. In case those resources already exist, the default values are discarded. The user will be able to dump and update the resources. In order for the changes to take place, the user will need to re-initiate the driver via a specific devlink knob.

The above described procedure will require an extra reload of the driver. This can be improved as a future optimization.

UAPI
====

The user will be able to update the resources on a per-resource basis:

$devlink dpipe resource set pci/:03:00.0 Mem_Linear 2M

For some resources the size is fixed; for example, the size of the internal memory cannot be changed. It is provided merely in order to reflect the nested structure of the resources and to convey to the user that Mem = Linear + Hash, thus a set operation on it will fail.

The user can dump the current resource configuration:

#devlink dpipe resource dump tree pci/:03:00.0 Mem

The user can specify 'tree' in order to show all the nested resources under the specified one. In case no 'resource name' is specified, the TOP hierarchy will be dumped.

After a successful resource update the driver should be re-instantiated in order for the changes to take place:

$devlink reload pci/:03:00.0

User Configuration
------------------
Such a UAPI is very low level, and thus an average user may not know how to adjust these sizes according to his needs. The vendor can provide several tested configuration files that the user can choose from. Each config file will be measured in terms of: MAC addresses, L3 Neighbors (IPv4, IPv6), LPM entries (IPv4, IPv6) in order to provide approximate results. By this an average user will choose one of the provided ones. Furthermore, a more advanced user could play with the numbers for his personal benefit.

Reference
=========
[1] https://netdevconf.org/2.1/papers/dpipe_netdev_2_1.odt
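The nesting rule in the RFC — a fixed-size parent whose children partition it, so that a set on the parent fails while the children can be repartitioned — can be sketched as a small model. This is a hypothetical Python illustration; the class and names are invented and do not correspond to the real devlink implementation:

```python
# Sketch of the nested-resource rule: the parent ("Mem") has a fixed total,
# so a set on it fails, while its children may be repartitioned as long as
# Linear + Hash still adds up to the parent. Hypothetical model only.

UNIT = {"K": 2**10, "M": 2**20}

def parse(s):
    """Parse sizes like '1M' or '512K' into bytes."""
    return int(s[:-1]) * UNIT[s[-1]]

class Resource:
    def __init__(self, name, size, children=(), fixed=False):
        self.name, self.size = name, parse(size)
        self.children, self.fixed = list(children), fixed

    def set_size(self, size):
        if self.fixed:
            # Mirrors the RFC: a set on a fixed resource must fail.
            raise ValueError(f"{self.name}: size is fixed")
        self.size = parse(size)

    def valid(self):
        # Children must exactly partition the parent (Mem = Linear + Hash).
        return (not self.children or
                sum(c.size for c in self.children) == self.size)


linear = Resource("Mem_Linear", "1M")
hsh = Resource("Mem_Hash", "2M")
mem = Resource("Mem", "3M", children=[linear, hsh], fixed=True)

linear.set_size("2M")     # allowed, but now 2M + 2M != 3M
print(mem.valid())        # -> False
hsh.set_size("1M")        # rebalance the partition
print(mem.valid())        # -> True
```

A vendor "profile" as described in the User Configuration section would then simply be a consistent set of child sizes that already satisfies the partition check.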