Hi Simon,

On Tue, Sep 1, 2015 at 11:22 AM, Bin Meng <[email protected]> wrote:
> Hi Simon,
>
> On Tue, Sep 1, 2015 at 11:12 AM, Simon Glass <[email protected]> wrote:
>> Hi Bin,
>>
>> On 31 August 2015 at 21:04, Bin Meng <[email protected]> wrote:
>>> Hi Simon,
>>>
>>> On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <[email protected]> wrote:
>>>> Hi Bin,
>>>>
>>>> On 31 August 2015 at 19:40, Bin Meng <[email protected]> wrote:
>>>>> Hi Simon,
>>>>>
>>>>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <[email protected]> wrote:
>>>>>> Hi Bin,
>>>>>>
>>>>>> On 31 August 2015 at 08:04, Bin Meng <[email protected]> wrote:
>>>>>>> Hi Simon,
>>>>>>>
>>>>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <[email protected]> wrote:
>>>>>>>> Hi Bin,
>>>>>>>>
>>>>>>>> On 31 August 2015 at 07:43, Bin Meng <[email protected]> wrote:
>>>>>>>>> Hi Simon,
>>>>>>>>>
>>>>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> Hi Bin,
>>>>>>>>>>
>>>>>>>>>> On 31 August 2015 at 03:52, Bin Meng <[email protected]> wrote:
>>>>>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>>>>>> to use dm pci. The Intel Quark SoC has a low-end x86 processor
>>>>>>>>>>> running at only 400MHz, and the most time-consuming part is the
>>>>>>>>>>> MRC. Each MRC register programming requires indirect access via
>>>>>>>>>>> the PCI bus. With dm pci, accessing PCI configuration space has
>>>>>>>>>>> some overhead. Unfortunately this per-access overhead accumulates
>>>>>>>>>>> over the whole MRC process, and finally leads to double the boot
>>>>>>>>>>> time (25 seconds) compared to before (12 seconds).
>>>>>>>>>>>
>>>>>>>>>>> To speed up the boot, create an optimized version of the PCI
>>>>>>>>>>> config read/write routines without bothering to go through driver
>>>>>>>>>>> model. Now it only takes about 3 seconds to finish MRC, which is
>>>>>>>>>>> really fast (8 times faster than dm pci, or 4 times faster than
>>>>>>>>>>> before).
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Bin Meng <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>>>>>
>>>>>>>>>> Before I delve into the patch - with driver model we are using the
>>>>>>>>>> I/O method - see pci_x86_read_config(). Is that the source of the
>>>>>>>>>> slowdown, or is it just general driver model overhead?
>>>>>>>>>
>>>>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program the
>>>>>>>>> DDR controller. Inside msg_port.c, pci_write_config_dword() and
>>>>>>>>> pci_read_config_dword() are called.
>>>>>>>>>
>>>>>>>>> With driver model, the overhead is:
>>>>>>>>>
>>>>>>>>> pci_write_config_dword() -> pci_write_config32() ->
>>>>>>>>> pci_write_config() -> uclass_get_device_by_seq(), then
>>>>>>>>> pci_bus_write_config() finally calls pci_x86_read_config().
>>>>>>>>>
>>>>>>>>> Without driver model, there is still some overhead (so previously
>>>>>>>>> the MRC time was about 12 seconds):
>>>>>>>>>
>>>>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>>>>>
>>>>>>>>> With my optimized version, pci_write_config_dword() directly does a
>>>>>>>>> hardcoded dword-size PCI config access, without the need to consider
>>>>>>>>> offset and mask, or to dereference hose->cfg_addr/cfg_data.
>>>>>>>>
>>>>>>>> What about if we use dm_pci_read_config32()? We should try to move
>>>>>>>> PCI access to driver model to avoid the uclass_get_device_by_seq()
>>>>>>>> everywhere.
>>>>>>>
>>>>>>> I don't think that helps. dm_pci_read_config32() requires a dm
>>>>>>> driver. MRC is just something that programs a bunch of registers
>>>>>>> with PCI config read/write calls.
>>>>>>
>>>>>> My question is really: what takes the time? It's not clear whether it
>>>>>> is the driver model overhead or something else. The code you add in
>>>>>> qrk_pci_write_config_dword() looks very similar to
>>>>>> pci_x86_read_config().
>>>>>>
>>>>>
>>>>> It is the driver model overhead. In order to get to
>>>>> pci_x86_read_config(), we need to go through a bunch of function calls
>>>>> (see above). Yes, my version is very similar to pci_x86_read_config(),
>>>>> but mine is simpler as it only needs to deal with dword size, so there
>>>>> is no need for the offset mask and switch/case. If you look at the
>>>>> Quark MRC code, there are thousands of calls to msg_port_read() and
>>>>> msg_port_write().
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If the former then perhaps we should change this. If the latter
>>>>>>>>>> then we have work to do...
>>>>>>>>>>
>>>>>>>
>>>>>>> Also for this patch, I just realized that not only does it help to
>>>>>>> reduce the boot time, but it also helps to support the PCIe root
>>>>>>> ports in future patches.
>>>>>>>
>>>>>>> By checking the Quark SoC manual, I found something that needs to be
>>>>>>> done to get the two PCIe root ports on the Quark SoC to work. In
>>>>>>> order to get the PCIe root ports to show up in the PCI configuration
>>>>>>> space, we need to program some registers in the message bus (using
>>>>>>> APIs in arch/x86/cpu/quark/msg_port.c). With driver model, if we
>>>>>>> still call pci_write_config_dword() in msg_port.c, we end up
>>>>>>> triggering the PCI enumeration process before we actually write
>>>>>>> anything to the configuration space. However, the enumeration
>>>>>>> process will hang when scanning the PCIe root port if we don't
>>>>>>> properly initialize the message bus registers. This is a
>>>>>>> chicken-and-egg problem.
>>>>>>
>>>>>> Sure, although I see that as a separate problem.
>>>>>>
>>>>>
>>>>> Yes, it is a separate problem. It came to me when I was reading the
>>>>> manual after I submitted the patch.
>>>>>
>>>>>> We can't have a driver model implementation that is really slow, so
>>>>>> I'd like to clear that up first. Of course your patch makes sense for
>>>>>> other reasons, but I don't want to play whack-a-mole here.
>>>>>>
>>>>>
>>>>> Agreed. So far driver model PCI is used on x86 boards and sandbox. The
>>>>> performance issue was not obvious on those targets, but it is quite
>>>>> noticeable on Intel Quark. These PCI config read/write routines go
>>>>> through lots of function calls before we actually touch the I/O ports
>>>>> 0xcf8 and 0xcfc, especially when the device is not on the root bus (it
>>>>> needs to go through its parent, followed by its parent's parent).
>>>>>
>>>>> But anyway, I think this optimization for Quark is needed. I doubt we
>>>>> can optimize driver model pci to such an extent.
>>>>
>>>> Can we use this as an opportunity to try a few things? If we use the
>>>> dm_pci functions that should cut out some overhead. Can you try an
>>>> experiment to see how much difference it makes?
>>>>
>>>> dm_pci_read_config32()
>>>
>>> We can't use this API as MRC is not a dm driver.
>>
>> OK, probably I need to dig in and understand this a little better. Is
>> it running pre-relocation with the early PCI stuff? We could make a
>> driver with UCLASS_RAM perhaps.
>
> Yes, it is running pre-relocation with the early PCI stuff. But I doubt
> the need to create a UCLASS_RAM for x86 targets, as most x86 targets
> use FSP to initialize the RAM. The best candidate to implement
> UCLASS_RAM that I can think of now is the Freescale DDR controller
> driver on powerpc, and recently on ARM. It supports both SPD and
> memory-down. On ARM, most RAM targets' DDR initialization is
> memory-down, I believe.
>
Some updates from today while trying to support the PCIe root ports in
v2: I moved qrk_pci_write_config_dword() and qrk_pci_read_config_dword()
from msg_port.c to quark.c and updated all code in quark.c to call these
two routines, to avoid the chicken-and-egg problem. With this change, I
noticed that the MRC execution time went from 3 seconds to 4 seconds, so
1 additional second is needed. I disassembled U-Boot and found that in
v1, since qrk_pci_write_config_dword() and qrk_pci_read_config_dword()
are declared static in msg_port.c, they are inlined by the compiler into
the APIs in msg_port.c. But with the v2 changes, they are no longer
inlined but are normal function calls, which causes the additional
second. So even inlining makes a measurable improvement for MRC, not to
mention what we could gain by avoiding the long call chains in the
driver model PCI APIs.

Regards,
Bin

_______________________________________________
U-Boot mailing list
[email protected]
http://lists.denx.de/mailman/listinfo/u-boot

