Hi Simon, On Tue, Sep 1, 2015 at 11:12 AM, Simon Glass <[email protected]> wrote: > Hi Bin, > > On 31 August 2015 at 21:04, Bin Meng <[email protected]> wrote: >> Hi Simon, >> >> On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <[email protected]> wrote: >>> Hi Bin, >>> >>> On 31 August 2015 at 19:40, Bin Meng <[email protected]> wrote: >>>> Hi Simon, >>>> >>>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <[email protected]> wrote: >>>>> Hi Bin, >>>>> >>>>> On 31 August 2015 at 08:04, Bin Meng <[email protected]> wrote: >>>>>> Hi Simon, >>>>>> >>>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <[email protected]> wrote: >>>>>>> Hi Bin, >>>>>>> >>>>>>> On 31 August 2015 at 07:43, Bin Meng <[email protected]> wrote: >>>>>>>> Hi Simon, >>>>>>>> >>>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <[email protected]> wrote: >>>>>>>>> Hi Bin, >>>>>>>>> >>>>>>>>> On 31 August 2015 at 03:52, Bin Meng <[email protected]> wrote: >>>>>>>>>> Boot time performance degradation is observed with the conversion >>>>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with >>>>>>>>>> only 400MHz frequency and the most time consuming part is with MRC. >>>>>>>>>> Each MRC register programming requires indirect access via pci bus. >>>>>>>>>> With dm pci, accessing pci configuration space has some overhead. >>>>>>>>>> Unfortunately this single access overhead gets accumulated in the >>>>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds) >>>>>>>>>> than before (12 seconds). >>>>>>>>>> >>>>>>>>>> To speed up the boot, create an optimized version of pci config >>>>>>>>>> read/write routines without bothering to go through driver model. >>>>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really >>>>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before). >>>>>>>>>> >>>>>>>>>> Signed-off-by: Bin Meng <[email protected]> >>>>>>>>>> --- >>>>>>>>>> >>>>>>>>>> arch/x86/cpu/quark/msg_port.c | 59 >>>>>>>>>> +++++++++++++++++++++++++++---------------- >>>>>>>>>> 1 file changed, 37 insertions(+), 22 deletions(-) >>>>>>>>> >>>>>>>>> Before I delve into the patch - with driver model we are using the I/O >>>>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown >>>>>>>>> or is it just general driver model overhead. >>>>>>>> >>>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR >>>>>>>> controller. Inside msg_port.c, pci_write_config_dword() and >>>>>>>> pci_read_config_dword() are called. >>>>>>>> >>>>>>>> With driver model, the overhead is: >>>>>>>> >>>>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config() >>>>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally >>>>>>>> call pci_x86_read_config(). >>>>>>>> >>>>>>>> Without driver model, there is still some overhead (so previously the >>>>>>>> MRC time was about 12 seconds) >>>>>>>> >>>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() -> >>>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0) >>>>>>>> >>>>>>>> With my optimized version, pci_write_config_dword() directly calls a >>>>>>>> hardcoded dword size pci config access, without the need to consider >>>>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data. >>>>>>> >>>>>>> What about if we use dm_pci_read_config32()? We should try to move PCI >>>>>>> access to driver model to avoid the uclass_get_device_by_seq() >>>>>>> everywhere. >>>>>> >>>>>> I don't think that helps. dm_pci_read_config32() requires a dm driver. >>>>>> MRC is just something that program a bunch of registers with pci >>>>>> config rw call. >>>>> >>>>> My question is really what takes the time? It's not clear whether it >>>>> is the driver model overhead or something else. The code you add in >>>>> qrk_pci_write_config_dword() looks very similar to >>>>> pci_x86_read_config(). >>>>> >>>> >>>> It is the driver model overhead. In order to get to >>>> pci_x86_read_config(), we need go through a bunch of function calls >>>> (see above). Yes, my version is very similar to pci_x86_read_config(), >>>> but my version is more simpler as it only needs to deal with dword >>>> size thus no need to do offset mask and switch/case. If you look at >>>> the Quark MRC codes, there are thousands of calls to msg_port_read() >>>> and msg_port_write(). >>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> If the former then perhaps we should change this. If the latter then >>>>>>>>> we have work to do... >>>>>>>>> >>>>>> >>>>>> Also for this patch, I just realized that not only it helps to reduce >>>>>> the boot time, but also it helps to support PCIe root port in future >>>>>> patches. >>>>>> >>>>>> By checking the Quark SoC manual, I found something that needs to be >>>>>> done to get the two PCIe root ports on Quark SoC to work. In order to >>>>>> get PCIe root ports show up in the PCI configuration space, we need >>>>>> program some registers in the message bus (using APIs in >>>>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call >>>>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI >>>>>> enumeration process first before we actually write something to the >>>>>> configuration space, however the enumeration process will hang when >>>>>> scanning to the PCIe root port if we don't properly initialize the >>>>>> message bus registers. This is a chicken-egg problem. >>>>> >>>>> Sure, although I see that as a separate problem. >>>>> >>>> >>>> Yes, it is a separate problem. It came to me when I was reading the >>>> manual after I submitted the patch. >>>> >>>>> We can't have a driver model implementation that is really slow, so >>>>> I'd like to clear that up first. Of course your patch makes sense for >>>>> other reasons, but I don't want to play whack-a-mole here. >>>>> >>>> >>>> Agreed. So far the driver model PCI is used on x86 boards and sandbox. >>>> The performance issue was not obvious on these targets, but it is >>>> quite noticeable on Intel Quark. These PCI config read/write routines >>>> will go through lots of function calls before we actually touch the >>>> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the >>>> root bus (it needs to go through its parent followed by its parent's >>>> parent). >>>> >>>> But anyway I think this optimization for Quark is needed. I doubt we >>>> can optimize driver model pci to such an extent. >>> >>> Can we use this as an opportunity to try a few things? If we use the >>> dm_pci functions that should cut out some overhead. Can you try an >>> experiment to see how much difference it makes? >>> >>> dm_pci_read_config32() >> >> We can't use this API as MRC is not a dm driver. > > OK, probably I need to dig in and understand this a little better. Is > it running pre-relocation with the early PCI stuff? We could make a > driver with UCLASS_RAM perhaps.
Yes, it is running pre-relocation with the early PCI stuff. But I doubt the need to create a UCLASS_RAM for x86 targets as most x86 targets uses FSP to initialize the RAM. The best candidate to implement UCLASS_RAM that I can think of now is the Freescale DDR controller driver on powerpc, and on ARM recently. It supports both SPD and memory-down. On ARM, most RAM targets' DDR initialization is memory-down I believe. > >> >>> pci_get_bdf() >>> pci_read_config() >>> for loop which probably terminates immediately >>> pci_bus_read_config() >>> read_config(), which is pci_x86_read_config() >>> >>> So it's not great but it doesn't look too bad. >>> >>> Also is this an Intel Gallileo gen 2 development board? I'm thinking >>> of getting one. >>> >> >> Yes, this is an Intel Galileo gen2 development board. Although there >> is an gen1 board in the past and the same u-boot.rom can boot on both >> gen1 and gen2 board, Intel is now shipping only gen2 board. > > OK I've ordered one to try out. > >> >>>> >>>>> Also we should start to move things away from the non-driver-model pci >>>>> functions. > Regards, Bin _______________________________________________ U-Boot mailing list [email protected] http://lists.denx.de/mailman/listinfo/u-boot

