Re: FW: ixg(4) performances
On Sun, Aug 31, 2014 at 12:07:38PM -0400, Terry Moore wrote:

> This is not 2.5G Transfers per second. PCIe talks about transactions
> rather than transfers; one transaction requires either 12 bytes (for
> 32-bit systems) or 16 bytes (for 64-bit systems) of overhead at the
> transaction layer, plus 7 bytes at the link layer. The maximum number of
> transactions per second paradoxically transfers the fewest number of
> bytes; a 4K write takes 16+4096+5+2 byte times, and so only about 60,000
> such transactions are possible per second (moving about 248,000,000
> bytes). [Real systems don't quite see this -- Wikipedia claims, for
> example, that 95% efficiency is typical for storage controllers.]

The gain for large transfer requests is probably minimal. There can be multiple requests outstanding at any one time (the limit is negotiated; I'm guessing that 8 and 16 are typical values). A typical PCIe DMA controller will generate multiple concurrent transfer requests, so even if the requests are only 128 bytes you can get reasonable overall throughput.

> A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
> transactions are possible per second, but those 9 million transactions
> can only move 36 million bytes.

Except that nothing will generate adequately overlapped short transfers. The real performance killer is CPU PIO cycles. Every one that the driver does will hit the throughput -- the CPU will be spinning for a long, long time (think ISA bus speeds). A side effect of this is that PCI-PCIe bridges (either way) are doomed to be very inefficient.

> Multiple lanes scale things fairly linearly. But there has to be one byte
> per lane; a x8 configuration means that physical transfers are padded so
> that each 4-byte write (which takes 27 bytes on the bus) has to take 32
> bytes. Instead of getting 72 million transactions per second, you get
> 62.5 million transactions/second, so it doesn't scale as nicely.

I think that individual PCIe transfer requests always use a single lane; multiple lanes help if you have multiple concurrent transfers. So different chunks of an ethernet frame can be transferred in parallel over multiple lanes, with the transfer not completing until all the individual parts complete. That means the ring status transfer can't be scheduled until all the other data fragment transfers have completed.

I also believe that PCIe transfers are inherently 64-bit; there are byte-enables indicating which bytes of the first and last 64-bit words are actually required.

The real thing to remember about PCIe is that it is a comms protocol, not a bus protocol. It is high throughput, high latency. I've had 'fun' getting even moderate PCIe throughput into an FPGA.

	David

-- 
David Laight: da...@l8s.co.uk
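As a sanity check on the arithmetic in the quoted paragraphs, here is a small back-of-the-envelope calculation in C. It is only a sketch, using the overhead figures from the post (16 bytes of transaction-layer overhead for a 64-bit memory write plus 5+2 bytes at the link layer) and the 250e6 bytes/sec that one gen1 lane carries after 8b/10b encoding:

    /*
     * Back-of-the-envelope PCIe gen1 x1 throughput, using the overhead
     * figures from the post: 16 bytes at the transaction layer for a
     * 64-bit memory write, plus 5+2 bytes at the link layer.  One gen1
     * lane moves 250e6 bytes/sec of symbols after 8b/10b encoding.
     */
    #include <stdio.h>

    int main(void)
    {
        const double lane_bytes_per_sec = 250e6;    /* 2.5 GT/s, 8b/10b */
        const int overhead = 16 + 5 + 2;            /* TLP + link layer */
        const int payloads[] = { 4, 128, 256, 4096 };
        int i;

        for (i = 0; i < (int)(sizeof(payloads) / sizeof(payloads[0])); i++) {
            double tps = lane_bytes_per_sec / (payloads[i] + overhead);
            printf("payload %4d: %10.0f transactions/s, %6.1f MB/s of data\n",
                payloads[i], tps, tps * payloads[i] / 1e6);
        }
        return 0;
    }

For 4096-byte writes this lands on roughly 60,700 transactions/second and about 249 MB/s of payload; for 4-byte writes, about 9.3 million transactions/second moving only 37 MB/s -- matching the ~60,000 and ~9 million figures quoted above.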
RE: FW: ixg(4) performances
-----Original Message-----
From: Hisashi T Fujinaka [mailto:ht...@twofifty.com]
Sent: Sunday, August 31, 2014 12:39
To: Terry Moore
Cc: tech-kern@netbsd.org
Subject: RE: FW: ixg(4) performances

> I may be wrong in the transactions/transfers. However, I think you're
> reading the page incorrectly. The signalling rate is the physical speed
> of the link. On top of that is the 8/10 encoding (the Ethernet controller
> we're talking about is only Gen 2), the framing, etc, and the spec
> discusses the data rate in GT/s. Gb/s means nothing.

Hi, I see what the dispute is. The PCIe 3.0 spec *nowhere* uses "transfer" in the sense of unit intervals (it says "UNIT INTERVAL"). "Transfer" is used mostly in the sense of a larger message; it's not in the glossary, etc.

However, you are absolutely right: many places in the industry (Intel, Wikipedia, etc.) use 2.5 GT/s to mean 2.5G unit intervals per second, and it's common enough that it's the de facto standard terminology. I only deal with the spec, and don't pay that much attention to the ancillary material.

We still disagree about something: GT/s does mean *something* in terms of the raw throughput of the link. It tells you the absolute upper limit of the channel (ignoring protocol overhead). If the upper limit is too low, you can stop worrying about fine points. If you're only trying to push 10% (for example) and you're not getting it, you look for protocol problems. At 50%, you say "hmm, we could be hitting the channel capacity."

I think we were violently agreeing, because the technical content of what I was writing was (modulo possible typos) identical to 2.5 GT/s. 2.5 GT/s is (after 8/10 encoding) a max raw data rate of 250e6 bytes/sec, and when you go through things and account for overhead, it's very possible that 8 lanes (max 2e9 bytes/sec) of gen1 won't be fast enough for 10G Ethernet.

Best regards,
--Terry
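The conversion Terry is doing here can be written out explicitly. A sketch, assuming only the standard line encodings (8b/10b for gen1/gen2, 128b/130b for gen3):

    /*
     * Raw link capacity implied by a GT/s figure, before any protocol
     * overhead.  Gen1/gen2 use 8b/10b encoding (10 symbol bits per data
     * byte); gen3 uses 128b/130b.
     */
    #include <stdio.h>

    int main(void)
    {
        /* transfers/sec, data fraction of the encoding, name */
        struct { double gts, eff; const char *name; } gen[] = {
            { 2.5e9, 8.0 / 10.0,    "gen1" },
            { 5.0e9, 8.0 / 10.0,    "gen2" },
            { 8.0e9, 128.0 / 130.0, "gen3" },
        };
        int i;

        for (i = 0; i < 3; i++) {
            double bytes = gen[i].gts * gen[i].eff / 8;
            printf("%s: %6.1f MB/s per lane, %7.1f MB/s for x8\n",
                gen[i].name, bytes / 1e6, bytes * 8 / 1e6);
        }
        return 0;
    }

Gen1 x8 thus tops out at 2e9 bytes/sec raw, which is why the margin over 10GbE (1.25e9 bytes/sec of payload) is thin once transaction overhead is added.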
RE: FW: ixg(4) performances
From Emmanuel Dreyfus:

> You are right;
>
> # pcictl /dev/pci5 read -d 0 -f 1 0xa8
> 00092810
> # pcictl /dev/pci5 write -d 0 -f 1 0xa8 0x00094810
> # pcictl /dev/pci5 read -d 0 -f 1 0xa8
> 4810

That's reassuring. The dump confirms that we're looking at the right registers, thank you.

As I read the spec, 0x4810 in the Device Control Register means:

  Max_Read_Request_Size: 100b -- 2048 bytes
  Enable_No_Snoop: 1
  Max_Payload_Size: 000b -- 128 bytes
  Enable_Relaxed_Ordering: 1

with all other options turned off. I think you should try:

  Max_Read_Request_Size: 100b -- 2048 bytes
  Enable_No_Snoop: 1
  Max_Payload_Size: 100b -- 2048 bytes
  Enable_Relaxed_Ordering: 1

This would give 0x4890 as the value, not 0x4810.

It's odd that the BIOS set the Max_Payload_Size to 000b. It's possible that this indicates that the root complex has some limitations, or it could be a buggy or excessively conservative BIOS. (It's safer to program add-in boards conservatively -- fewer support calls due to dead systems. Or something like that.) So you may have to experiment.

This would also explain why you saw 2.5 Gb/s before and 2.7 Gb/s after: you increased the max *read* size, but not the max *write* size. Increasing the read request size from 2048 to 4096 would improve read throughput, but not enormously. It depends, of course, on your benchmark.

--Terry
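Decodes like this can be double-checked mechanically with a few lines of C. A sketch from the standard Device Control Register layout (Max_Payload_Size in bits 7:5, Max_Read_Request_Size in bits 14:12, both encoded as 128 << n; decode_devctl is just an illustrative name):

    /*
     * Decode the PCIe Device Control Register fields discussed above.
     * Both size fields use the same encoding: 000b = 128 bytes up to
     * 101b = 4096 bytes, i.e. 128 << n.
     */
    #include <stdio.h>
    #include <stdint.h>

    static void decode_devctl(uint16_t v)
    {
        printf("devctl 0x%04x: MPS %u, MRRS %u, relaxed ordering %s, no snoop %s\n",
            v,
            128u << ((v >> 5) & 7),             /* Max_Payload_Size, bits 7:5 */
            128u << ((v >> 12) & 7),            /* Max_Read_Request_Size, bits 14:12 */
            (v & (1u << 4)) ? "on" : "off",     /* Enable_Relaxed_Ordering */
            (v & (1u << 11)) ? "on" : "off");   /* Enable_No_Snoop */
    }

    int main(void)
    {
        decode_devctl(0x2810);  /* as read before the write above */
        decode_devctl(0x4810);  /* after the write: MRRS 2048, MPS still 128 */
        decode_devctl(0x4890);  /* the suggested value: MPS raised to 2048 */
        return 0;
    }

Run against the values in this thread, 0x2810 decodes to MRRS 512 / MPS 128, 0x4810 to MRRS 2048 / MPS 128, and 0x4890 to MRRS 2048 / MPS 2048.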
RE: FW: ixg(4) performances
-----Original Message-----
From: Hisashi T Fujinaka [mailto:ht...@twofifty.com]
Sent: Saturday, August 30, 2014 21:29
To: Terry Moore
Cc: tech-kern@netbsd.org
Subject: Re: FW: ixg(4) performances

> Doesn't anyone read my posts or, more important, the PCIe spec? 2.5 Giga
> TRANSFERS per second.

I'm not sure I understand what you're saying. From the PCIe spec, page 40:

  Signaling rate - Once initialized, each Link must only operate at one of
  the supported signaling levels. For the first generation of PCI Express
  technology, there is only one signaling rate defined, which provides an
  effective 2.5 Gigabits/second/Lane/direction of raw bandwidth. The second
  generation provides an effective 5.0 Gigabits/second/Lane/direction of
  raw bandwidth. The third generation provides an effective 8.0
  Gigabits/second/Lane/direction of raw bandwidth. The data rate is
  expected to increase with technology advances in the future.

This is not 2.5G Transfers per second. PCIe talks about transactions rather than transfers; one transaction requires either 12 bytes (for 32-bit systems) or 16 bytes (for 64-bit systems) of overhead at the transaction layer, plus 7 bytes at the link layer.

The maximum number of transactions per second paradoxically transfers the fewest number of bytes; a 4K write takes 16+4096+5+2 byte times, and so only about 60,000 such transactions are possible per second (moving about 248,000,000 bytes). [Real systems don't quite see this -- Wikipedia claims, for example, that 95% efficiency is typical for storage controllers.] A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million transactions are possible per second, but those 9 million transactions can only move 36 million bytes.

Multiple lanes scale things fairly linearly. But there has to be one byte per lane; a x8 configuration means that physical transfers are padded so that each 4-byte write (which takes 27 bytes on the bus) has to take 32 bytes. Instead of getting 72 million transactions per second, you get 62.5 million transactions per second, so it doesn't scale as nicely.

Reads are harder to analyze, because they depend on the speed and design of both ends of the link. The reader sends a read request packet, and the read responder (some time later) sends back the response.

As far as I can see, even at gen3 with lots of lanes, PCIe doesn't scale to 2.5G transfers per second.

Best regards,
--Terry
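The lane-padding arithmetic in the x8 paragraph can be sketched the same way, assuming (per the figures above) that a 4-byte write costs 27 byte times and that a physical transfer is rounded up to a whole number of bytes per lane:

    /*
     * Lane-padding effect on small writes: the 27 bytes of a 4-byte
     * write are rounded up to a multiple of the lane count, so wide
     * links don't scale small transactions linearly.
     */
    #include <stdio.h>

    int main(void)
    {
        const int raw = 16 + 4 + 5 + 2;     /* 4-byte write: 27 bytes */
        const double lane_rate = 250e6;     /* gen1 bytes/sec per lane */
        const int lanes[] = { 1, 4, 8, 16 };
        int i;

        for (i = 0; i < 4; i++) {
            int n = lanes[i];
            int padded = (raw + n - 1) / n * n; /* round up to lane multiple */
            printf("x%-2d: %2d -> %2d bytes on the wire, %5.1fM transactions/s\n",
                n, raw, padded, lane_rate * n / padded / 1e6);
        }
        return 0;
    }

For x8 this reproduces the 62.5 million transactions/second figure (32 padded bytes against 2e9 bytes/sec of raw capacity); a single lane manages about 9.3 million.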
RE: FW: ixg(4) performances
I may be wrong on transactions vs. transfers. However, I think you're reading the page incorrectly. The signalling rate is the physical speed of the link. On top of that are the 8/10 encoding (the Ethernet controller we're talking about is only Gen 2), the framing, etc., and the spec discusses the data rate in GT/s. Gb/s means nothing. It's like talking about the frequency of the Ethernet link, which we never do; we talk about how much data can be transferred.

I'm also not sure if you've looked at an actual trace before, but a PCIe link is incredibly chatty, and every transfer only has a payload of 64/128/256 bytes (especially on the actual controller in question). Those two things together (a GT/s-rated chatty link with small packets) mean that talking about things in Gb/s is not something used by people who talk about PCIe every day (my day job). The signalling rate is not used when talking about the max data transfer rate.

On Sun, 31 Aug 2014, Terry Moore wrote:

[Terry's message quoted in full; see above.]

-- 
Hisashi T Fujinaka - ht...@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
RE: FW: ixg(4) performances
Oh, and to answer the actual relevant question first: I can try finding out whether we (day job, 82599) can do line rate at 2.5 GT/s. I think we can get a lot closer than you're getting, but we don't test with NetBSD.

-- 
Hisashi T Fujinaka - ht...@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
Re: FW: ixg(4) performances
Terry Moore <t...@mcci.com> wrote:

> Since you did a dword read, the extra 0x9 is the device status register.
> This makes me suspicious, as the device status register is claiming that
> you have "unsupported request detected" [bit 3] and "correctable error
> detected" [bit 0]. Further, this register is RW1C for all these bits --
> so when you write 94810, it should have cleared the 9 (so a subsequent
> read should have returned 4810). Please check.

You are right;

# pcictl /dev/pci5 read -d 0 -f 1 0xa8
00092810
# pcictl /dev/pci5 write -d 0 -f 1 0xa8 0x00094810
# pcictl /dev/pci5 read -d 0 -f 1 0xa8
4810

> Might be good to post a pcictl dump of your device, just to expose all
> the details.

It explicitly says 2.5 Gb/s x 8 lanes:

# pcictl /dev/pci5 dump -d0 -f 1
PCI configuration registers:
  Common header:
    0x00: 0x10fb8086 0x00100107 0x0201 0x00800010
    Vendor Name: Intel (0x8086)
    Device Name: 82599 (SFI/SFP+) 10 GbE Controller (0x10fb)
    Command register: 0x0107
      I/O space accesses: on
      Memory space accesses: on
      Bus mastering: on
      Special cycles: off
      MWI transactions: off
      Palette snooping: off
      Parity error checking: off
      Address/data stepping: off
      System error (SERR): on
      Fast back-to-back transactions: off
      Interrupt disable: off
    Status register: 0x0010
      Interrupt status: inactive
      Capability List support: on
      66 MHz capable: off
      User Definable Features (UDF) support: off
      Fast back-to-back capable: off
      Data parity error detected: off
      DEVSEL timing: fast (0x0)
      Slave signaled Target Abort: off
      Master received Target Abort: off
      Master received Master Abort: off
      Asserted System Error (SERR): off
      Parity error detected: off
    Class Name: network (0x02)
    Subclass Name: ethernet (0x00)
    Interface: 0x00
    Revision ID: 0x01
    BIST: 0x00
    Header Type: 0x00+multifunction (0x80)
    Latency Timer: 0x00
    Cache Line Size: 0x10
  Type 0 (normal device) header:
    0x10: 0xdfe8000c 0x 0xbc01 0x
    0x20: 0xdfe7c00c 0x 0x 0x00038086
    0x30: 0x 0x0040 0x 0x0209
    Base address register at 0x10
      type: 64-bit prefetchable memory
      base: 0xdfe8, not sized
    Base address register at 0x18
      type: i/o
      base: 0xbc00, not sized
    Base address register at 0x1c
      not implemented(?)
    Base address register at 0x20
      type: 64-bit prefetchable memory
      base: 0xdfe7c000, not sized
    Cardbus CIS Pointer: 0x
    Subsystem vendor ID: 0x8086
    Subsystem ID: 0x0003
    Expansion ROM Base Address: 0x
    Capability list pointer: 0x40
    Reserved @ 0x38: 0x
    Maximum Latency: 0x00
    Minimum Grant: 0x00
    Interrupt pin: 0x02 (pin B)
    Interrupt line: 0x09
  Capability register at 0x40
    type: 0x01 (Power Management, rev. 1.0)
  Capability register at 0x50
    type: 0x05 (MSI)
  Capability register at 0x70
    type: 0x11 (MSI-X)
  Capability register at 0xa0
    type: 0x10 (PCI Express)
  PCI Message Signaled Interrupt
    Message Control register: 0x0180
      MSI Enabled: no
      Multiple Message Capable: no (1 vector)
      Multiple Message Enabled: off (1 vector)
      64 Bit Address Capable: yes
      Per-Vector Masking Capable: yes
    Message Address (lower) register: 0x
    Message Address (upper) register: 0x
    Message Data register: 0x
    Vector Mask register: 0x
    Vector Pending register: 0x
  PCI Power Management Capabilities Register
    Capabilities register: 0x4823
      Version: 1.2
      PME# clock: off
      Device specific initialization: on
      3.3V auxiliary current: self-powered
      D1 power management state support: off
      D2 power management state support: off
      PME# support: 0x09
    Control/status register: 0x2000
      Power state: D0
      PCI Express reserved: off
      No soft reset: off
      PME# assertion disabled
      PME# status: off
  PCI Express Capabilities Register
    Capability version: 2
    Device type: PCI Express Endpoint device
    Interrupt Message Number: 0
    Link Capabilities Register: 0x00027482
      Maximum Link Speed: unknown 2 value
      Maximum Link Width: x8 lanes
      Port Number: 0
    Link Status Register: 0x1081
      Negotiated Link Speed: 2.5Gb/s
      Negotiated Link Width: x8 lanes
  Device-dependent header:
    0x40: 0x48235001 0x2b002000 0x 0x
    0x50: 0x01807005 0x 0x 0x
    0x60: 0x 0x 0x 0x
    0x70: 0x003fa011 0x0004 0x2004 0x
    0x80: 0x 0x 0x 0x
    0x90: 0x 0x 0x 0x
    0xa0: 0x00020010 0x10008cc2 0x4810 0x00027482
    0xb0: 0x1081 0x 0x 0x
    0xc0: 0x 0x001f
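For completeness, here is roughly what pcictl's read command does under the hood. This is a sketch against the ioctl interface declared in NetBSD's <dev/pci/pciio.h>, as I remember it -- check your own pciio.h for the exact structure and ioctl names, and note that the bus number behind /dev/pci5 is an assumption here (it need not be bus 5):

    /*
     * Rough equivalent of "pcictl /dev/pci5 read -d 0 -f 1 0xa8".
     * Uses the PCI_IOC_BDF_CFGREAD ioctl from <dev/pci/pciio.h>.
     */
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <dev/pci/pcireg.h>
    #include <dev/pci/pciio.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct pciio_bdf_cfgreg r;
        int fd;

        fd = open("/dev/pci5", O_RDWR);
        if (fd == -1)
            err(1, "open /dev/pci5");

        r.bus = 5;              /* assumed; see the caveat above */
        r.device = 0;
        r.function = 1;
        r.cfgreg.reg = 0xa8;    /* dword holding Device Control/Status */

        if (ioctl(fd, PCI_IOC_BDF_CFGREAD, &r) == -1)
            err(1, "PCI_IOC_BDF_CFGREAD");

        printf("0x%08x\n", r.cfgreg.val);
        return 0;
    }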
Re: FW: ixg(4) performances
Hi, Emmanuel.

On 2014/09/01 11:10, Emmanuel Dreyfus wrote:

[full quote of Emmanuel's message and pcictl dump trimmed; see above. The relevant lines:]

>   PCI Express Capabilities Register
>     Link Capabilities Register: 0x00027482
>       Maximum Link Speed: unknown 2 value
>       Maximum Link Width: x8 lanes
>       Port Number: 0
>     Link Status Register: 0x1081
>       Negotiated Link Speed: 2.5Gb/s

* Which version of NetBSD are you using? I committed some changes fixing "Gb/s" to "GT/s" in pci_subr.c; that was in April 2013. I suspect you are using netbsd-6, or you are using -current with an old /usr/lib/libpci.so.
Re: FW: ixg(4) performances
Doesn't anyone read my posts or, more important, the PCIe spec? 2.5 Giga TRANSFERS per second.

On Sat, 30 Aug 2014, Terry Moore wrote:

> Forgot to cc the list.
>
> -----Original Message-----
> From: Terry Moore [mailto:t...@mcci.com]
> Sent: Friday, August 29, 2014 15:13
> To: 'Emmanuel Dreyfus'
> Subject: RE: ixg(4) performances
>
>>> But it's running at gen1. I strongly suspect that the benchmark case
>>> was gen2 (since the ixg is capable of it).
>>
>> gen1 vs gen2 is 2.5 Gb/s vs 5 Gb/s?
>
> Yes. Actually, 2.5 Gbps is the symbol rate -- it's 8/10 encoded, so one
> lane is really 2 Gbps. So 8 lanes is 16 Gbps, which *should* be enough,
> but... there's overhead and a variety of sources of wastage. I just saw
> today a slide that says that 8 lanes of gen1 is just barely enough for
> 10Gb Ethernet
> (http://www.eetimes.com/document.asp?doc_id=1323695&page_number=4).
> It's possible that the benchmark system was using 8 lanes of gen2.
>
> Another possibility: earlier you wrote:
>
>> No reference to MMRBC in this document, but I understand Max Read
>> Request Size is the same thing. Page 765 tells us about register A8,
>> bits 12-14 that should be set to 100.
>>
>> pcictl /dev/pci5 read -d 0 -f 1 0x18 tells me the value 0x00092810
>>
>> I tried this command:
>> pcictl /dev/pci5 write -d 0 -f 1 0x18 0x00094810
>
> In the PCIe spec, this is controlled via the Device Control register,
> which is a 16-bit value. You want to set *two* fields, to different
> values.
>
> 0x18 looks like the wrong offset. The PCIe spec says offset 0x08, but
> that's relative to the base of the capability structure, the offset of
> which is in the low byte of the dword at 0x34. I am running NetBSD 5, and
> my pcictl doesn't support write as one of its options, but I'd expect the
> offset to be relative to the base of the function's config space, and
> *that's not the device capabilities register* -- 0x18 is a read/write
> register, one of the Base Address Registers.
>
> In any case, in the Device Control Register, bits 7:5 are the max payload
> size *for writes* by the ixg to the system. These must be set to 101b for
> a 4K max payload size. Similarly, bits 14:12 are the max read request
> size. These must also be set to 101b for a 4K max read request.
>
> Since you did a dword read, the extra 0x9 is the device status register.
> This makes me suspicious, as the device status register is claiming that
> you have "unsupported request detected" [bit 3] and "correctable error
> detected" [bit 0]. Further, this register is RW1C for all these bits --
> so when you write 94810, it should have cleared the 9 (so a subsequent
> read should have returned 4810). Please check.
>
> Might be good to post a pcictl dump of your device, just to expose all
> the details.
>
>>> Is the ixg in an expansion slot or integrated onto the main board?
>>
>> In a slot.
>
> Check the manual on the main board and find out whether other slots have
> 8 lanes of gen2. If so, move the board.
>
> Best regards,
> --Terry

-- 
Hisashi T Fujinaka - ht...@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
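Terry's recipe for locating the Device Control register -- follow the capability pointer at config offset 0x34 until the PCI Express capability (ID 0x10) turns up, then add 0x08 -- looks like this in C. A sketch only: read_cfg32() is a hypothetical config-space accessor standing in for pcictl or an ioctl, and no real device handle is shown.

    /*
     * Walk the PCI capability list to find the Device Control register.
     * The capability list pointer lives in the low byte of the dword at
     * 0x34; each capability starts with an ID byte and a next pointer.
     */
    #include <stdint.h>

    extern uint32_t read_cfg32(unsigned off);   /* hypothetical accessor */

    unsigned find_devctl(void)
    {
        unsigned off = read_cfg32(0x34) & 0xfc; /* capability list pointer */

        while (off != 0) {
            uint32_t hdr = read_cfg32(off);
            if ((hdr & 0xff) == 0x10)           /* PCI Express capability */
                return off + 0x08;              /* Device Control offset */
            off = (hdr >> 8) & 0xfc;            /* next capability pointer */
        }
        return 0;                               /* no PCIe capability found */
    }

On Emmanuel's dump this walks 0x40 (power management) -> 0x50 (MSI) -> 0x70 (MSI-X) -> 0xa0 (PCI Express), ending at 0xa8 -- the register the pcictl commands above read and write.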