Re: [PATCH RFC] target/user: Add double ring buffers support.
>> The struct tcmu_cmd_entry {} size is fixed at 44 bytes without iovec[],
>> and the size of struct iovec[N] is about 16 bytes * N.
>>
>> The cmd entry size will be [44B, N * 16 + 44B], and the data size will
>> be [0, N * 4096], so the ratio of sizeof(cmd entry) : sizeof(entry data)
>> == (N * 16 + 44) bytes : (N * 4096) bytes
>> == (N * 16)/(N * 4096) + 44/(N * 4096)
>> == 1/256 + 11/(N * 1024).
>>
>> When N gets bigger, the ratio gets smaller. For N >= 1, the ratio falls
>> in (4/1024, 15/1024].
>>
>> So, that's right, 512M is a little bigger than actually needed. For a
>> safe ratio I think 16/1024 is enough, so if the data area is fixed at
>> 1G, the cmd area could be 16M+. Or cmd area (64M) + data area (960M) ==
>> 1G?
>
> Seems like a good ratio. You could look at growing the cmd ring when
> deciding to allocate more pages for the data area. But, growing the cmd
> ring is a little trickier (see below) so maybe have a different policy
> for deciding when & how much to grow.
>
> Changing the cmd ring size: Unlike the data area where you can just
> allocate & map more pages, for the cmd ring I think you probably want to
> complete all I/O, get cmd_head and cmd_tail back to the top with a PAD
> op, and *then* reallocate/remap the cmd ring and update
> tcmu_mailbox.cmdr_size.

Yes, that sounds nice. Growing the cmd area will be much more complex, and
this could be implemented in the future.

I will try to implement the following simpler new scheme first:

- Fixed cmd area using vmalloc(); the total size will be 64M or 128M(?)
- Growing data area using kmalloc(DATA_BLOCK_SIZE, GFP_ATOMIC); the maximum
  total size will be 1G.

Thanks,
BRs
Xiubo

> Regards -- Andy
Re: [PATCH RFC] target/user: Add double ring buffers support.
On 02/13/2017 09:50 PM, Xiubo Li wrote:
> The struct tcmu_cmd_entry {} size is fixed at 44 bytes without iovec[],
> and the size of struct iovec[N] is about 16 bytes * N.
>
> The cmd entry size will be [44B, N * 16 + 44B], and the data size will be
> [0, N * 4096], so the ratio of sizeof(cmd entry) : sizeof(entry data)
> == (N * 16 + 44) bytes : (N * 4096) bytes
> == (N * 16)/(N * 4096) + 44/(N * 4096)
> == 1/256 + 11/(N * 1024).
>
> When N gets bigger, the ratio gets smaller. For N >= 1, the ratio falls
> in (4/1024, 15/1024].
>
> So, that's right, 512M is a little bigger than actually needed. For a
> safe ratio I think 16/1024 is enough, so if the data area is fixed at 1G,
> the cmd area could be 16M+. Or cmd area (64M) + data area (960M) == 1G?

Seems like a good ratio.

You could look at growing the cmd ring when deciding to allocate more pages
for the data area. But, growing the cmd ring is a little trickier (see
below) so maybe have a different policy for deciding when & how much to
grow.

Changing the cmd ring size: Unlike the data area where you can just
allocate & map more pages, for the cmd ring I think you probably want to
complete all I/O, get cmd_head and cmd_tail back to the top with a PAD op,
and *then* reallocate/remap the cmd ring and update
tcmu_mailbox.cmdr_size.

Regards -- Andy
Re: [PATCH RFC] target/user: Add double ring buffers support.
>>> Yes I think it's now clear we need more buffer space to avoid
>>> bottlenecks for high iops. The initial design kept it simple with the
>>> 1MB vmalloc'd space but anticipated greater would be needed. It should
>>> not be necessary to change userspace or the TCMU ABI to handle growing
>>> the buffer for fast devices:
>>>
>>> 1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
>>> to 1GB or larger
>>
>> For the cmd area, set the size to a fixed 512M, and the data area's
>> size to a fixed 1G, is that okay?
>
> 512M seems a little big for the cmd area... use of the cmd ring should
> be much smaller than the data area. Each cmd has a minimum size, but
> then to describe an additional page in the iovec[] should be just one
> struct iovec (16 bytes?) for each 4K page.

The struct tcmu_cmd_entry {} size is fixed at 44 bytes without iovec[], and
the size of struct iovec[N] is about 16 bytes * N.

The cmd entry size will be [44B, N * 16 + 44B], and the data size will be
[0, N * 4096], so the ratio of sizeof(cmd entry) : sizeof(entry data)
== (N * 16 + 44) bytes : (N * 4096) bytes
== (N * 16)/(N * 4096) + 44/(N * 4096)
== 1/256 + 11/(N * 1024).

When N gets bigger, the ratio gets smaller. For N >= 1, the ratio falls in
(4/1024, 15/1024].

So, that's right, 512M is a little bigger than actually needed. For a safe
ratio I think 16/1024 is enough, so if the data area is fixed at 1G, the
cmd area could be 16M+. Or cmd area (64M) + data area (960M) == 1G?

>>> 3. Upgrade the current fixed-size bitmap-based tracking of data area
>>> to handle the new scheme
>>
>> The radix tree will be used to keep the mapping relations between the
>> block's index (0 ~ 1G/DATA_BLOCK_SIZE) and the physical pages. Each
>> leaf is one data block (the size is DATA_BLOCK_SIZE). For non-leaf
>> nodes, use the radix tags[0][SLOTs] to indicate whether slot[SLOTs]'s
>> branch has free (reused old or NULL) block leafs or not. This could
>> speed up the search for free blocks in the data area.
>
> Yes I think radix tree is a good place to start.

Okay.

Thanks,
BRs
Xiubo

> Regards -- Andy
Re: [PATCH RFC] target/user: Add double ring buffers support.
On 02/14/2017 01:42, Andy Grover wrote:
> On 02/12/2017 05:38 PM, Xiubo Li wrote:
>> On 02/11/2017 02:02, Andy Grover wrote:
>>> 1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
>>> to 1GB or larger
>>>
>>> 2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring
>>> portion, and dynamically alloc pages for the data area as needed and
>>> map them into the data area.
>>
>> Should the data area mapping be dynamic, or should we add one new API
>> for userspace to trigger it?
>
> I would recommend having the kernel make the decision to allocate/map
> more memory, or not (i.e. wait for existing I/O to complete and free
> pages), because this avoids the need for a new API, and also the kernel
> may better decide how to allocate the total pages used across multiple
> TCMU devices.

Yes, agreed.

>>> 4. Implement an algorithm to keep allocated pages mapped into the data
>>> area for reuse, and maybe a heuristic to keep extreme burstiness from
>>> over-allocating pages
>>
>> Does that sound a little like slab?
>
> It's definitely becoming less of a storage issue and more of a
> page-management issue. It might make sense to look at mm code and ask mm
> experts for their advice, yep.

Sure, I will investigate this deeply. Thanks very much.

BRs
Xiubo

>>> This should allow TCMU to allocate more data area as needed, not waste
>>> memory for slower devices, and avoid userspace ABI changes. Could we
>>> prototype this approach and see if it is workable?
>>
>> I will investigate this more.
>
> Thank you for working to improve this area!
>
> Regards -- Andy
Re: [PATCH RFC] target/user: Add double ring buffers support.
On 02/13/2017 01:50 AM, Xiubo Li wrote:
> There is one new scheme in my mind:
>
>> Yes I think it's now clear we need more buffer space to avoid
>> bottlenecks for high iops. The initial design kept it simple with the
>> 1MB vmalloc'd space but anticipated greater would be needed. It should
>> not be necessary to change userspace or the TCMU ABI to handle growing
>> the buffer for fast devices:
>>
>> 1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
>> to 1GB or larger
>
> For the cmd area, set the size to a fixed 512M, and the data area's size
> to a fixed 1G, is that okay?

512M seems a little big for the cmd area... use of the cmd ring should be
much smaller than the data area. Each cmd has a minimum size, but then to
describe an additional page in the iovec[] should be just one struct iovec
(16 bytes?) for each 4K page.

>> 3. Upgrade the current fixed-size bitmap-based tracking of data area to
>> handle the new scheme
>
> The radix tree will be used to keep the mapping relations between the
> block's index (0 ~ 1G/DATA_BLOCK_SIZE) and the physical pages. Each leaf
> is one data block (the size is DATA_BLOCK_SIZE). For non-leaf nodes, use
> the radix tags[0][SLOTs] to indicate whether slot[SLOTs]'s branch has
> free (reused old or NULL) block leafs or not. This could speed up the
> search for free blocks in the data area.

Yes I think radix tree is a good place to start.

Regards -- Andy
Re: [PATCH RFC] target/user: Add double ring buffers support.
On 02/12/2017 05:38 PM, Xiubo Li wrote:
> On 02/11/2017 02:02, Andy Grover wrote:
>> 1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
>> to 1GB or larger
>>
>> 2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring
>> portion, and dynamically alloc pages for the data area as needed and
>> map them into the data area.
>
> Should the data area mapping be dynamic, or should we add one new API
> for userspace to trigger it?

I would recommend having the kernel make the decision to allocate/map more
memory, or not (i.e. wait for existing I/O to complete and free pages),
because this avoids the need for a new API, and also the kernel may better
decide how to allocate the total pages used across multiple TCMU devices.

>> 4. Implement an algorithm to keep allocated pages mapped into the data
>> area for reuse, and maybe a heuristic to keep extreme burstiness from
>> over-allocating pages
>
> Does that sound a little like slab?

It's definitely becoming less of a storage issue and more of a
page-management issue. It might make sense to look at mm code and ask mm
experts for their advice, yep.

>> This should allow TCMU to allocate more data area as needed, not waste
>> memory for slower devices, and avoid userspace ABI changes. Could we
>> prototype this approach and see if it is workable?
>
> I will investigate this more.

Thank you for working to improve this area!

Regards -- Andy
Re: [PATCH RFC] target/user: Add double ring buffers support.
Hi Andy,

There is one new scheme in my mind:

> Yes I think it's now clear we need more buffer space to avoid bottlenecks
> for high iops. The initial design kept it simple with the 1MB vmalloc'd
> space but anticipated greater would be needed. It should not be necessary
> to change userspace or the TCMU ABI to handle growing the buffer for fast
> devices:
>
> 1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
> to 1GB or larger

For the cmd area, set the size to a fixed 512M, and the data area's size to
a fixed 1G, is that okay?

> 2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring
> portion, and dynamically alloc pages for the data area as needed and map
> them into the data area.

TCMU will just vmalloc() the 512M cmd area, and let the data area memory be
allocated and mapped later when needed.

The userspace runner will mmap() the whole (512M + 1G) when initialising,
and will get back 1.5G of virtual address space. The cmd area will be
mapped to actual physical addresses here, while the data area will be
mapped in the page fault hook when it is used.

> 3. Upgrade the current fixed-size bitmap-based tracking of data area to
> handle the new scheme

The radix tree will be used to keep the mapping relations between the
block's index (0 ~ 1G/DATA_BLOCK_SIZE) and the physical pages. Each leaf is
one data block (the size is DATA_BLOCK_SIZE). For non-leaf nodes, use the
radix tags[0][SLOTs] to indicate whether slot[SLOTs]'s branch has free
(reused old or NULL) block leafs or not. This could speed up the search for
free blocks in the data area.

> 4. Implement an algorithm to keep allocated pages mapped into the data
> area for reuse, and maybe a heuristic to keep extreme burstiness from
> over-allocating pages

For leaf nodes:

- If a leaf node exists and its tags[0][SLOTs] == 0, some older cmds have
  already touched this block leaf and then "freed" it, so we will reuse it.
- If a leaf node exists and its tags[0][SLOTs] == 1, the block is still in
  use.
- If a leaf node does not exist, this is the first time this leaf is
  touched; we will allocate memory, insert it here, and set tags[0][SLOTs]
  to 1.

> This should allow TCMU to allocate more data area as needed, not waste
> memory for slower devices, and avoid userspace ABI changes. Could we
> prototype this approach and see if it is workable?

For slower devices, it will save memory. It could also avoid changing the
userspace ABI.

Thanks,
BRs
Xiubo

> Thanks -- Regards -- Andy
Re: [PATCH RFC] target/user: Add double ring buffers support.
On 02/09/2017 10:48 PM, lixi...@cmss.chinamobile.com wrote:
> For now the tcmu is based on the UIO framework and only uses map0 with a
> fixed ring buffer size. This works fine mostly, but when we are using a
> 10GBASE NIC, the ring buffer is too small to deal with the incoming
> iscsi cmds.
>
> We could resolve the above issue by increasing the ring buffer size, but
> we don't know the exact size needed for all cases. This patch adds
> double ring buffers support by adding map1 support through the UIO
> device.

Hi Xiubo,

Yes I think it's now clear we need more buffer space to avoid bottlenecks
for high iops. The initial design kept it simple with the 1MB vmalloc'd
space but anticipated greater would be needed.

It should not be necessary to change userspace or the TCMU ABI to handle
growing the buffer for fast devices:

1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB to
1GB or larger

2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring
portion, and dynamically alloc pages for the data area as needed and map
them into the data area.

3. Upgrade the current fixed-size bitmap-based tracking of the data area
to handle the new scheme

4. Implement an algorithm to keep allocated pages mapped into the data
area for reuse, and maybe a heuristic to keep extreme burstiness from
over-allocating pages

This should allow TCMU to allocate more data area as needed, not waste
memory for slower devices, and avoid userspace ABI changes. Could we
prototype this approach and see if it is workable?

Thanks -- Regards -- Andy