Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-14 Thread Xiubo Li

The struct tcmu_cmd_entry {} size is fixed at 44 bytes without iovec[], and
the size of struct iovec[N] is about 16 bytes * N.

The cmd entry size will range over [44B, N * 16 + 44B], and the data size
over [0, N * 4096B], so the ratio sizeof(cmd entry) : sizeof(entry data) ==
(N * 16 + 44) Bytes : (N * 4096) Bytes == (N * 16)/(N * 4096) +
44/(N * 4096) == 1/256 + 11/(N * 1024).
The bigger N is, the smaller the ratio. For N >= 1, the ratio falls in
(4/1024, 15/1024].

So, that's right, 512M is a little bigger than actually needed. For a
safe ratio I think 16/1024 is enough: if the data area is fixed at 1G,
the cmd area could be 16M or a bit more.
Or cmd area (64M) + data area (960M) == 1G?

Seems like a good ratio. You could look at growing the cmd ring when
deciding to allocate more pages for the data area. But, growing the cmd
ring is a little trickier (see below) so maybe have a different policy
for deciding when & how much to grow.

Changing the cmd ring size: Unlike the data area, where you can just
allocate & map more pages, for the cmd ring I think you probably want to
complete all I/O, get cmd_head and cmd_tail back to the top with a PAD
op, and *then* reallocate/remap the cmd ring and update
tcmu_mailbox.cmdr_size.

Yes, that sounds nice. Growing the cmd area will be much more
complex, and this could be implemented in the future.

I will try to implement the following simple new scheme first:
- Fixed cmd area using vmalloc(); the total size will be 64M or 128M(?)
- Growing data area using kmalloc(DATA_BLOCK_SIZE, GFP_ATOMIC); the
  maximum total size will be 1G.

Thanks,

BRs
Xiubo


Regards -- Andy

Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-14 Thread Andy Grover
On 02/13/2017 09:50 PM, Xiubo Li wrote:
> The struct tcmu_cmd_entry {} size is fixed at 44 bytes without iovec[], and
> the size of struct iovec[N] is about 16 bytes * N.
> 
> The cmd entry size will range over [44B, N * 16 + 44B], and the data size
> over [0, N * 4096B], so the ratio sizeof(cmd entry) : sizeof(entry data) ==
> (N * 16 + 44) Bytes : (N * 4096) Bytes == (N * 16)/(N * 4096) +
> 44/(N * 4096) == 1/256 + 11/(N * 1024).
> The bigger N is, the smaller the ratio. For N >= 1, the ratio falls in
> (4/1024, 15/1024].
> 
> So, that's right, 512M is a little bigger than actually needed. For a
> safe ratio I think 16/1024 is enough: if the data area is fixed at 1G,
> the cmd area could be 16M or a bit more.
> Or cmd area (64M) + data area (960M) == 1G?

Seems like a good ratio. You could look at growing the cmd ring when
deciding to allocate more pages for the data area. But, growing the cmd
ring is a little trickier (see below) so maybe have a different policy
for deciding when & how much to grow.

Changing the cmd ring size: Unlike the data area, where you can just
allocate & map more pages, for the cmd ring I think you probably want to
complete all I/O, get cmd_head and cmd_tail back to the top with a PAD
op, and *then* reallocate/remap the cmd ring and update
tcmu_mailbox.cmdr_size.

Regards -- Andy


Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-13 Thread Xiubo Li



Yes I think it's now clear we need more buffer space to avoid
bottlenecks for high IOPS. The initial design kept it simple with the
1MB vmalloc'd space, but anticipated that more would be needed. It
should not be necessary to change userspace or the TCMU ABI to handle
growing the buffer for fast devices:

1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
to 1GB or larger

For the cmd area, set the size to a fixed 512M, and the data area's size
to a fixed 1G. Is that okay?


512M seems a little big for the cmd area... use of the cmd ring should
be much smaller than the data area. Each cmd has a minimum size, but
beyond that, describing an additional page in the iovec[] should take
just one struct iovec (16 bytes?) per 4K page.




The struct tcmu_cmd_entry {} size is fixed at 44 bytes without iovec[],
and the size of struct iovec[N] is about 16 bytes * N.


The cmd entry size will range over [44B, N * 16 + 44B], and the data size
over [0, N * 4096B], so the ratio sizeof(cmd entry) : sizeof(entry data) ==
(N * 16 + 44) Bytes : (N * 4096) Bytes == (N * 16)/(N * 4096) +
44/(N * 4096) == 1/256 + 11/(N * 1024).
The bigger N is, the smaller the ratio. For N >= 1, the ratio falls in
(4/1024, 15/1024].


So, that's right, 512M is a little bigger than actually needed. For a
safe ratio I think 16/1024 is enough: if the data area is fixed at 1G,
the cmd area could be 16M or a bit more.

Or cmd area (64M) + data area (960M) == 1G?



3. Upgrade the current fixed-size bitmap-based tracking of data area
to handle the new scheme

A radix tree will be used to keep the mapping between block index (0 ~
1G/DATA_BLOCK_SIZE) and physical page. Each leaf is one data block (of
size DATA_BLOCK_SIZE).

For non-leaf nodes, use the radix tags[0][SLOTs] to indicate whether
slot[SLOTs]'s branch has free block leafs (reused old ones or NULL
leafs) or not.
This could speed up the search for free blocks in the data area.


Yes I think radix tree is a good place to start.


Okay.

Thanks,

BRs
Xiubo



Regards -- Andy

Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-13 Thread Xiubo Li



On 2017-02-14 01:42, Andy Grover wrote:

On 02/12/2017 05:38 PM, Xiubo Li wrote:

On 2017-02-11 02:02, Andy Grover wrote:

1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
to 1GB or larger
2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring
portion, and dynamically alloc pages for the data area as needed and
map them into the data area.

Should the data area mapping be dynamic, or should we add a new API for
userspace to trigger it?


I would recommend having the kernel make the decision to allocate/map 
more memory, or not (i.e. wait for existing I/O to complete and free 
pages) because this avoids the need for a new API, and also the kernel 
may better decide how to allocate total pages used across multiple 
TCMU devices.



Yes, agreed.



4. Implement an algorithm to keep allocated pages mapped into the data
area for reuse, and maybe a heuristic to keep extreme burstiness from
over-allocating pages


Does it work a little like slab?


It's definitely becoming less of a storage issue and more of a 
page-management issue. It might make sense to look at mm code and ask 
mm experts for their advice, yep.



Sure, I will investigate this in depth.

Thanks very much.

BRs
Xiubo



This should allow TCMU to allocate more data area as needed, not waste
memory for slower devices, and avoid userspace ABI changes. Could we
prototype this approach and see if it is workable?


I will investigate this further.


Thank you for working to improve this area!

Regards -- Andy

Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-13 Thread Andy Grover

On 02/13/2017 01:50 AM, Xiubo Li wrote:

There is one new scheme in my mind:



Yes I think it's now clear we need more buffer space to avoid
bottlenecks for high IOPS. The initial design kept it simple with the
1MB vmalloc'd space, but anticipated that more would be needed. It
should not be necessary to change userspace or the TCMU ABI to handle
growing the buffer for fast devices:

1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
to 1GB or larger

For the cmd area, set the size to a fixed 512M, and the data area's size
to a fixed 1G. Is that okay?


512M seems a little big for the cmd area... use of the cmd ring should
be much smaller than the data area. Each cmd has a minimum size, but
beyond that, describing an additional page in the iovec[] should take
just one struct iovec (16 bytes?) per 4K page.



3. Upgrade the current fixed-size bitmap-based tracking of data area
to handle the new scheme

A radix tree will be used to keep the mapping between block index (0 ~
1G/DATA_BLOCK_SIZE) and physical page. Each leaf is one data block (of
size DATA_BLOCK_SIZE).

For non-leaf nodes, use the radix tags[0][SLOTs] to indicate whether
slot[SLOTs]'s branch has free block leafs (reused old ones or NULL
leafs) or not.
This could speed up the search for free blocks in the data area.


Yes I think radix tree is a good place to start.

Regards -- Andy



Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-13 Thread Andy Grover

On 02/12/2017 05:38 PM, Xiubo Li wrote:

On 2017-02-11 02:02, Andy Grover wrote:

1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB
to 1GB or larger
2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring
portion, and dynamically alloc pages for the data area as needed and
map them into the data area.

Should the data area mapping be dynamic, or should we add a new API for
userspace to trigger it?


I would recommend having the kernel make the decision to allocate/map 
more memory, or not (i.e. wait for existing I/O to complete and free 
pages) because this avoids the need for a new API, and also the kernel 
may better decide how to allocate total pages used across multiple TCMU 
devices.



4. Implement an algorithm to keep allocated pages mapped into the data
area for reuse, and maybe a heuristic to keep extreme burstiness from
over-allocating pages


Does it work a little like slab?


It's definitely becoming less of a storage issue and more of a 
page-management issue. It might make sense to look at mm code and ask mm 
experts for their advice, yep.



This should allow TCMU to allocate more data area as needed, not waste
memory for slower devices, and avoid userspace ABI changes. Could we
prototype this approach and see if it is workable?


I will investigate this further.


Thank you for working to improve this area!

Regards -- Andy



Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-13 Thread Xiubo Li

Hi Andy,

There is one new scheme in my mind:



Yes I think it's now clear we need more buffer space to avoid
bottlenecks for high IOPS. The initial design kept it simple with the
1MB vmalloc'd space, but anticipated that more would be needed. It
should not be necessary to change userspace or the TCMU ABI to handle
growing the buffer for fast devices:


1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB 
to 1GB or larger
For the cmd area, set the size to a fixed 512M, and the data area's size
to a fixed 1G. Is that okay?


2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring 
portion, and dynamically alloc pages for the data area as needed and 
map them into the data area.
TCMU will just vmalloc() the 512M cmd area, and let the data area
memory be allocated and mapped later, when needed.
The userspace runner will mmap() the whole (512M + 1G) region when
initialising, getting back a 1.5G virtual address range. The cmd area
will be mapped to actual physical pages at that point, while the data
area will be mapped lazily, in the page-fault hook, when first used.
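The lazy data-area mapping described here would hang off the vma's fault handler, roughly as sketched below (the .fault signature shown is the pre-4.11 one and changed across kernel versions; tcmu_find_page() is a hypothetical helper that looks up, or allocates, the data block backing the faulting offset):

```c
/* Sketch of a lazy-mapping fault handler (kernel ~4.10 signature).
 * tcmu_find_page() is hypothetical: it would consult the block index
 * and allocate/map the backing page on first touch. */
static int tcmu_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct tcmu_dev *udev = vma->vm_private_data;
	struct page *page;

	page = tcmu_find_page(udev, vmf->pgoff);
	if (!page)
		return VM_FAULT_SIGBUS;

	get_page(page);
	vmf->page = page;	/* mm inserts it into the page tables */
	return 0;
}

static const struct vm_operations_struct tcmu_vm_ops = {
	.fault = tcmu_vma_fault,
};
```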



3. Upgrade the current fixed-size bitmap-based tracking of data area 
to handle the new scheme
A radix tree will be used to keep the mapping between block index (0 ~
1G/DATA_BLOCK_SIZE) and physical page. Each leaf is one data block (of
size DATA_BLOCK_SIZE).


For non-leaf nodes, use the radix tags[0][SLOTs] to indicate whether
slot[SLOTs]'s branch has free block leafs (reused old ones or NULL
leafs) or not.

This could speed up the search for free blocks in the data area.


4. Implement an algorithm to keep allocated pages mapped into the data 
area for reuse, and maybe a heuristic to keep extreme burstiness from 
over-allocating pages



For leaf nodes:
- If a leaf node exists and its tag[0][SLOT] == 0, some older cmds have
  already touched the block leaf and then "freed" it; in that case we
  will reuse it.
- If the leaf node exists and its tag[0][SLOT] == 1, it is still in use.
- If the leaf node does not exist, this is the first touch of this leaf:
  allocate memory, insert it here, and set tag[0][SLOT] to 1.


This should allow TCMU to allocate more data area as needed, not waste 
memory for slower devices, and avoid userspace ABI changes. Could we 
prototype this approach and see if it is workable?



For slower devices, it will save memory.
It also avoids changing the userspace ABI.

Thanks,

BRs
Xiubo




Thanks -- Regards -- Andy

Re: [PATCH RFC] target/user: Add double ring buffers support.

2017-02-10 Thread Andy Grover

On 02/09/2017 10:48 PM, lixi...@cmss.chinamobile.com wrote:

For now TCMU is based on the UIO framework and only uses map0 with a
fixed ring buffer size. This works fine mostly, but when we are using a
10GBASE NIC, the ring buffer is too small to deal with the incoming
iSCSI cmds.

We could resolve the above issue by increasing the ring buffer size, but
we don't know the exact size for all cases.

This patch adds double ring buffer support by adding map1 support
through the UIO device.


Hi Xiubo,

Yes I think it's now clear we need more buffer space to avoid 
bottlenecks for high iops. The initial design kept it simple with the 
1MB vmalloc'd space but anticipated greater would be needed. It should 
not be necessary to change userspace or the TCMU ABI to handle growing 
the buffer for fast devices:


1. increase the region mmap()ed by userspace, TCMU_RING_SIZE, from 1MB 
to 1GB or larger
2. Don't vmalloc() the whole thing, instead vmalloc for the cmd ring 
portion, and dynamically alloc pages for the data area as needed and map 
them into the data area.
3. Upgrade the current fixed-size bitmap-based tracking of data area to 
handle the new scheme
4. Implement an algorithm to keep allocated pages mapped into the data 
area for reuse, and maybe a heuristic to keep extreme burstiness from 
over-allocating pages


This should allow TCMU to allocate more data area as needed, not waste 
memory for slower devices, and avoid userspace ABI changes. Could we 
prototype this approach and see if it is workable?


Thanks -- Regards -- Andy