[ 
https://issues.apache.org/jira/browse/KUDU-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151564#comment-16151564
 ] 

Hao Hao edited comment on KUDU-2131 at 9/2/17 9:05 PM:
-------------------------------------------------------

[~adar] Yes, that is the issue.

I ran two tablet copy tests. The first passed without problems; the second 
failed.
1) On a fresh tablet server with a single disk and no open containers. The 
copied tablet was 37.2 GB.
2) On a large cluster that has been running with bits from several Kudu 
releases. The tablet server has 14 disks and 19199 containers, only 97 of 
which are full. The copied tablet was 26.81 GB.

I suspect the issue is related to the number of open containers. 
Looking at the code in the LBM (log block manager): in {{GetNextDataDir}} we 
always pop_front from the open container queue, whereas in 
{{MakeContainerAvailable}} we push_back the used container to the end of 
that queue.

I just ran an experiment equivalent to test 2) above, with 
{{MakeContainerAvailable}} changed to always add the container back at the 
front of the open container queue. This time DownloadBlocks finished quickly 
and the tablet copy session completed successfully.
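
To illustrate why the return policy might matter, here is a minimal simulation. It is not the actual LBM code: {{CountTouchedContainers}} and {{ReturnPolicy}} are hypothetical names, and the model assumes blocks are written one at a time and that no container fills up during the copy.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_set>

// Hypothetical model of where a used container is put back in the queue.
enum class ReturnPolicy {
  kBack,   // push_back: the behavior described in the comment above
  kFront,  // push_front: the experimental change
};

// Simulates writing `num_blocks` blocks sequentially against a queue of
// `num_available` open, non-full containers. Returns how many distinct
// containers received at least one block.
// Assumption: a container is never filled, so it always goes back in the queue.
std::size_t CountTouchedContainers(std::size_t num_available,
                                   std::size_t num_blocks,
                                   ReturnPolicy policy) {
  assert(num_available > 0);

  std::deque<std::size_t> available;
  for (std::size_t id = 0; id < num_available; ++id) {
    available.push_back(id);
  }

  std::unordered_set<std::size_t> touched;
  for (std::size_t b = 0; b < num_blocks; ++b) {
    // Always take the container at the front of the queue.
    const std::size_t container = available.front();
    available.pop_front();
    touched.insert(container);

    // Put the container back according to the policy under test.
    if (policy == ReturnPolicy::kBack) {
      available.push_back(container);
    } else {
      available.push_front(container);
    }
  }
  return touched.size();
}
```

Under these assumptions, with the 19102 non-full containers from test 2) (19199 minus 97) and a hypothetical 1000 blocks, the push_back policy spreads the blocks over 1000 distinct containers, while the push_front policy keeps reusing a single one. If the cost of the final sync grows with the number of containers touched, that would be consistent with the experiment above, though this is a guess rather than a measurement.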


was (Author: hahao):
[~adar] Yes, that is the issue.

I ran two tests:
1) On a fresh tablet server with a single disk and no open containers. The 
copied tablet was 37.2 GB.
2) On a large cluster that has been running with bits from several Kudu 
releases. The tablet server has 14 disks and 19199 containers, only 97 of 
which are full. The copied tablet was 26.81 GB.

I suspect the issue is related to the number of open containers. 
Looking at the code in the LBM (log block manager): in {{GetNextDataDir}} we 
always pop_front from the open container queue, whereas in 
{{MakeContainerAvailable}} we push_back the used container to the end of 
that queue.

I just ran an experiment equivalent to test 2) above, with 
{{MakeContainerAvailable}} changed to always add the container back at the 
front of the open container queue. This time DownloadBlocks finished quickly 
and the tablet copy session completed successfully.

> Tablet copy session may expire before completion for large tablet
> -----------------------------------------------------------------
>
>                 Key: KUDU-2131
>                 URL: https://issues.apache.org/jira/browse/KUDU-2131
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Hao Hao
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> KUDU-1726 introduced an optimization to do a bulk sync-to-disk once the 
> tablet copy operation is complete. However, when testing in a large cluster, I 
> found that synchronizing to disk in batches can cause the tablet copy session 
> to expire before the synchronization completes. There is a flag, 
> '--tablet_copy_idle_timeout_ms', that controls the amount of time without 
> activity before a tablet copy session expires, but it is tagged as hidden (not 
> user-facing).
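
The expiry described in the issue can be pictured with a small idle-timeout model. This is only a sketch and not Kudu's session code: the {{IdleSession}} class and its methods are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model of a session that expires after a period with no
// recorded activity. Not taken from the Kudu source.
class IdleSession {
 public:
  using Millis = std::chrono::milliseconds;

  IdleSession(Millis idle_timeout, Millis now)
      : idle_timeout_(idle_timeout), last_activity_(now) {
    assert(idle_timeout.count() > 0);
  }

  // Called whenever the session sees activity (for example, a request).
  void RecordActivity(Millis now) { last_activity_ = now; }

  // The session is expired once more than `idle_timeout_` has elapsed
  // since the last recorded activity.
  bool IsExpired(Millis now) const {
    return now - last_activity_ > idle_timeout_;
  }

 private:
  Millis idle_timeout_;
  Millis last_activity_;
};
```

In this model, a long step that records no activity while it runs (such as one large sync at the end of the copy) lets the idle clock run past the timeout, whereas the same amount of work interleaved with {{RecordActivity}} calls would not.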


