[jira] [Updated] (KUDU-3523) st_blksize is not alway equal to the filesystem block size
[ https://issues.apache.org/jira/browse/KUDU-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xixu Wang updated KUDU-3523:
Description:
On my aarch64 system, st_blksize is not equal to the real filesystem block size. st_blksize is 65536 bytes, but the block size of the filesystem is 4096 bytes. When writing data smaller than 4096 bytes, the file occupies 4096 bytes on disk, not 65536 bytes. Kudu, however, uses st_blksize to determine the filesystem block size, which is not always right.

This breaks a unit test: EncryptionEnabled/LogBlockManagerTest.ContainerPreallocationTest/1
{code:java}
/root/kudu/src/kudu/fs/log_block_manager-test.cc:541: Failure
Expected equality of these values:
  FLAGS_log_container_preallocate_bytes
    Which is: 33554432
  size
    Which is: 33492992
{code}
The code is as follows:
!image-2023-11-08-14-35-08-189.png!

FLAGS_log_container_preallocate_bytes=33554432 bytes

The file is encrypted, so the encryption header occupies one filesystem block. After creating the first block, there should be 2 blocks on disk. On my system (aarch64 kylin-10), st_blksize=65536, but the filesystem block size is 4096 (see part 4 below). When the encryption header is written to the file, its on-disk size is 4096 bytes. When a new block is written, its offset is 65536, because Kudu uses st_blksize to decide the next block offset (see src/kudu/util/env_posix.cc#GetBlockSize()). Therefore, the first block holds only 4096 bytes on disk, but Kudu thinks it occupies 65536 bytes and preallocates (FLAGS_log_container_preallocate_bytes - 1) bytes for this file. In fact, this leaves a hole of (65536 - 4096) bytes in the file. As a result, the file size on disk is (FLAGS_log_container_preallocate_bytes - (65536 - 4096)) = 33492992.

{color:#de350b}In my opinion, Kudu should use the filesystem block size (f_bsize) as the Kudu block size, not st_blksize.{color}

*1. The test environment*
Linux hybrid01 4.19.90-23.30.v2101.ky10.aarch64 #1 SMP Thu Dec 15 09:57:55 CST 2022 aarch64 aarch64 aarch64 GNU/Linux. A docker container runs on it.

*2. Create a file with an encryption header*
{code:java}
const string kFile = JoinPathSegments(test_dir_, "encrypted_file");
unique_ptr<RWFile> rw;
RWFileOptions opts;
opts.is_sensitive = true;
ASSERT_OK(env_->NewRWFile(opts, kFile, &rw));
uint64_t file_size = 0;
env_->GetFileSizeOnDisk(kFile, &file_size);
{code}

*3. stat the file*
The IO block size is 65536, which means st_blksize is 65536; the logical file size is 64 bytes.
!image-2023-11-06-15-42-46-082.png!

*4. The filesystem block size is 4096 bytes*
!image-2023-11-06-15-45-39-233.png!

*5. The file's on-disk size is 4096 bytes*
!image-2023-11-06-15-52-41-834.png!

was:
On my aarch64 system, st_blksize is not equal to the real filesystem block size. st_blksize is 65536 bytes, but the block size of the filesystem is 4096 bytes. When writing data smaller than 4096 bytes, the file occupies 4096 bytes on disk, not 65536 bytes. Kudu, however, uses st_blksize to determine the filesystem block size, which is not always right.

*1. The test environment*
Linux hybrid01 4.19.90-23.30.v2101.ky10.aarch64 #1 SMP Thu Dec 15 09:57:55 CST 2022 aarch64 aarch64 aarch64 GNU/Linux. A docker container runs on it.

*2. Create a file with an encryption header*
{code:java}
const string kFile = JoinPathSegments(test_dir_, "encrypted_file");
unique_ptr<RWFile> rw;
RWFileOptions opts;
opts.is_sensitive = true;
ASSERT_OK(env_->NewRWFile(opts, kFile, &rw));
uint64_t file_size = 0;
env_->GetFileSizeOnDisk(kFile, &file_size);
{code}

*3. stat the file*
The IO block size is 65536, which means st_blksize is 65536; the logical file size is 64 bytes.
!image-2023-11-06-15-42-46-082.png!

*4. The filesystem block size is 4096 bytes*
!image-2023-11-06-15-45-39-233.png!

*5. The file's on-disk size is 4096 bytes*
!image-2023-11-06-15-52-41-834.png!
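The arithmetic behind the failing assertion can be sketched as follows. This is only an illustration, not Kudu code: the function name `ExpectedSizeOnDisk` and its parameters are hypothetical, and it assumes the scenario from the description, where a single header block is laid out on an st_blksize boundary but only one filesystem block of it is actually written.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Kudu). Computes the on-disk size of a
// preallocated container when the first block is reserved at the IO block
// size (st_blksize) but only one filesystem block of it is backed by data.
// The difference between the two sizes is a hole that uses no disk space.
uint64_t ExpectedSizeOnDisk(uint64_t preallocate_bytes,
                            uint64_t io_block_size,
                            uint64_t fs_block_size) {
  assert(io_block_size >= fs_block_size);
  const uint64_t hole = io_block_size - fs_block_size;
  assert(preallocate_bytes >= hole);
  return preallocate_bytes - hole;
}
```

With the values from the report, `ExpectedSizeOnDisk(33554432, 65536, 4096)` gives 33492992, the size the test observed. When both block sizes are equal there is no hole and the result equals the preallocation size.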
> st_blksize is not alway equal to the filesystem block size
> --
>
> Key: KUDU-3523
> URL: https://issues.apache.org/jira/browse/KUDU-3523
> Project: Kudu
> Issue Type: Bug
> Reporter: Xixu Wang
> Priority: Major
> Attachments: image-2023-11-06-15-42-46-082.png, image-2023-11-06-15-45-11-819.png, image-2023-11-06-15-45-39-233.png, image-2023-11-06-15-52-41-834.png, image-2023-11-08-14-35-08-189.png
>
> In my aarch64 architecture system, the st_blksize is not equal to the real filesystem block size. The st_blksize in my system is 65536 bytes, but the block size of the filesystem is 4096 bytes. When writing some data which size is less than 4096 bytes, the file on disk size is 4096 bytes not 65536 bytes.
> But in kudu, it use
[jira] [Updated] (KUDU-3523) st_blksize is not alway equal to the filesystem block size
[ https://issues.apache.org/jira/browse/KUDU-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xixu Wang updated KUDU-3523:
Attachment: image-2023-11-08-14-35-08-189.png

-- This message was sent by Atlassian Jira (v8.20.10#820010)
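To compare the two values discussed in this issue on a given machine, a small standalone program can query both. The sketch below is not Kudu code: `BlockSizes`, `QueryBlockSizes` and `PrintBlockSizes` are hypothetical names, and it assumes that `statvfs(3)`'s `f_bsize` is the filesystem block size the report refers to.

```cpp
#include <sys/stat.h>
#include <sys/statvfs.h>

#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical struct for illustration only (not a Kudu type).
struct BlockSizes {
  uint64_t io_block_size;  // st_blksize: preferred IO size hint from stat(2)
  uint64_t fs_block_size;  // f_bsize: block size reported by statvfs(3)
};

// Queries both sizes for 'path'. Returns false if either call fails.
bool QueryBlockSizes(const char* path, BlockSizes* out) {
  assert(path != nullptr && out != nullptr);
  struct stat st;
  if (stat(path, &st) != 0) return false;
  struct statvfs vfs;
  if (statvfs(path, &vfs) != 0) return false;
  out->io_block_size = static_cast<uint64_t>(st.st_blksize);
  out->fs_block_size = static_cast<uint64_t>(vfs.f_bsize);
  return true;
}

// Prints both sizes and notes whether they differ.
void PrintBlockSizes(const char* path) {
  BlockSizes sizes{};
  if (!QueryBlockSizes(path, &sizes)) {
    std::perror(path);
    return;
  }
  std::printf("%s: st_blksize=%llu f_bsize=%llu%s\n", path,
              static_cast<unsigned long long>(sizes.io_block_size),
              static_cast<unsigned long long>(sizes.fs_block_size),
              sizes.io_block_size == sizes.fs_block_size ? "" : " (differ)");
}
```

Calling `PrintBlockSizes` on a file in a Kudu data directory would show whether the environment has the same mismatch as the reporter's system. The printed values depend on the machine and filesystem.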
[jira] [Commented] (KUDU-3523) st_blksize is not alway equal to the filesystem block size
[ https://issues.apache.org/jira/browse/KUDU-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783889#comment-17783889 ]

Xixu Wang commented on KUDU-3523:

Thanks for your comments. [~aserbin]
[jira] [Comment Edited] (KUDU-3523) st_blksize is not alway equal to the filesystem block size
[ https://issues.apache.org/jira/browse/KUDU-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783824#comment-17783824 ]

Alexey Serbin edited comment on KUDU-3523 at 11/8/23 1:11 AM:

[~wangxixu],

Thank you for reporting the issue!

Indeed, as per [1|https://www.man7.org/linux/man-pages/man2/stat.2.html], the {{stat}} utility reports different 'block sizes' when run on a file and on a filesystem (that's why there is the {{\-f}} option). That's right: st_blksize isn't supposed to always be equal to the filesystem block size, and that's "by design", AFAIK. 'st_blksize' stands for the IO block size, i.e. the "preferred" block size for efficient filesystem IO (a.k.a. the optimal IO transfer size hint), see [2|https://www.man7.org/linux/man-pages/man2/statx.2.html].

As for filesystem operations and {{LogBlockManager}} in Kudu, one crucial invariant is that 'st_blksize' is a multiple of the filesystem's block size. At least, this invariant is important in the scope of addressing [KUDU-620|https://issues.apache.org/jira/browse/KUDU-620] (that's where the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field originated).

In the scope of this JIRA, please feel free to add references to particular places where the difference between the IO block size and the filesystem block size might lead to inconsistencies. I guess one of those is the misleading name of the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field. There are probably more places where the difference matters and could lead to issues in actual functionality.

# [https://www.man7.org/linux/man-pages/man2/stat.2.html|https://www.man7.org/linux/man-pages/man2/stat.2.html]
# [https://www.man7.org/linux/man-pages/man2/statx.2.html|https://www.man7.org/linux/man-pages/man2/statx.2.html]

was (Author: aserbin):
[~wangxixu],

Thank you for reporint the issue!

Indeed, as per [1|https://www.man7.org/linux/man-pages/man2/stat.2.html], the {{stat}} utility reports different 'block sizes' when run on a file and on a filesystem (that's why there is the {{\-f}} option). That's right: st_blksize isn't supposed to always be equal to the filesystem block size, and that's "by design", AFAIK. 'st_blksize' stands for the IO block size, i.e. the "preferred" block size for efficient filesystem IO (a.k.a. the optimal IO transfer size hint), see [2|https://www.man7.org/linux/man-pages/man2/statx.2.html].

As for filesystem operations and {{LogBlockManager}} in Kudu, one crucial invariant is that 'st_blksize' is a multiple of the filesystem's block size. At least, this invariant is important in the scope of addressing [KUDU-620|https://issues.apache.org/jira/browse/KUDU-620] (that's where the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field originated).

In the scope of this JIRA, please feel free to add references to particular places where the difference between the IO block size and the filesystem block size might lead to inconsistencies. I guess one of those is the misleading name of the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field. There are probably more places where the difference matters and could lead to issues in actual functionality.
# [https://www.man7.org/linux/man-pages/man2/stat.2.html|https://www.man7.org/linux/man-pages/man2/stat.2.html]
# [https://www.man7.org/linux/man-pages/man2/statx.2.html|https://www.man7.org/linux/man-pages/man2/statx.2.html]
[jira] [Commented] (KUDU-3523) st_blksize is not alway equal to the filesystem block size
[ https://issues.apache.org/jira/browse/KUDU-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783824#comment-17783824 ]

Alexey Serbin commented on KUDU-3523:

[~wangxixu],

Thank you for reporting the issue!

Indeed, as per [1|https://www.man7.org/linux/man-pages/man2/stat.2.html], the {{stat}} utility reports different 'block sizes' when run on a file and on a filesystem (that's why there is the {{\-f}} option). That's right: st_blksize isn't supposed to always be equal to the filesystem block size, and that's "by design", AFAIK. 'st_blksize' stands for the IO block size, i.e. the "preferred" block size for efficient filesystem IO (a.k.a. the optimal IO transfer size hint), see [2|https://www.man7.org/linux/man-pages/man2/statx.2.html].

As for filesystem operations and {{LogBlockManager}} in Kudu, one crucial invariant is that 'st_blksize' is a multiple of the filesystem's block size. At least, this invariant is important in the scope of addressing [KUDU-620|https://issues.apache.org/jira/browse/KUDU-620] (that's where the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field originated).

In the scope of this JIRA, please feel free to add references to particular places where the difference between the IO block size and the filesystem block size might lead to inconsistencies. I guess one of those is the misleading name of the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field. There are probably more places where the difference matters and could lead to issues in actual functionality.
# [https://www.man7.org/linux/man-pages/man2/stat.2.html|https://www.man7.org/linux/man-pages/man2/stat.2.html]
# [https://www.man7.org/linux/man-pages/man2/statx.2.html|https://www.man7.org/linux/man-pages/man2/statx.2.html]
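The invariant mentioned in the comment (the IO block size being a multiple of the filesystem block size) and its effect on block offsets can be illustrated with two small helpers. These are hypothetical functions written for this example, not Kudu's implementation; the 64-byte size used in the note below is the logical file size quoted in the issue description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: true if the IO block size (st_blksize) is a whole
// multiple of the filesystem block size.
bool IsMultipleOf(uint64_t io_block_size, uint64_t fs_block_size) {
  assert(fs_block_size > 0);
  return io_block_size % fs_block_size == 0;
}

// Hypothetical helper: rounds 'offset' up to the next multiple of
// 'block_size'. Offsets that are already aligned are returned unchanged.
uint64_t AlignUp(uint64_t offset, uint64_t block_size) {
  assert(block_size > 0);
  return (offset + block_size - 1) / block_size * block_size;
}
```

For the reported system, `IsMultipleOf(65536, 4096)` holds, so the invariant is satisfied. The discrepancy shows up in alignment: a 64-byte header rounds up to 65536 when aligned to st_blksize, but to 4096 when aligned to the filesystem block size.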
[jira] [Commented] (KUDU-2195) Enforce durability happened before relationships on multiple disks
[ https://issues.apache.org/jira/browse/KUDU-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783767#comment-17783767 ]

ASF subversion and git services commented on KUDU-2195:

Commit 13a66ea9b088eec1de74249b738cc74333eefc4a in kudu's branch refs/heads/master from Attila Bukor
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=13a66ea9b ]

[tools] KUDU-3337 Add unsafe_create_cmeta tool

We've seen some cases where a power outage on XFS led to empty cmeta files, causing some tablets to fail to start (KUDU-2195). There is a flag to force fsync, but it's disabled by default except for XFS.

Fortunately, it's possible to reconstruct what a cmeta should look like based on the information found in ksck (peers) and WAL dumps (term and config index). Still, even when this information is available, the only way to actually create a cmeta file was to copy an existing cmeta file and run "kudu pbc edit" on it, which is very error-prone and hard to automate.

This commit introduces a new unsafe_create_cmeta tool under local_replica, which creates a new cmeta file based on the term, config index and peers provided as CLI arguments.

I manually tested this tool by using it to recover a tablet with three empty cmeta files.

Change-Id: I136cc5b5797420a9ca9156f37c3e281da0c265d7
Reviewed-on: http://gerrit.cloudera.org:8080/18029
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin

> Enforce durability happened before relationships on multiple disks
> --
>
> Key: KUDU-2195
> URL: https://issues.apache.org/jira/browse/KUDU-2195
> Project: Kudu
> Issue Type: Bug
> Components: consensus, tablet
> Affects Versions: 1.9.0
> Reporter: David Alves
> Priority: Major
>
> When using weaker durability semantics (e.g. when log_force_fsync is off), we should still enforce certain happened-before relationships, which are not currently enforced when using different disks for the wal and data.
> The two cases that come to mind where this is relevant are:
> 1) cmeta (c) -> wal (w): We flush cmeta before flushing the wal (for instance on term change) with the intention that either {}, \{c} or \{c, w} were made durable.
> 2) wal (w) -> tablet meta (t): We flush the wal before tablet metadata to make sure that all commit messages that refer to on-disk row sets (and deltas) are on disk before the row sets they point to, i.e. with the intention that either {}, \{w} or \{w, t} were made durable.
> With strong durability semantics these are always made durable in the right order. With weaker semantics that is not the case, though. If using the same disk for both the wal and data, the invariants are still preserved, as buffers will be flushed in the right order. But if using different disks for the wal and data (and because cmeta is stored with the data), that is not always the case.
> Case 1) is actually safe on ext4, because we perform an (indirect) fsync when flushing cmeta: rename() implies fsync on ext4. It is not safe on xfs.
> Case 2) is not safe on either filesystem.
> --- Possible solutions ---
> For 1): Store cmeta with the wal; actually always fsync cmeta.
> For 2): Store tablet meta with the wal; always fsync the wal before flushing tablet meta.
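The "always fsync" option proposed for case 1) can be sketched with plain POSIX calls: write a temporary file, fsync it explicitly, rename it into place, then fsync the parent directory. This is a generic illustration and not Kudu's cmeta code. The function name `WriteFileDurably`, the `.tmp` suffix and the file mode are assumptions made for the example.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper (not Kudu code): replaces dir/name with 'data' so that
// the contents are durable before the rename, without relying on rename()
// implying an fsync on any particular filesystem.
bool WriteFileDurably(const std::string& dir, const std::string& name,
                      const std::string& data) {
  const std::string tmp_path = dir + "/" + name + ".tmp";
  const std::string final_path = dir + "/" + name;

  int fd = open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;

  size_t written = 0;
  while (written < data.size()) {
    ssize_t n = write(fd, data.data() + written, data.size() - written);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted before writing: retry
      close(fd);
      return false;
    }
    written += static_cast<size_t>(n);
  }

  // Step 1: make the file contents durable before the file becomes visible.
  if (fsync(fd) != 0) {
    close(fd);
    return false;
  }
  if (close(fd) != 0) return false;

  // Step 2: atomically move the new file into place.
  if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) return false;

  // Step 3: fsync the directory so that the rename itself is durable.
  int dir_fd = open(dir.c_str(), O_RDONLY | O_DIRECTORY);
  if (dir_fd < 0) return false;
  const bool ok = fsync(dir_fd) == 0;
  close(dir_fd);
  return ok;
}
```

A call such as `WriteFileDurably("/tmp", "cmeta_example", "term=1")` leaves either the old file or the complete new one after a crash. Ordering writes across two disks, as in case 2), would additionally require syncing the first file before starting to write the second.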
[jira] [Commented] (KUDU-3337) Tool to manually create cmeta files
[ https://issues.apache.org/jira/browse/KUDU-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783766#comment-17783766 ]

ASF subversion and git services commented on KUDU-3337:

Commit 13a66ea9b088eec1de74249b738cc74333eefc4a in kudu's branch refs/heads/master from Attila Bukor
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=13a66ea9b ]

[tools] KUDU-3337 Add unsafe_create_cmeta tool

We've seen some cases where a power outage on XFS led to empty cmeta files, causing some tablets to fail to start (KUDU-2195). There is a flag to force fsync, but it's disabled by default except for XFS.

Fortunately, it's possible to reconstruct what a cmeta should look like based on the information found in ksck (peers) and WAL dumps (term and config index). Still, even when this information is available, the only way to actually create a cmeta file was to copy an existing cmeta file and run "kudu pbc edit" on it, which is very error-prone and hard to automate.

This commit introduces a new unsafe_create_cmeta tool under local_replica, which creates a new cmeta file based on the term, config index and peers provided as CLI arguments.

I manually tested this tool by using it to recover a tablet with three empty cmeta files.

Change-Id: I136cc5b5797420a9ca9156f37c3e281da0c265d7
Reviewed-on: http://gerrit.cloudera.org:8080/18029
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin

> Tool to manually create cmeta files
> ---
>
> Key: KUDU-3337
> URL: https://issues.apache.org/jira/browse/KUDU-3337
> Project: Kudu
> Issue Type: New Feature
> Reporter: Attila Bukor
> Assignee: Attila Bukor
> Priority: Major
>
> Power outages can lead to empty cmeta files on XFS (KUDU-2195), and sometimes all replicas are affected. By checking the ksck output and the WAL dumps, it's possible to reconstruct what the cmeta should look like, except for the voted_for part. That part isn't required to bootstrap a tablet, so a tool to manually create a cmeta file would be useful to recover such tablets.