Hi Daniel,

In short, you can’t create an HDFS block with unallocated data. You can create a zero-length block, which will result in a zero-byte file being created on the datanode, but you can’t create a sparse file in HDFS. While HDFS has a nominal block size, e.g. 128 MB, a small file produces a block file on the datanode whose size matches the data, not the block length; creating a 32 kB HDFS file will in turn create a single 32 kB block file on each datanode holding a replica. HDFS is not built like a traditional file system with fixed-size blocks/extents at fixed disk locations.
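To see this for yourself, here’s a rough sketch (assumes a running cluster with the `hdfs` client on the PATH; the paths are illustrative). The `hdfs` steps are guarded so the snippet is harmless on a machine without Hadoop:

```shell
# Make a 32 kB local file, copy it into HDFS, and inspect its block.
dd if=/dev/urandom of=/tmp/small.bin bs=1024 count=32 2>/dev/null

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put -f /tmp/small.bin /tmp/small.bin
  # fsck reports a single block with len=32768, even though
  # dfs.blocksize defaults to 128 MB.
  hdfs fsck /tmp/small.bin -files -blocks
fi
```

The fsck output will show the block length as the actual data size, not the configured block size.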
Kind regards,
Jim

> On 12 Jan 2024, at 18:35, Daniel Howard <danny...@toldme.com> wrote:
>
> Thanks Jim,
>
> The scenario I have in mind is something like:
> 1) Ask HDFS to create a file that is 32k in length.
> 2) Attempt to read the contents of the file.
>
> Can I even attempt to read the contents of a file that has not yet been written? If so, what data would get sent?
>
> For example, I asked a version of this question of ganeti with regard to creating VMs. You can, by default, read the previous contents of the disk in your new VM, but they have an option to wipe newly allocated VM disks for added security.[1]
>
> [1]: https://groups.google.com/g/ganeti/c/-c_KoLd6mnI
>
> Thanks,
> -danny
>
> On Fri, Jan 12, 2024 at 8:03 AM Jim Halfpenny <jim.halfpe...@stackable.tech> wrote:
>> Hi Danny,
>> This depends on a number of circumstances, mostly file permissions. If, for example, a file is deleted without the -skipTrash option, it is moved to the .Trash directory. From there it could be read, but the original file permissions are preserved. Therefore a user who did not have read access before the file was deleted won’t be able to read it from .Trash, and a user who did have read access will keep it.
>>
>> When a file is deleted, its blocks are marked for deletion by the namenode and are no longer available through HDFS, but there will be some lag between the HDFS delete operation and the block files being removed from the datanodes. During that window it’s possible that someone could read the block files from the datanode file system directly, though not through the HDFS file system. The blocks will exist on disk until the datanode itself deletes them.
>>
>> The way HDFS works, you won’t get previous data when you create a new block, since unallocated space doesn’t exist in the same way as it does on a regular file system.
>> Each HDFS block maps to a file on the datanodes, and block files can be an arbitrary size, unlike the fixed block/extent size of a regular file system. You don’t “reuse” HDFS blocks; a block in HDFS is just a file on the datanode. You could potentially recover data from unallocated space on the datanode disk the same way you would for any other deleted file.
>>
>> If you want to remove the chance of data recovery on HDFS, then encrypting the blocks using HDFS transparent encryption is the way to do it. The encryption keys reside in the namenode metadata, so once they are deleted the data in that file is effectively lost. Beware of snapshots, though: a file deleted from the live HDFS view may still exist in a previous snapshot.
>>
>> Kind regards,
>> Jim
>>
>>> On 11 Jan 2024, at 21:50, Daniel Howard <danny...@toldme.com> wrote:
>>>
>>> Is it possible for a user with HDFS access to read the contents of a file previously deleted by a different user?
>>>
>>> I know a user can employ KMS to encrypt files with a personal key, making this sort of data leakage effectively impossible. But, without KMS, is it possible to allocate a file with uninitialized data, and then read the data that exists on the underlying disk?
>>>
>>> Thanks,
>>> -danny
>>>
>>> --
>>> http://dannyman.toldme.com
>
> --
> http://dannyman.toldme.com
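P.S. To make the trash and block-deletion behaviour from the earlier mail concrete, a sketch (assumes a running cluster; /data/secret.txt and the datanode data directory /hadoop/dfs/data are made-up examples, so substitute your own paths):

```shell
# Delete without -skipTrash: the file moves to the owner's trash,
# keeping its original permissions, and remains readable to anyone
# who could already read it.
hdfs dfs -rm /data/secret.txt
hdfs dfs -ls /user/$USER/.Trash/Current/data/

# Delete for real: the namenode marks the blocks for deletion and
# they disappear from the HDFS namespace immediately.
hdfs dfs -rm -skipTrash /data/secret.txt

# On a datanode, the block files may linger briefly until the datanode
# processes the deletion. With local root access they can still be read
# directly from the datanode's data directory during that window:
find /hadoop/dfs/data/current -name 'blk_*' | head
```

This is why direct filesystem access to datanode hosts matters as much as HDFS permissions do.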
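And a sketch of the transparent-encryption route mentioned above (assumes a KMS is configured and you have HDFS superuser rights for zone creation; the key and zone names are made up):

```shell
# Create a key in the KMS and an encryption zone bound to it.
hadoop key create danny-key
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName danny-key -path /secure

# Files written under /secure are encrypted on the datanodes, so raw
# block files recovered from disk are unreadable without the key.
hdfs crypto -listZones
```

Once the key material is destroyed, any residual block data (in trash, snapshots, or unallocated disk space) is effectively lost.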