Hi Daniel,
In short, you can’t create an HDFS block with unallocated data. You can create a 
zero-length block, which will result in a zero-byte file being created on the 
datanode, but you can’t create a sparse file in HDFS. While HDFS has a nominal 
block size, e.g. 128MB, if you create a small file then the file on the datanode 
will match the size of the data, not the block length; creating a 32kB HDFS file 
will in turn create a single 32kB block file on the datanodes. HDFS is not built 
like a traditional file system, with fixed-size blocks/extents at fixed disk 
locations.
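
To make that concrete, here is a minimal sketch using the Java FileSystem API (the class name, path and sizes below are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/small-file-demo"); // example path
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write(new byte[32 * 1024]); // write exactly 32kB
    }

    FileStatus stat = fs.getFileStatus(path);
    // getLen() reports the 32kB actually written; getBlockSize() is only the
    // nominal maximum (e.g. 128MB), not space consumed on the datanode disk.
    System.out.println("length=" + stat.getLen() + " blockSize=" + stat.getBlockSize());
  }
}

Reading the file back simply returns the 32kB you wrote; there is no preallocated remainder of the block to expose old data.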

Kind regards,
Jim

> On 12 Jan 2024, at 18:35, Daniel Howard <danny...@toldme.com> wrote:
> 
> Thanks Jim,
> 
> The scenario I have in mind is something like:
> 1) Ask HDFS to create a file that is 32k in length.
> 2) Attempt to read the contents of the file.
> 
> Can I even attempt to read the contents of a file that has not yet been 
> written? If so, what data would get sent?
> 
> For example, I asked a version of this question of ganeti with regard to 
> creating VMs. You can, by default, read the previous contents of the disk in 
> your new VM, but they have an option to wipe newly allocated VM disks for 
> added security.[1]
> 
> [1]: https://groups.google.com/g/ganeti/c/-c_KoLd6mnI
> 
> Thanks,
> -danny
> 
> On Fri, Jan 12, 2024 at 8:03 AM Jim Halfpenny <jim.halfpe...@stackable.tech> 
> wrote:
>> Hi Danny,
>> This depends on a number of circumstances, mostly around file 
>> permissions. If, for example, a file is deleted without the -skipTrash option, 
>> then it will be moved to the .Trash directory. From there it can be read, 
>> but the original file permissions are preserved. Therefore if a user did 
>> not have read access before the file was deleted, they won’t be able to read it 
>> from .Trash, and if they did have read access then this ought to remain the 
>> case.
>> 
>> If a file is deleted then the blocks are marked for deletion by the namenode 
>> and won’t be available through HDFS, but there will be some lag between the 
>> HDFS delete operation and the block files being removed from the datanodes. 
>> It’s possible that someone could read the block from the datanode file 
>> system directly, but not through the HDFS file system. The blocks will exist 
>> on disk until the datanode itself deletes them.
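>> 
>> To make that concrete, here is a rough local-filesystem sketch of finding raw 
>> block files on a datanode (the data directory below is only an example; the 
>> real location is whatever dfs.datanode.data.dir points at):
>> 
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.nio.file.Paths;
>> import java.util.stream.Stream;
>> 
>> public class FindBlockFiles {
>>   public static void main(String[] args) throws Exception {
>>     // dfs.datanode.data.dir is site-specific; /data/dfs/dn is just an example.
>>     Path dataDir = Paths.get("/data/dfs/dn/current");
>>     try (Stream<Path> files = Files.walk(dataDir)) {
>>       // Block files are named blk_<id>; each has a blk_<id>_<gs>.meta checksum file.
>>       files.filter(f -> f.getFileName().toString().startsWith("blk_"))
>>            .forEach(System.out::println);
>>     }
>>   }
>> }
>> 
>> Anyone with read access to those local files can recover block contents 
>> without going through HDFS at all.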
>> 
>> Because of the way HDFS works, you won’t get previous data when you create a 
>> new block, since unallocated space doesn’t exist in the same way it does on a 
>> regular file system. Each HDFS block maps to a file on the datanodes and 
>> block files can be an arbitrary size, unlike the fixed block/extent size of 
>> a regular file system. You don’t “reuse” HDFS blocks; a block in HDFS is 
>> just a file on the datanode. You could potentially recover data from 
>> unallocated space on the datanode disk the same way you would for any other 
>> deleted file.
>> 
>> If you want to remove the chance of data recovery on HDFS then encrypting 
>> the blocks using HDFS transparent encryption is the way to do it. The 
>> encryption keys reside in the namenode metadata, so once they are deleted the 
>> data in that file is effectively lost. Beware of snapshots though, since a 
>> file deleted from the live HDFS view may still exist in a previous snapshot.
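>> 
>> As a rough sketch, an encryption zone can be created from Java with HdfsAdmin 
>> (the URI, path and key name below are placeholders; it assumes a KMS is 
>> configured, the key already exists, e.g. via "hadoop key create mykey", and 
>> you have HDFS superuser rights):
>> 
>> import java.net.URI;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.hdfs.client.HdfsAdmin;
>> 
>> public class MakeEncryptionZone {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     // The namenode URI, zone path and key name are examples only.
>>     HdfsAdmin admin = new HdfsAdmin(new URI("hdfs://namenode:8020"), conf);
>>     admin.createEncryptionZone(new Path("/secure"), "mykey");
>>   }
>> }
>> 
>> Every file written under the zone is then encrypted per-file with keys 
>> derived from that zone key.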
>> 
>> Kind regards,
>> Jim
>> 
>> 
>>> On 11 Jan 2024, at 21:50, Daniel Howard <danny...@toldme.com> wrote:
>>> 
>>> Is it possible for a user with HDFS access to read the contents of a file 
>>> previously deleted by a different user?
>>> 
>>> I know a user can employ KMS to encrypt files with a personal key, making 
>>> this sort of data leakage effectively impossible. But, without KMS, is it 
>>> possible to allocate a file with uninitialized data, and then read the data 
>>> that exists on the underlying disk?
>>> 
>>> Thanks,
>>> -danny
>>> 
>>> --
>>> http://dannyman.toldme.com
> 
> 
> --
> http://dannyman.toldme.com
