I instrumented RandomAccessFile, FileChannel, ZipFile, multiple *Buffer classes for memory-mapped files etc., and have the same statistics for each: start/end of a read from/write to disk, the number of bytes involved, and such. I can plot these numbers over time and see that the HDFS JVMs write 1 TiB of data to disk during TeraGen (expected) and read and write 1 TiB from and to disk during TeraSort (expected).
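
To make the setup concrete, the recording side of such instrumentation could look roughly like the sketch below (a minimal sketch with my own placeholder names, not any existing library): each hook injected into a read/write method reports how many bytes it moved, keyed by the entry point.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sink for the injected hooks: every instrumented
    // read/write (RandomAccessFile, FileChannel, the *Buffer classes, ...)
    // reports the bytes it moved, keyed by the entry point, so totals
    // can later be compared per class against the filesystem numbers.
    public final class DiskIoStats {
        private static final ConcurrentMap<String, LongAdder> BYTES_READ =
                new ConcurrentHashMap<>();
        private static final ConcurrentMap<String, LongAdder> BYTES_WRITTEN =
                new ConcurrentHashMap<>();

        private DiskIoStats() {}

        public static void recordRead(String entryPoint, long bytes) {
            BYTES_READ.computeIfAbsent(entryPoint, k -> new LongAdder()).add(bytes);
        }

        public static void recordWrite(String entryPoint, long bytes) {
            BYTES_WRITTEN.computeIfAbsent(entryPoint, k -> new LongAdder()).add(bytes);
        }

        // Dump per-entry-point totals, e.g. from a JVM shutdown hook.
        public static void dump() {
            BYTES_READ.forEach((k, v) ->
                    System.err.println("read  " + k + " = " + v.sum() + " bytes"));
            BYTES_WRITTEN.forEach((k, v) ->
                    System.err.println("write " + k + " = " + v.sum() + " bytes"));
        }
    }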

Sorry for the enormous introduction, but now there's the interesting part: Flink's JVMs read from and write to disk 1 TiB of data each during TeraSort. I'm suspecting there is some sort of spilling involved, potentially because I have not done the setup properly. But I ran the same measurements for Hadoop's TeraSort, and there I'm not missing any data, meaning my statistics agree with XFS for TeraSort on Hadoop, which is why I suspect there are some cases where Flink goes to disk without me noticing it.

Therefore, here finally the question: in which cases does Flink go to disk (and which Java classes are involved, so I can check my bytecode instrumentation)? This would also include any kind of resource distribution via HDFS/YARN, I guess (like JAR files and I don't know what else). Seeing that I'm missing an amount of data equal to the size of my input set, I'd suspect the spilling, but I'm not sure. Maybe there is also some sort of remote I/O involved via sockets or so that I'm missing.
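
One way to separate socket traffic from real disk traffic, independent of the bytecode instrumentation: on Linux, /proc/<pid>/io reports read_bytes/write_bytes (bytes that actually reached the storage layer) next to rchar/wchar (all bytes passed through read()/write(), sockets included). A minimal sketch of reading it from inside a JVM, e.g. in a shutdown hook:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public final class ProcIoCheck {
        public static void main(String[] args) throws IOException {
            // /proc/self/io is the calling process's own I/O accounting.
            // read_bytes/write_bytes count bytes that really went to the
            // storage layer; rchar/wchar also include socket and pipe I/O.
            for (String line : Files.readAllLines(Paths.get("/proc/self/io"))) {
                System.out.println(line);
            }
        }
    }

Comparing these counters per JVM with my instrumented totals should show whether the missing amount went through file APIs I haven't hooked or through sockets.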
Any hints as to where Flink might incur disk I/O are greatly appreciated!
I'm also happy to do the digging myself once pointed to the proper packages in the Apache Flink source tree (I have done …