It is hard to calculate; it very much depends on the job:

 - Is it running filters that reduce the data volume early?
 - Is it possibly running operations that blow up the size of an
   intermediate result?

In general, I would assume you need about as much temp space as the input
data size. There are two exceptions: you have so much RAM that the system
can process the job completely in memory, or you filter the input data
aggressively right away (for example, a highly selective filter function
directly after the readTextFile(...) or so).

Stephan

On Mon, Nov 10, 2014 at 5:59 PM, Malte Schwarzer <[email protected]> wrote:

> What's the estimated amount of disk space for such a job? Or how can I
> calculate it?
>
> Malte
>
> From: Stephan Ewen <[email protected]>
> Reply-To: <[email protected]>
> Date: Monday, November 10, 2014 11:22
> To: <[email protected]>
> Subject: Re: How to make Flink to write less temporary files?
>
> Hi!
>
> With 10 nodes and 25 GB on each node, you have 250 GB of space to spill
> temporary files. You also seem to have roughly the same size in JVM heap,
> out of which Flink can use roughly 2/3.
>
> When you process 1 TB, 250 GB of JVM heap and 250 GB of temp file space
> may not be enough, since that is less than the initial data size.
>
> I think you simply need more disk space for a job like that...
>
> Stephan
>
> On Mon, Nov 10, 2014 at 10:54 AM, Malte Schwarzer <[email protected]> wrote:
>
>> My blobStore files are small, but each *.channel file is around 170 MB.
>> Before I start my Flink job I have 25 GB of free space available in my
>> tmp-dir, and my taskmanager heap size is currently at 24 GB. I'm using
>> a cluster with 10 nodes.
>>
>> Is this enough space to process a 1 TB file?
>>
>> From: Stephan Ewen <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Monday, November 10, 2014 10:35
>> To: <[email protected]>
>> Subject: Re: How to make Flink to write less temporary files?
>>
>> I would assume that the blobStore files are rather small (they are only
>> jar files so far).
>>
>> I would look for *.channel files, which are spilled intermediate
>> results. They can get pretty large for large jobs.
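
For what it is worth, the back-of-the-envelope arithmetic from this thread can be written down as a small Python sketch. It only uses the figures quoted above (10 nodes, 25 GB temp space and 24 GB heap per node, roughly 2/3 of the heap usable by Flink, 1 TB input). The function names and the selectivity parameter are made up for illustration and are not part of any Flink API, and treating "fits in usable heap" as "needs no temp space" is a simplifying assumption, not how Flink actually decides when to spill.

```python
# Rough capacity check based on the rule of thumb in this thread.
# All names below are hypothetical helpers, not Flink APIs.


def cluster_capacity(nodes, tmp_gb_per_node, heap_gb_per_node,
                     usable_heap_fraction=2 / 3):
    """Return (total temp space, heap usable by Flink), both in GB."""
    temp_total = nodes * tmp_gb_per_node
    heap_usable = nodes * heap_gb_per_node * usable_heap_fraction
    return temp_total, heap_usable


def estimated_temp_need_gb(input_gb, heap_usable_gb, selectivity=1.0):
    """Rule of thumb: plan for temp space equal to the input size.

    selectivity is the (assumed) fraction of the input that survives an
    early filter. If the surviving data fits into the usable heap, this
    sketch assumes the job runs in memory and needs no temp space.
    """
    effective_gb = input_gb * selectivity
    if effective_gb <= heap_usable_gb:
        return 0.0
    return effective_gb


if __name__ == "__main__":
    temp_total, heap_usable = cluster_capacity(
        nodes=10, tmp_gb_per_node=25, heap_gb_per_node=24)
    need = estimated_temp_need_gb(input_gb=1000, heap_usable_gb=heap_usable)
    print("temp space available (GB):", temp_total)
    print("heap usable by Flink (GB):", round(heap_usable))
    print("estimated temp space needed (GB):", need)
    print("likely enough temp space:", temp_total >= need)
```

With the numbers from this thread the estimate exceeds the available temp space. Passing a small selectivity (an aggressive early filter) changes the picture, which is why the answer depends so much on the job.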
