Hi Miguel, have you found a solution to your problem? I'm not a Docker expert, but this forum thread looks like it could be related to your problem [1].
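One way to check what the TaskManager JVM itself sees is to ask Java for the free space of the file store behind the tmp dir. The following is only a minimal sketch: the class name TmpDirSpace is made up for illustration, and passing /home/flink/htmp assumes the volume mapping described in the quoted mail below. Run inside the TaskManager container, it should show whether the mounted directory really offers the space the host reports.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Flink.
class TmpDirSpace {

    /** Bytes this JVM may still write to the file store that backs dir. */
    static long usableBytes(Path dir) {
        try {
            return Files.getFileStore(dir).getUsableSpace();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One-line summary: store name, file system type, usable and total GiB. */
    static String describe(Path dir) {
        try {
            FileStore store = Files.getFileStore(dir);
            long gib = 1024L * 1024L * 1024L;
            return String.format("%s on %s (%s): %d GiB usable of %d GiB",
                    dir, store.name(), store.type(),
                    store.getUsableSpace() / gib, store.getTotalSpace() / gib);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Falls back to the JVM default tmp dir when no argument is given.
        String dir = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        System.out.println(describe(Paths.get(dir)));
    }
}
```

For example, `java TmpDirSpace /home/flink/htmp` inside the container, compared with the same call on the host directory, would show whether both sides agree on the available space.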
Best, Fabian

[1] https://forums.docker.com/t/no-space-left-on-device-error/10894

2016-12-02 17:43 GMT+01:00 Miguel Coimbra <miguel.e.coim...@gmail.com>:

> Hello Fabian,
>
> I have created a directory in my user directory on the host machine
> (/home/myuser/mydir) and I am mapping it as a volume with Docker for the
> TaskManager and JobManager containers.
> Each container will thus have the following directory: /home/flink/htmp
>
> host ---> container
> /home/myuser/mydir ---> /home/flink/htmp
>
> I had previously done this successfully with a host directory which
> holds several SNAP data sets.
> In the Flink configuration file, I specified /home/flink/htmp to be used
> as the tmp dir for the TaskManager.
> This seems to be working, as I was able to start the cluster and invoke
> Flink for that Friendster dataset.
>
> However, during execution, there were 2 intermediate files which kept
> growing until they reached about 30 GB.
> At that point, the Flink TaskManager threw the exception again:
>
> java.lang.RuntimeException: Error obtaining the sorted input: Thread
> 'SortMerger spilling thread' terminated due to an exception: No space left
> on device
>
> Here is an ls excerpt of the directory on the host (to which the
> TaskManager container was also writing successfully) shortly before the
> exception:
>
> *31G* 9d177a1971322263f1597c3378885ccf.channel
> *31G* a693811249bc5f785a79d1b1b537fe93.channel
>
> Now I believe the host system is capable of storing hundreds of GB more,
> so I am confused as to what the problem might be.
>
> Best regards,
>
> Miguel E. Coimbra
> Email: miguel.e.coim...@gmail.com <miguel.e.coim...@ist.utl.pt>
> Skype: miguel.e.coimbra
>
>
>> Hi Miguel,
>>
>> the exception does indeed indicate that the process ran out of available
>> disk space.
>> The quoted paragraph of the blog post describes the situation when you
>> receive the IOE.
>>
>> By default the system's default tmp dir is used.
>> I don't know which folder that would be in a Docker setup.
>> You can configure the temp dir using the taskmanager.tmp.dirs config key.
>> Please see the configuration documentation for details [1].
>>
>> Hope this helps,
>> Fabian
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/config.html#jobmanager-amp-taskmanager
>>
>> 2016-12-02 0:18 GMT+01:00 Miguel Coimbra <miguel.e.coim...@gmail.com>:
>>
>>> Hello,
>>>
>>> I have a problem for which I hope someone will be able to give a hint.
>>> I am running the Flink *standalone* cluster with 2 Docker containers (1
>>> TaskManager and 1 JobManager) using 1 TaskManager with 30 GB of RAM.
>>>
>>> The dataset is a large one: SNAP Friendster, which has around 1800 M
>>> edges.
>>> https://snap.stanford.edu/data/com-Friendster.html
>>>
>>> I am trying to run the Gelly built-in label propagation algorithm on top
>>> of it.
>>> As this is a very big dataset, I believe I am exceeding the available
>>> RAM and that the system is using secondary storage, which then fails:
>>>
>>>
>>> Connected to JobManager at Actor[akka.tcp://flink@172.19.0.2:6123/user/jobmanager#894624508]
>>> 12/01/2016 17:58:24 Job execution switched to status RUNNING.
>>> 12/01/2016 17:58:24 DataSource (at main(App.java:33) (org.apache.flink.api.java.io.TupleCsvInputFormat))(1/1) switched to SCHEDULED
>>> 12/01/2016 17:58:24 DataSource (at main(App.java:33) (org.apache.flink.api.java.io.TupleCsvInputFormat))(1/1) switched to DEPLOYING
>>> 12/01/2016 17:58:24 DataSource (at main(App.java:33) (org.apache.flink.api.java.io.TupleCsvInputFormat))(1/1) switched to RUNNING
>>> 12/01/2016 17:58:24 Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to SCHEDULED
>>> 12/01/2016 17:58:24 Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to DEPLOYING
>>> 12/01/2016 17:58:24 Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to RUNNING
>>> 12/01/2016 17:59:51 Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to FAILED
>>> *java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: No space left on device*
>>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>>>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1098)
>>>     at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:86)
>>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:486)
>>>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:351)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> *Caused by: java.io.IOException: Thread 'SortMerger spilling thread' terminated due to an exception: No space left on device*
>>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
>>> Caused by: java.io.IOException: No space left on device
>>>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>>     at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>>>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>>     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>>     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>>>     at org.apache.flink.runtime.io.disk.iomanager.SegmentWriteRequest.write(AsynchronousFileIOChannel.java:344)
>>>     at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync$WriterThread.run(IOManagerAsync.java:502)
>>>
>>>
>>> I do not have secondary storage limitations on the host system, so I
>>> believe the system would be able to handle whatever is spilled to the
>>> disk...
>>> Perhaps this is a Docker limitation regarding the usage of the host's
>>> secondary storage?
>>>
>>> Or is there perhaps some configuration or setting for the TaskManager
>>> which I am missing?
>>> Running the label propagation of Gelly on this dataset and cluster
>>> configuration, what would be the expected behavior if the system consumes
>>> all the memory?
>>>
>>>
>>> I believe the SortMerger thread is associated with the following
>>> mechanism described in this blog post:
>>>
>>> https://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
>>> *The Sort-Merge-Join works by first sorting both input data sets on
>>> their join key attributes (Sort Phase) and merging the sorted data sets as
>>> a second step (Merge Phase). The sort is done in-memory if the local
>>> partition of a data set is small enough. Otherwise, an external merge-sort
>>> is done by collecting data until the working memory is filled, sorting it,
>>> writing the sorted data to the local filesystem, and starting over by
>>> filling the working memory again with more incoming data. After all input
>>> data has been received, sorted, and written as sorted runs to the local
>>> file system, a fully sorted stream can be obtained.
>>> This is done by reading the partially sorted runs from the local
>>> filesystem and sort-merging the records on the fly. Once the sorted
>>> streams of both inputs are available, both streams are sequentially read
>>> and merge-joined in a zig-zag fashion by comparing the sorted join key
>>> attributes, building join element pairs for matching keys, and advancing
>>> the sorted stream with the lower join key.*
>>>
>>> I am still investigating the possibility that Docker is at fault
>>> regarding secondary storage limitations, but how would I go about
>>> estimating the amount of disk space needed for this spilling on this
>>> dataset?
>>>
>>> Thanks for your time,
>>>
>>> My best regards,
>>>
>>> Miguel E. Coimbra
>>> Email: miguel.e.coim...@gmail.com <miguel.e.coim...@ist.utl.pt>
>>> Skype: miguel.e.coimbra
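
For reference, the external merge-sort described in the blog excerpt quoted above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Flink's UnilateralSortMerger: records are plain strings without line breaks, maxInMemory stands in for the working memory, and the class name, method names and run-*.channel file naming are hypothetical. What it illustrates is that every record is written to the spill directory once, so in this sketch the sorted runs together occupy about as much disk space as the serialized input. How closely that carries over to Flink's own serialized format is not something the sketch can show.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical toy sorter, not Flink code.
class ExternalSortSketch {

    /** One sorted run on disk plus the next unread record from it. */
    private static final class RunCursor implements Comparable<RunCursor> {
        final BufferedReader reader;
        String head;

        RunCursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(RunCursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Sorts input while holding at most maxInMemory records in memory during the sort phase. */
    static List<String> sort(Iterator<String> input, int maxInMemory, Path spillDir) {
        List<Path> runs = new ArrayList<>();
        try {
            // Sort phase: fill the buffer, sort it, spill it as a run, start over.
            List<String> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= maxInMemory) {
                    runs.add(spill(buffer, spillDir));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                runs.add(spill(buffer, spillDir));
            }
            // Merge phase: repeatedly emit the smallest head among all runs.
            PriorityQueue<RunCursor> heap = new PriorityQueue<>();
            List<String> out = new ArrayList<>(); // collected only to keep the sketch short
            try {
                for (Path run : runs) {
                    heap.add(new RunCursor(Files.newBufferedReader(run, StandardCharsets.UTF_8)));
                }
                while (!heap.isEmpty()) {
                    RunCursor cursor = heap.poll();
                    out.add(cursor.head);
                    if (cursor.advance()) {
                        heap.add(cursor);
                    } else {
                        cursor.reader.close();
                    }
                }
            } finally {
                for (RunCursor cursor : heap) {
                    cursor.reader.close();
                }
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (Path run : runs) {
                run.toFile().delete(); // best-effort cleanup of spill files
            }
        }
    }

    /** Sorts the buffer in memory and writes it as one run file, one record per line. */
    private static Path spill(List<String> buffer, Path spillDir) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile(spillDir, "run-", ".channel");
        try (BufferedWriter writer = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
            for (String record : buffer) {
                writer.write(record);
                writer.newLine();
            }
        }
        return run;
    }
}
```

Calling `ExternalSortSketch.sort(records.iterator(), 2, spillDir)` on five records spills three runs (two, two and one record) before merging them, which makes the relation between input size and spilled bytes easy to observe on a small input.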