Re: Thread 'SortMerger spilling thread' terminated due to an exception: No space left on device

Fabian Hueske Sun, 04 Dec 2016 23:41:49 -0800

Hi Miguel,

have you found a solution to your problem?
I'm not a Docker expert, but this forum thread looks like it could be related
to your problem [1].
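
To see how much space the container itself has available for the spill
directory, something like the following could be run inside the TaskManager
container. This is only a minimal sketch: it assumes Python 3 is available
there, the helper name free_space_gib is made up for illustration, and
/home/flink/htmp is the tmp dir path from your mail. It reports on the
filesystem backing that path and is not a Flink tool.

```python
import os
import shutil
import sys


def free_space_gib(path):
    """Return (total, used, free) in GiB for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib


if __name__ == "__main__":
    # Default is the tmp dir mentioned in the thread; pass another path as argv[1].
    tmp_dir = sys.argv[1] if len(sys.argv) > 1 else "/home/flink/htmp"
    if os.path.isdir(tmp_dir):
        total, used, free = free_space_gib(tmp_dir)
        print(f"{tmp_dir}: total={total:.1f} GiB used={used:.1f} GiB free={free:.1f} GiB")
    else:
        print(f"{tmp_dir} does not exist in this environment")
```

If the free value seen inside the container is much smaller than what the
host reports for /home/myuser/mydir, that would suggest the volume is not
backed by the filesystem you expect.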


Best,
Fabian

[1] https://forums.docker.com/t/no-space-left-on-device-error/10894
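
On your earlier question about estimating the disk space needed for
spilling: a very rough lower bound is the number of records times the
serialized size of one record. The sketch below uses the roughly 1800 M
edges you mentioned and assumes each edge is serialized as two 8-byte
vertex IDs. That record size is an assumption, since the actual types and
any serialization overhead are not known from this thread. Several
operators may each spill their own sorted copy, so the real requirement may
be a multiple of this number.

```python
def estimate_spill_bytes(num_records, bytes_per_record):
    """Lower-bound size of one fully spilled copy of the data, in bytes."""
    return num_records * bytes_per_record


def to_gib(num_bytes):
    """Convert bytes to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    edges = 1_800_000_000   # "around 1800 M edges", as stated in the thread
    bytes_per_edge = 16     # ASSUMPTION: two 8-byte vertex IDs, no overhead
    one_copy = estimate_spill_bytes(edges, bytes_per_edge)
    print(f"one sorted copy: about {to_gib(one_copy):.1f} GiB")
```

Under these assumed numbers a single spilled copy comes to about 27 GiB, so
budgeting for a few times that amount on the tmp dir would be a cautious
starting point.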

2016-12-02 17:43 GMT+01:00 Miguel Coimbra <miguel.e.coim...@gmail.com>:

> Hello Fabian,
>
> I have created a directory on my host machine user directory (
> /home/myuser/mydir ) and I am mapping it as a volume with Docker for the
> TaskManager and JobManager containers.
> Each container will thus have the following directory /home/flink/htmp
>
> host ---> container
> /home/myuser/mydir ---> /home/flink/htmp
>
> I had previously done this successfully with a host directory which
> holds several SNAP data sets.
> In the Flink configuration file, I specified /home/flink/htmp to be used
> as the tmp dir for the TaskManager.
> This seems to be working, as I was able to start the cluster and invoke
> Flink for that Friendster dataset.
>
> However, during execution, there were 2 intermediate files which kept
> growing until they reached about 30 GB.
> At that point, the Flink TaskManager threw the exception again:
>
> java.lang.RuntimeException: Error obtaining the sorted input: Thread
> 'SortMerger spilling thread' terminated due to an exception: No space left
> on device
>
> Here is an ls excerpt of the directory on the host (to which the
> TaskManager container was also writing successfully) shortly before the
> exception:
>
> 31G 9d177a1971322263f1597c3378885ccf.channel
> 31G a693811249bc5f785a79d1b1b537fe93.channel
>
> Now I believe the host system is capable of storing hundreds of GB more,
> so I am confused as to what the problem might be.
>
> Best regards,
>
> Miguel E. Coimbra
> Email: miguel.e.coim...@gmail.com <miguel.e.coim...@ist.utl.pt>
> Skype: miguel.e.coimbra
>
> 
>>
>> Hi Miguel,
>>
>> the exception does indeed indicate that the process ran out of available
>> disk space.
>> The quoted paragraph of the blog post describes the situation in which
>> you receive the IOException.
>>
>> By default, the system's default tmp dir is used. I don't know which
>> folder that would be in a Docker setup.
>> You can configure the temp dir using the taskmanager.tmp.dirs config key.
>> Please see the configuration documentation for details [1].
>>
>> Hope this helps,
>> Fabian
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/config.html#jobmanager-amp-taskmanager
>>
>> 2016-12-02 0:18 GMT+01:00 Miguel Coimbra <miguel.e.coim...@gmail.com>:
>> 
>>
>>> Hello,
>>>
>>> I have a problem for which I hope someone will be able to give a hint.
>>> I am running the Flink *standalone* cluster in 2 Docker containers (1
>>> JobManager and 1 TaskManager); the TaskManager has 30 GB of RAM.
>>>
>>> The dataset is a large one: SNAP Friendster, which has around 1800 M
>>> edges.
>>> https://snap.stanford.edu/data/com-Friendster.html
>>>
>>> I am trying to run the Gelly built-in label propagation algorithm on top
>>> of it.
>>> As this is a very big dataset, I believe I am exceeding the available
>>> RAM and that the system is using secondary storage, which then fails:
>>>
>>>
>>> Connected to JobManager at Actor[akka.tcp://flink@172.19.0.2:6123/user/jobmanager#894624508]
>>> 12/01/2016 17:58:24    Job execution switched to status RUNNING.
>>> 12/01/2016 17:58:24    DataSource (at main(App.java:33) (org.apache.flink.api.java.io.TupleCsvInputFormat))(1/1) switched to SCHEDULED
>>> 12/01/2016 17:58:24    DataSource (at main(App.java:33) (org.apache.flink.api.java.io.TupleCsvInputFormat))(1/1) switched to DEPLOYING
>>> 12/01/2016 17:58:24    DataSource (at main(App.java:33) (org.apache.flink.api.java.io.TupleCsvInputFormat))(1/1) switched to RUNNING
>>> 12/01/2016 17:58:24    Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to SCHEDULED
>>> 12/01/2016 17:58:24    Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to DEPLOYING
>>> 12/01/2016 17:58:24    Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to RUNNING
>>> 12/01/2016 17:59:51    Map (Map at fromTuple2DataSet(Graph.java:343))(1/1) switched to FAILED
>>> java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: No space left on device
>>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>>>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1098)
>>>     at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:86)
>>>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:486)
>>>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:351)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.io.IOException: Thread 'SortMerger spilling thread' terminated due to an exception: No space left on device
>>>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
>>> Caused by: java.io.IOException: No space left on device
>>>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>>     at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>>>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>>     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>>     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>>>     at org.apache.flink.runtime.io.disk.iomanager.SegmentWriteRequest.write(AsynchronousFileIOChannel.java:344)
>>>     at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync$WriterThread.run(IOManagerAsync.java:502)
>>>
>>>
>>> I do not have secondary storage limitations on the host system, so I
>>> believe the system would be able to handle whatever is spilled to the
>>> disk...
>>> Perhaps this is a Docker limitation regarding the usage of the host's
>>> secondary storage?
>>>
>>> Or is there perhaps some configuration or setting for the TaskManager
>>> which I am missing?
>>> Running the label propagation of Gelly on this dataset and cluster
>>> configuration, what would be the expected behavior if the system consumes
>>> all the memory?
>>>
>>>
>>> I believe the SortMerger thread is associated with the following mechanism
>>> described in this blog post:
>>>
>>> https://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
>>> *The Sort-Merge-Join works by first sorting both input data sets on
>>> their join key attributes (Sort Phase) and merging the sorted data sets as
>>> a second step (Merge Phase). The sort is done in-memory if the local
>>> partition of a data set is small enough. Otherwise, an external merge-sort
>>> is done by collecting data until the working memory is filled, sorting it,
>>> writing the sorted data to the local filesystem, and starting over by
>>> filling the working memory again with more incoming data. After all input
>>> data has been received, sorted, and written as sorted runs to the local
>>> file system, a fully sorted stream can be obtained. This is done by reading
>>> the partially sorted runs from the local filesystem and sort-merging the
>>> records on the fly. Once the sorted streams of both inputs are available,
>>> both streams are sequentially read and merge-joined in a zig-zag fashion by
>>> comparing the sorted join key attributes, building join element pairs for
>>> matching keys, and advancing the sorted stream with the lower join key.*
>>>
>>> I am still investigating the possibility that Docker is at fault
>>> regarding secondary storage limitations, but how would I go about
>>> estimating the amount of disk space needed for this spilling on this
>>> dataset?
>>>
>>> Thanks for your time,
>>>
>>> My best regards,
>>>
>>> Miguel E. Coimbra
>>> Email: miguel.e.coim...@gmail.com <miguel.e.coim...@ist.utl.pt>
>>> Skype: miguel.e.coimbra
>>
>>
>
