Hi,
thanks for your quick answers.

I looked at what was being written on disk and found that a folder called
blockmgr-d0236c76-7f7c-4a60-a6ae-ffc622b2db84 was growing every
second. This folder contained shuffle data and was never cleaned up:
after 30 minutes of my application running, it still contained the
shuffled data from the very beginning.
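
In case it helps, a quick way to watch that folder's size from Scala could
look like this (just a sketch; the /tmp path below is an example, the folder
actually sits under whatever spark.local.dir points to):

import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Sum the sizes of all regular files under the block manager folder,
// so its growth can be tracked over time.
def dirSize(dir: Path): Long = {
  val stream = Files.walk(dir)
  try {
    stream.iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .map(p => Files.size(p))
      .sum
  } finally {
    stream.close()
  }
}

val blockMgrDir = Paths.get("/tmp/blockmgr-d0236c76-7f7c-4a60-a6ae-ffc622b2db84")
println(s"blockmgr size: ${dirSize(blockMgrDir)} bytes")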

I looked online for this behaviour and found
https://issues.apache.org/jira/browse/SPARK-5836. In the pull request,
under the "Files changed" tab, there is this note:
« Shuffle also generates a large number of intermediate files on disk.
As of Spark 1.3, these files are not cleaned up from Spark's temporary
storage until Spark is stopped, which means that long-running Spark
jobs may consume available disk space. This is done so the shuffle
doesn't need to be re-computed if the lineage is re-computed. ».

Concerning this, I found some issues on Apache's Jira advising to
trigger the garbage collector on the driver to get the shuffle data
on disk cleaned up. For now I added a thread to my application that
calls System.gc() every 5 minutes, and it seems to do the trick:
http://i.imgur.com/efEGaNS.png (I started the app around 11:00).
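
For reference, the workaround is essentially just this (a minimal sketch;
the scheduled executor is simply one way to run a periodic task, any plain
thread with a sleep loop would do the same):

import java.util.concurrent.{Executors, TimeUnit}

// Runs on the driver. Forcing a GC lets the ContextCleaner's weak references
// get processed, which in turn triggers cleanup of the shuffle files on disk.
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, 5, 5, TimeUnit.MINUTES)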

Furthermore, I checked the driver/executor logs (before adding the
garbage collection thread) for the potential cleaning errors you
listed, and there was nothing to be seen there. However, the
ContextCleaner was only logging Broadcast removals, never any
RDD/Shuffle ones. After adding the garbage collection thread, I did
see some Shuffle cleanups in the driver logs.

The thread running the garbage collection seems to work, but it looks
a bit ugly to me. Do you have any idea how I could remove the
shuffled data more cleanly?


Thanks a lot,
NM

On Sun, Mar 29, 2015 at 5:50 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Nathan:
> Please look in log files for any of the following:
> doCleanupRDD():
>       case e: Exception => logError("Error cleaning RDD " + rddId, e)
> doCleanupShuffle():
>       case e: Exception => logError("Error cleaning shuffle " + shuffleId, e)
> doCleanupBroadcast():
>       case e: Exception => logError("Error cleaning broadcast " + broadcastId, e)
>
> Cheers
>
> On Sun, Mar 29, 2015 at 7:55 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>>
>> Try these:
>>
>> - Disable shuffle : spark.shuffle.spill=false (It might end up in OOM)
>> - Enable log rotation:
>>
>> sparkConf.set("spark.executor.logs.rolling.strategy", "size")
>> .set("spark.executor.logs.rolling.size.maxBytes", "1024")
>> .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
>>
>>
>> Also see what's really getting filled on disk.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Mar 28, 2015 at 8:18 PM, Nathan Marin <nathan.ma...@teads.tv>
>> wrote:
>>>
>>> Hi,
>>>
>>> I’ve been trying to use Spark Streaming for my real-time analysis
>>> application, using the Kafka Stream API, on a YARN cluster of 6
>>> executors with 4 dedicated cores and 8192 MB of dedicated RAM.
>>>
>>> The thing is, my application should run 24/7, but disk usage keeps
>>> growing. This leads to exceptions when Spark tries to write to a
>>> file system with no space left.
>>>
>>> Here are some graphs showing the disk space remaining on a node where
>>> my application is deployed:
>>> http://i.imgur.com/vdPXCP0.png
>>> The "drops" occurred at 3-minute intervals.
>>>
>>> The Disk Usage goes back to normal once I kill my application:
>>> http://i.imgur.com/ERZs2Cj.png
>>>
>>> The persistence level of my RDDs is MEMORY_AND_DISK_SER_2, but even
>>> when I tried MEMORY_ONLY_SER_2 the same thing happened (that mode
>>> shouldn't even allow Spark to write to disk, right?).
>>>
>>> My question is: how can I force Spark (Streaming?) to remove whatever
>>> it stores immediately after it has processed it? Obviously the disk
>>> isn't being cleaned up (even though the memory is), despite me calling
>>> rdd.unpersist() on every processed RDD.
>>>
>>> Here’s a sample of my application code:
>>> http://pastebin.com/K86LE1J6
>>>
>>> Maybe something is wrong in my app too?
>>>
>>> Thanks for your help,
>>> NM
>>>
>>
>>
>
