Re: /tmp fills up to 100GB when using a window function

2017-12-20 Thread Vadim Semenov
Ah, yes, I missed that part

it's `spark.local.dir`

spark.local.dir (default: /tmp)
Directory to use for "scratch" space in Spark, including map output files and
RDDs that get stored on disk. This should be on a fast, local disk in your
system. It can also be a comma-separated list of multiple directories on
different disks. NOTE: In Spark 1.0 and later this will be overridden by
SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment
variables set by the cluster manager.
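
For example, a minimal PySpark sketch of pointing it somewhere else, assuming
no SparkContext has been created yet and using hypothetical /mnt/spark-scratch*
directories:

from pyspark.sql import SparkSession

# Hypothetical scratch directories; point these at large, fast local disks.
# spark.local.dir has to be set before the SparkContext is created, and on
# YARN it is overridden by the LOCAL_DIRS variable set by the cluster manager.
spark = (
    SparkSession.builder
    .appName("window-example")
    .config("spark.local.dir", "/mnt/spark-scratch1,/mnt/spark-scratch2")
    .getOrCreate()
)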

On Wed, Dec 20, 2017 at 2:58 PM, Gourav Sengupta wrote:

> I do think that there is an option to set the temporary shuffle location
> to a particular directory. While working with EMR I set it to /mnt1/. Let
> me know in case you are not able to find it.
>
> On Mon, Dec 18, 2017 at 8:10 PM, Mihai Iacob  wrote:
>
>> This code generates files under /tmp...blockmgr... which do not get
>> cleaned up after the job finishes.
>>
>> Anything wrong with the code below? Or are there any known issues with
>> Spark not cleaning up /tmp files?
>>
>>
>> window = Window.\
>>   partitionBy('***', 'date_str').\
>>   orderBy(sqlDf['***'])
>>
>> sqlDf = sqlDf.withColumn("***",rank().over(window))
>> df_w_least = sqlDf.filter("***=1")
>>
>>
>>
>>
>>
>> Regards,
>>
>> *Mihai Iacob*
>> DSX Local  - Security, IBM Analytics
>>


Re: /tmp fills up to 100GB when using a window function

2017-12-20 Thread Gourav Sengupta
I do think that there is an option to set the temporary shuffle location to
a particular directory. While working with EMR I set it to /mnt1/. Let me
know in case you are not able to find it.

On Mon, Dec 18, 2017 at 8:10 PM, Mihai Iacob  wrote:

> This code generates files under /tmp...blockmgr... which do not get
> cleaned up after the job finishes.
>
> Anything wrong with the code below? Or are there any known issues with
> Spark not cleaning up /tmp files?
>
>
> window = Window.\
>   partitionBy('***', 'date_str').\
>   orderBy(sqlDf['***'])
>
> sqlDf = sqlDf.withColumn("***",rank().over(window))
> df_w_least = sqlDf.filter("***=1")
>
>
>
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local  - Security, IBM Analytics
>


Re: /tmp fills up to 100GB when using a window function

2017-12-19 Thread Vadim Semenov
Not until an action completes (e.g., save/count/reduce), or until you
explicitly truncate the DAG by checkpointing.

Spark needs to keep all shuffle files because, if some task/stage/node fails,
it only has to recompute the missing partitions from the parts that have
already been computed.
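
As a minimal sketch of the checkpointing route, with a hypothetical checkpoint
directory and hypothetical column names standing in for the sqlDf from the
original question:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()
# Checkpoints need a reliable directory; this path is only an example.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

# Hypothetical stand-in for the sqlDf in the original question.
sqlDf = spark.createDataFrame(
    [("a", "2017-12-18", 3), ("a", "2017-12-18", 1), ("b", "2017-12-18", 2)],
    ["group_col", "date_str", "sort_col"],
)

window = Window.partitionBy("group_col", "date_str").orderBy(sqlDf["sort_col"])
ranked = sqlDf.withColumn("rnk", rank().over(window))

# checkpoint() is eager by default: it materializes the DataFrame and truncates
# its lineage, so the shuffle output behind it is no longer needed for recovery.
df_w_least = ranked.checkpoint().filter("rnk = 1")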

On Tue, Dec 19, 2017 at 10:08 AM, Mihai Iacob <mia...@ca.ibm.com> wrote:

> When does Spark remove them?
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics
>
>
>
> - Original message -
> From: Vadim Semenov <vadim.seme...@datadoghq.com>
> To: Mihai Iacob <mia...@ca.ibm.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: /tmp fills up to 100GB when using a window function
> Date: Tue, Dec 19, 2017 9:46 AM
>
> Spark doesn't remove intermediate shuffle files if they're part of the
> same job.
>
> On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob <mia...@ca.ibm.com> wrote:
>
> This code generates files under /tmp...blockmgr... which do not get
> cleaned up after the job finishes.
>
> Anything wrong with the code below? Or are there any known issues with
> Spark not cleaning up /tmp files?
>
> window = Window.\
>   partitionBy('***', 'date_str').\
>   orderBy(sqlDf['***'])
>
> sqlDf = sqlDf.withColumn("***",rank().over(window))
> df_w_least = sqlDf.filter("***=1")
>
>
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics
>


Re: /tmp fills up to 100GB when using a window function

2017-12-19 Thread Mihai Iacob
When does Spark remove them?
 
Regards, 

Mihai Iacob
DSX Local - Security, IBM Analytics
 
 
- Original message -
From: Vadim Semenov
To: Mihai Iacob
Cc: user
Subject: Re: /tmp fills up to 100GB when using a window function
Date: Tue, Dec 19, 2017 9:46 AM
Spark doesn't remove intermediate shuffle files if they're part of the same job.
 
On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob  wrote:

This code generates files under /tmp...blockmgr... which do not get cleaned up after the job finishes.
 
Anything wrong with the code below? Or are there any known issues with Spark not cleaning up /tmp files?
 
window = Window.\
  partitionBy('***', 'date_str').\
  orderBy(sqlDf['***'])

sqlDf = sqlDf.withColumn("***",rank().over(window))
df_w_least = sqlDf.filter("***=1")
 
 
 
Regards, 

Mihai Iacob
DSX Local - Security, IBM Analytics
 





Re: /tmp fills up to 100GB when using a window function

2017-12-19 Thread Vadim Semenov
Spark doesn't remove intermediate shuffle files if they're part of the same
job.

On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob  wrote:

> This code generates files under /tmp...blockmgr... which do not get
> cleaned up after the job finishes.
>
> Anything wrong with the code below? Or are there any known issues with
> Spark not cleaning up /tmp files?
>
>
> window = Window.\
>   partitionBy('***', 'date_str').\
>   orderBy(sqlDf['***'])
>
> sqlDf = sqlDf.withColumn("***",rank().over(window))
> df_w_least = sqlDf.filter("***=1")
>
>
>
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local  - Security, IBM Analytics
>


/tmp fills up to 100GB when using a window function

2017-12-18 Thread Mihai Iacob
This code generates files under /tmp...blockmgr... which do not get cleaned up after the job finishes.
 
Anything wrong with the code below? Or are there any known issues with Spark not cleaning up /tmp files?
 
window = Window.\
  partitionBy('***', 'date_str').\
  orderBy(sqlDf['***'])

sqlDf = sqlDf.withColumn("***",rank().over(window))
df_w_least = sqlDf.filter("***=1")
 
 
 
Regards, 

Mihai Iacob
DSX Local - Security, IBM Analytics

