Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-20 Thread Yong Zhang
If you have 2 different RDDs (as the 2 different references and RDD IDs in your
example show), then YES, Spark will cache 2 copies of exactly the same data in memory.


There is no way for Spark to compare them and know that they hold the same
content. You defined them as 2 RDDs, so they are different RDDs and will be
cached individually.
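
One way to check this empirically (a minimal sketch, assuming a spark-shell
style session where sc and sqlContext are already in scope): cache both
references, run an action on each, then inspect the RDD ids and the context's
persistent-RDD registry, or the Storage tab of the web UI.

val df1 = sqlContext.table("employee")
df1.cache()
df1.count()                          // action that actually materializes the cache

val df2 = sqlContext.table("employee")
df2.cache()
df2.count()

println(df1.rdd.id)                  // the two references expose different RDD ids
println(df2.rdd.id)
println(sc.getPersistentRDDs.size)   // how many RDDs the context tracks as persistent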


Yong



From: Taotao.Li 
Sent: Sunday, November 20, 2016 6:18 AM
To: Rabin Banerjee
Cc: Yong Zhang; user; Mich Talebzadeh; Tathagata Das
Subject: Re: Will spark cache table once even if I call read/cache on the same 
table multiple times

Hi, you can check my Stack Overflow question:
http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812

On Sat, Nov 19, 2016 at 3:16 AM, Rabin Banerjee wrote:
Hi Yong,

  But every time val tabdf = sqlContext.table(tablename) is called, tabdf.rdd
has a new id, which can be checked by calling tabdf.rdd.id.
And,
https://github.com/apache/spark/blob/b6de0c98c70960a97b07615b0b08fbd8f900fbe7/core/src/main/scala/org/apache/spark/SparkContext.scala#L268

Spark maintains a Map of [RDD_ID, RDD]. As the RDD id changes, will Spark
cache the same data again and again??

For example ,

val tabdf = sqlContext.table("employee")
tabdf.cache()
tabdf.someTransformation.someAction
println(tabdf.rdd.id)
val tabdf1 = sqlContext.table("employee")
tabdf1.cache() // <= Will Spark again go to disk, read, and load the data into memory, or look into the cache?
tabdf1.someTransformation.someAction
println(tabdf1.rdd.id)

Regards,
R Banerjee




On Fri, Nov 18, 2016 at 9:14 PM, Yong Zhang wrote:

That's correct, as long as you don't change the StorageLevel.


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166
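
The persist logic behind that link boils down to the following behavior
(a minimal sketch; the RDD here is just illustrative):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)
rdd.persist(StorageLevel.MEMORY_ONLY)      // first call assigns the storage level
rdd.persist(StorageLevel.MEMORY_ONLY)      // same level again: harmless no-op
// rdd.persist(StorageLevel.MEMORY_AND_DISK)
// ^ a *different* level throws UnsupportedOperationException
//   ("Cannot change storage level of an RDD after it was already assigned a level")

To change the level you would have to unpersist() first.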



Yong


From: Rabin Banerjee
Sent: Friday, November 18, 2016 10:36 AM
To: user; Mich Talebzadeh; Tathagata Das
Subject: Will spark cache table once even if I call read/cache on the same 
table multiple times

Hi All,

  I am working on a project where the code is divided into multiple reusable
modules. I am not able to understand Spark persist/cache in that context.

My question is: will Spark cache a table once even if I call read/cache on the
same table multiple times??

 Sample Code ::

  TableReader::

  def getTableDF(tablename: String, persist: Boolean = false): DataFrame = {
    val tabdf = sqlContext.table(tablename)
    if (persist) {
      tabdf.cache()
    }
    tabdf
  }

 Now
Module1::
 val emp = TableReader.getTableDF("employee")
 emp.someTransformation.someAction

Module2::
 val emp = TableReader.getTableDF("employee")
 emp.someTransformation.someAction



ModuleN::
 val emp = TableReader.getTableDF("employee")
 emp.someTransformation.someAction

Will Spark cache the emp table once, or will it cache it every time I call it?
Shall I maintain a global hashmap to handle that? Something like
Map[String, DataFrame]??
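
Something like the sketch below is what I have in mind (illustrative only;
sqlContext is assumed to be in scope, and it is not thread-safe as written,
so a concurrent map would be needed if modules run in parallel):

import scala.collection.mutable
import org.apache.spark.sql.DataFrame

object TableReader {
  // One DataFrame per table name, so every module reuses the same
  // (possibly cached) reference instead of building a new one.
  private val tables = mutable.Map.empty[String, DataFrame]

  def getTableDF(tablename: String, persist: Boolean = false): DataFrame =
    tables.getOrElseUpdate(tablename, {
      val tabdf = sqlContext.table(tablename)
      if (persist) tabdf.cache()
      tabdf
    })
}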

 Regards,
Rabin Banerjee







--
___
Quant | Engineer | Boy
___
blog:
http://litaotao.github.io
github: www.github.com/litaotao

