Hi,

Sorry for asking this rather naïve question.

I am trying to understand the notion of serialisation in Spark and where
things can or cannot be serialised. Does this generally refer to the concept
of serialisation in the context of data storage?

In this context, for example with reference to RDD operations, is it the
process of translating an object's state into a format that can be stored
in and retrieved from a memory buffer?
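
For example, is this the kind of serialisation involved in the minimal
sketch below? (The class and variable names are made up for illustration.)

import org.apache.spark.{SparkConf, SparkContext}

object SerialisationSketch {

  // A helper whose instance is captured by the closure passed to map().
  // It has to be serialisable because Spark serialises the closure, and
  // everything it references, on the driver and ships it to the executors.
  class Multiplier(val factor: Int) extends Serializable {
    def apply(x: Int): Int = x * factor
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serialisation-sketch"))
    val multiplier = new Multiplier(3)

    // The function x => multiplier(x) is serialised together with the
    // captured multiplier, sent to the executors and run there.
    val result = sc.parallelize(1 to 10).map(x => multiplier(x)).collect()
    println(result.mkString(","))

    sc.stop()
  }
}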

Thanks




Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wrote:

> It is the driver that has the info needed to schedule and manage
> distributed jobs and that is by design.
>
> This is narrowly about using the HiveContext or SparkContext directly. Of
> course SQL operations are distributed.
>
>
> On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Sean,
>>
>> Your point:
>>
>> "You can't use the HiveContext or SparkContext in a distribution
>> operation..."
>>
>> Is this because of a design issue?
>>
>> Case in point: if I create a DF from an RDD and register it as a tempTable,
>> does this imply that any SQL calls on that table are localised and not
>> distributed among the executors?
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>>
>> Sean, thank you for making it clear. It was helpful.
>>
>> Regards,
>> Ajay
>>
>>
>> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>>
>> This usage is fine, because you are only using the HiveContext locally on
>> the driver. It's applied in a function that's used on a Scala collection.
>>
>> You can't use the HiveContext or SparkContext in a distributed
>> operation. It has nothing to do with for loops.
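>>
>> For illustration, a rough sketch of the distinction, using the deDF and
>> calculate from the code further down in this thread:
>>
>> // Fine: collect() brings the rows back to the driver, so this foreach
>> // is a plain Scala loop and hiveContext is only used locally.
>> deDF.collect().foreach(calculate)
>>
>> // Not fine: foreach on the DataFrame itself runs the closure on the
>> // executors, and hiveContext/SparkContext cannot be used there.
>> // deDF.foreach(calculate)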
>>
>> The fact that they're serializable is misleading. Serializability is
>> there, I believe, because these objects may be inadvertently referenced in
>> the closure of a function that executes remotely, yet doesn't use the
>> context. The closure cleaner can't always remove this reference, and
>> without the Serializable marker the task would fail to serialize even
>> though it doesn't use the context. You will find these objects serialize
>> but then don't work if used remotely.
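>>
>> A minimal sketch of the kind of inadvertent reference meant here (the Job
>> class and its fields are made up for illustration):
>>
>> import org.apache.spark.rdd.RDD
>> import org.apache.spark.sql.hive.HiveContext
>>
>> class Job(val hiveContext: HiveContext) extends Serializable {
>>   val threshold = 10
>>
>>   def run(rdd: RDD[Int]): RDD[Int] =
>>     // 'threshold' is a field, so the closure captures 'this' and, with it,
>>     // the hiveContext field, even though the function never uses it.
>>     // HiveContext being Serializable lets the task serialise anyway;
>>     // actually calling hiveContext inside the closure would still compile
>>     // and serialise, but fail at runtime on the executors.
>>     rdd.filter(x => x > threshold)
>> }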
>>
>> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
>> IIRC.
>>
>> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>>
>> Hi Everyone,
>>
>> I was wondering whether I can use hiveContext inside foreach like below,
>>
>> object Test {
>>   def main(args: Array[String]): Unit = {
>>
>>     val conf = new SparkConf()
>>     val sc = new SparkContext(conf)
>>     val hiveContext = new HiveContext(sc)
>>
>>     val dataElementsFile = args(0)
>>     val deDF = 
>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>
>>     def calculate(de: Row) {
>>       val dataElement = de.getAs[String]("DataElement").trim
>>       val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>> TEST_DB.TEST_TABLE1 ")
>>       df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>     }
>>
>>     deDF.collect().foreach(calculate)
>>   }
>> }
>>
>>
>> I looked at 
>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>  and I see it is extending SqlContext which extends Logging with 
>> Serializable.
>>
>> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>>
>> Regards,
>>
>> Ajay
>>
>>
>>
