Hi Sean,

Your point:

"You can't use the HiveContext or SparkContext in a distribution
operation..."

Is this because of design issue?

Case in point if I created a DF from RDD and register it as a tempTable,
does this imply that any sql calls on that table islocalised and not
distributed among executors?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:

> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
>> This usage is fine, because you are only using the HiveContext locally on
>> the driver. It's applied in a function that's used on a Scala collection.
>>
>> You can't use the HiveContext or SparkContext in a distribution
>> operation. It has nothing to do with for loops.
>>
>> The fact that they're serializable is misleading. It's there, I believe,
>> because these objects may be inadvertently referenced in the closure of a
>> function that executes remotely, yet doesn't use the context. The closure
>> cleaner can't always remove this reference. The task would fail to
>> serialize even though it doesn't use the context. You will find these
>> objects serialize but then don't work if used remotely.
>>
>> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
>> IIRC.
>>
>> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I was thinking if I can use hiveContext inside foreach like below,
>>>
>>> object Test {
>>>   def main(args: Array[String]): Unit = {
>>>
>>>     val conf = new SparkConf()
>>>     val sc = new SparkContext(conf)
>>>     val hiveContext = new HiveContext(sc)
>>>
>>>     val dataElementsFile = args(0)
>>>     val deDF = 
>>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>>
>>>     def calculate(de: Row) {
>>>       val dataElement = de.getAs[String]("DataElement").trim
>>>       val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>>> TEST_DB.TEST_TABLE1 ")
>>>       df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>>     }
>>>
>>>     deDF.collect().foreach(calculate)
>>>   }
>>> }
>>>
>>>
>>> I looked at 
>>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>>  and I see it is extending SqlContext which extends Logging with 
>>> Serializable.
>>>
>>> Can anyone tell me if this is the right way to use it ? Thanks for your 
>>> time.
>>>
>>> Regards,
>>>
>>> Ajay
>>>
>>>

Reply via email to