Many thanks, Pol.

As it happens I was doing a workaround with numRows = 10. In general it
is bad practice to hard-code constants within the code; for the same
reason we ought not to embed URLs in the PySpark program itself.

What I did was to add numRows to the YAML file that is read at startup.
It is simply the number of rows of random numbers to be generated in
PySpark and posted to the database. That is the solution I adopted.
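
For illustration, a minimal sketch of that approach with PyYAML (the file
name and the key are hypothetical):

import yaml

# read the configuration once at startup, on the driver
with open("config.yml") as f:
    config = yaml.safe_load(f)

numRows = config["numRows"]  # e.g. "numRows: 10" in config.yml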

Cheers,

Mich



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Dec 2021 at 12:03, Pol Santamaria <p...@qbeast.io> wrote:

> To me it looks like you are accessing "self" on the workers by using
> "self.numRows" inside the map. As a consequence, "self" needs to be
> serialized, and since it has an attribute referencing the SparkContext,
> Spark tries to serialize the context as well and fails.
>
> It can be solved in different ways, for instance by avoiding the use of
> "self" in the map, as you did in the last snippet, or by keeping the
> Spark context / session in a different class from "numRows", as sketched
> below.
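>
> For illustration, a minimal sketch of that second option, assuming a
> SparkSession named "spark" (the class names here are hypothetical):
>
> # keep the session/context in a class that closures never reference
> class SparkHolder:
>     def __init__(self, spark_session):
>         self.spark = spark_session
>         self.sc = spark_session.sparkContext
>
> # this class holds only plain values, so serializing it is harmless
> class RandomData:
>     def __init__(self, numRows):
>         self.numRows = numRows
>
> holder = SparkHolder(spark)
> data = RandomData(10)
> # the lambda captures "data" (plain values only), not the context
> rdd = holder.sc.parallelize(range(100)). \
>     map(lambda x: (x, data.numRows))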
>
> Bests,
>
> Pol Santamaria
>
>
> On Wed, Dec 15, 2021 at 12:24 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I define the instance variable self.numRows = 10 to be available to all
>> methods of this class, as below:
>>
>> class RandomData:
>>     def __init__(self, spark_session, spark_context, config):
>>         self.spark = spark_session
>>         self.sc = spark_context
>>         self.config = config
>>         self.values = dict()
>>         self.numRows = 10
>>
>> In another method of the same class, I use a lambda function to generate
>> random values:
>>
>>     def generateRandomData(self):
>>         rdd = self.sc.parallelize(Range). \
>>             map(lambda x: (x, uf.clustered(x, self.numRows), \
>>
>> This fails with the error below:
>>
>> Could not serialize object: Exception: It appears that you are attempting
>> to reference SparkContext from a broadcast variable, action, or
>> transformation. SparkContext can only be used on the driver, not in code
>> that it run on workers. For more information, see SPARK-5063.
>>
>> However, this works if I assign self.numRows to a local variable in that
>> method, as below:
>>
>>
>>         numRows = self.numRows
>>         rdd = self.sc.parallelize(Range). \
>>             map(lambda x: (x, uf.clustered(x, numRows), \
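>>
>> For reference, a sketch of the complete working method (the closing
>> parenthesis and the return are my completion of the truncated snippet;
>> Range and uf.clustered are defined elsewhere in my code):
>>
>>     def generateRandomData(self):
>>         numRows = self.numRows
>>         rdd = self.sc.parallelize(Range). \
>>             map(lambda x: (x, uf.clustered(x, numRows)))
>>         return rdd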
>>
>>
>>
>> Any better explanation?
>>
>>
>> Thanks
>>
>>
>
