Re: Using unserializable classes in tasks

2015-08-25 Thread Akhil Das
Instead of foreach, try foreachPartition, which will initialize the
connector once per partition rather than once per record.
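
A rough sketch of that pattern (Connector here is a hypothetical stand-in for
whatever non-serializable client you use, e.g. a Cassandra session wrapper,
assumed to have save and close methods):

  sc.parallelize(1 to 100).foreachPartition { records =>
    val connector = new Connector()          // built once per partition, on the executor
    try records.foreach(x => connector.save(x))
    finally connector.close()                // release the resource when the partition is done
  }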

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:13 PM, Dawid Wysakowicz <
wysakowicz.da...@gmail.com> wrote:

> No, the connector does not need to be serializable, because it is constructed
> on the worker. Only objects shuffled across partitions need to be
> serializable.
>
> 2015-08-14 9:40 GMT+02:00 mark :
>
>> I guess I'm looking for a more general way to use complex graphs of
>> objects that cannot be serialized in a task executing on a worker, not just
>> DB connectors. Something like shipping jars to the worker maybe?
>>
>> I'm not sure I understand how your foreach example solves the issue - the
>> Connector there would still need to be serializable surely?
>>
>> Thanks
>> On 14 Aug 2015 8:32 am, "Dawid Wysakowicz" wrote:
>>
>>> I am not an expert, but first of all check whether there is already a ready
>>> connector (you mentioned Cassandra - check spark-cassandra-connector:
>>> <https://github.com/datastax/spark-cassandra-connector>).
>>>
>>> If you really want to do something on your own, all objects constructed in
>>> the passed function will be allocated on the worker. For example:
>>>
>>> sc.parallelize(1 to 100).foreach(x => new Connector().save(x))
>>>
>>> but this way you allocate resources once per record.
>>>
>>> 2015-08-14 9:05 GMT+02:00 mark :
>>>
 I have a Spark job that computes some values and needs to write those
 values to a data store. The classes that write to the data store are not
 serializable (eg, Cassandra session objects etc).

 I don't want to collect all the results at the driver, I want each
 worker to write the data - what is the suggested approach for using code
 that can't be serialized in a task?

>>>
>>>
>


Re: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
No, the connector does not need to be serializable, because it is constructed
on the worker. Only objects shuffled across partitions need to be
serializable.

2015-08-14 9:40 GMT+02:00 mark :

> I guess I'm looking for a more general way to use complex graphs of
> objects that cannot be serialized in a task executing on a worker, not just
> DB connectors. Something like shipping jars to the worker maybe?
>
> I'm not sure I understand how your foreach example solves the issue - the
> Connector there would still need to be serializable surely?
>
> Thanks
> On 14 Aug 2015 8:32 am, "Dawid Wysakowicz" wrote:
>
>> I am not an expert, but first of all check whether there is already a ready
>> connector (you mentioned Cassandra - check spark-cassandra-connector:
>> <https://github.com/datastax/spark-cassandra-connector>).
>>
>> If you really want to do something on your own, all objects constructed in
>> the passed function will be allocated on the worker. For example:
>>
>> sc.parallelize(1 to 100).foreach(x => new Connector().save(x))
>>
>> but this way you allocate resources once per record.
>>
>> 2015-08-14 9:05 GMT+02:00 mark :
>>
>>> I have a Spark job that computes some values and needs to write those
>>> values to a data store. The classes that write to the data store are not
>>> serializable (eg, Cassandra session objects etc).
>>>
>>> I don't want to collect all the results at the driver, I want each
>>> worker to write the data - what is the suggested approach for using code
>>> that can't be serialized in a task?
>>>
>>
>>


Fwd: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
-- Forwarded message --
From: Dawid Wysakowicz 
Date: 2015-08-14 9:32 GMT+02:00
Subject: Re: Using unserializable classes in tasks
To: mark 


I am not an expert, but first of all check whether there is already a ready
connector (you mentioned Cassandra - check spark-cassandra-connector:
<https://github.com/datastax/spark-cassandra-connector>).

If you really want to do something on your own, all objects constructed in the
passed function will be allocated on the worker. For example:

sc.parallelize(1 to 100).foreach(x => new Connector().save(x))

but this way you allocate resources once per record.
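
To make that concrete, a hedged sketch (Connector is a hypothetical
non-serializable class, not a real library API):

  class Connector {                          // note: does NOT extend Serializable
    def save(x: Int): Unit = println(s"saving $x")   // stand-in for the real write
  }

  // Typically fails: the connector is built on the driver and captured by the
  // closure, so Spark tries to serialize it and reports "Task not serializable".
  // val driverSide = new Connector()
  // sc.parallelize(1 to 100).foreach(x => driverSide.save(x))

  // Works: only the small closure is serialized; the Connector is constructed
  // on the worker when the closure runs, so it never has to travel.
  sc.parallelize(1 to 100).foreach(x => new Connector().save(x))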

2015-08-14 9:05 GMT+02:00 mark :

> I have a Spark job that computes some values and needs to write those
> values to a data store. The classes that write to the data store are not
> serializable (eg, Cassandra session objects etc).
>
> I don't want to collect all the results at the driver, I want each worker
> to write the data - what is the suggested approach for using code that
> can't be serialized in a task?
>


Using unserializable classes in tasks

2015-08-14 Thread mark
I have a Spark job that computes some values and needs to write those
values to a data store. The classes that write to the data store are not
serializable (eg, Cassandra session objects etc).

I don't want to collect all the results at the driver, I want each worker
to write the data - what is the suggested approach for using code that
can't be serialized in a task?