Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
Because it happens to reference something outside the closure's scope, which
in turn references other objects (that you don't need) and so on, resulting
in a lot of things you don't want being serialized along with your task. But
sure, it is debatable and it's more my personal opinion.
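
To make it concrete, here is the kind of accidental capture I mean -- a
minimal Scala sketch, all class and field names made up:

import org.apache.spark.rdd.RDD

// Hypothetical driver-side class; it is NOT Serializable.
class ReportBuilder {
  val hugeLookup = new java.util.HashMap[String, Int]()  // big object never needed on the workers
  val prefix = "row-"

  def label(lines: RDD[String]): RDD[String] =
    // 'prefix' is a field, so the closure captures 'this' (the whole
    // ReportBuilder, hugeLookup included). With the default Java closure
    // serializer this fails fast with a NotSerializableException, which is
    // exactly how you notice the unwanted reference.
    lines.map(line => prefix + line)
}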


2014-04-17 23:28 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo
 serialization for closures? Is there any problem with that?
 On Apr 17, 2014 11:10 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 You have two kinds of serialization: data and closures. They both use Java
 serialization by default. This means that if in your function you reference
 an object outside of it, that object gets serialized with your task. To
 enable Kryo serialization for closures, set the spark.closure.serializer
 property. But usually I don't, as the default (Java serialization) lets me
 detect such unwanted references.
 On Apr 17, 2014 10:17 PM, Flavio Pompermaier pomperma...@okkam.it wrote:

 Now I have another problem... I have to pass one of these non-serializable
 objects to a PairFunction and I received another non-serializable
 exception... it seems that Kryo doesn't work within Functions. Am I wrong,
 or is this a limit of Spark?
 On Apr 15, 2014 1:36 PM, Flavio Pompermaier pomperma...@okkam.it
 wrote:

 Ok thanks for the help!

 Best,
 Flavio


 On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Nope, those operations are lazy, meaning they will create the RDDs but
 won't trigger any action. The computation is launched by operations such
 as collect, count, save to HDFS etc. And even if they were not lazy, no
 serialization would happen. Serialization occurs only when data will be
 transferred (collect, shuffle, maybe persist to disk - but I am not sure
 about this one).


 2014-04-15 0:34 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Ok, that's fair enough. But why do things work up to the collect? During
 map and filter, are objects not serialized?
  On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com
 wrote:

 Sure. As you have pointed out, those classes don't implement
 Serializable, and Spark uses Java serialization by default (when you do
 collect, the data from the workers will be serialized, collected by the
 driver and then deserialized on the driver side). Kryo (like most other
 decent serialization libs) doesn't require you to implement Serializable.

 For the missing attributes, it's due to the fact that Java serialization
 does not serialize/deserialize attributes from classes that don't
 implement Serializable (in your case, the parent classes).


 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it
 :

 Thanks Eugen for the reply. Could you explain to me why I have the
 problem? Why doesn't my serialization work?
 On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com
 wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier 
 pomperma...@okkam.it:

 Hi to all,

 in my application I read objects that are not serializable
 because I cannot modify the sources.
 So I tried to do a workaround creating a dummy class that extends
 the unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some of
 them are still not serializable and some of them are, e.g. List<String>).

 As long as I only do map and filter on the RDD, the objects are filled
 correctly (I checked that via the Eclipse debugger), but when I do collect
 all the attributes of my objects are empty. Could you help me please?
 I'm using spark-core_2.10, version 0.9.0-incubating.

 Best,
 Flavio







Re: RDD collect help

2014-04-18 Thread Flavio Pompermaier
Ok thanks. However, it turns out that there's a problem with that, and it's
not so safe to use Kryo serialization with Spark:

Exception in thread "Executor task launch worker-0"
java.lang.NullPointerException
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1$$anonfun$6.apply(Executor.scala:267)
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1$$anonfun$6.apply(Executor.scala:267)

This error is also reported at
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCAPud8Tq7fK5j2Up9dDdRQ=y1efwidjnmqc55o9jm5dh7rpd...@mail.gmail.com%3E.


On Fri, Apr 18, 2014 at 10:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Because it happens to reference something outside the closure's scope, which
 in turn references other objects (that you don't need) and so on, resulting
 in a lot of things you don't want being serialized along with your task. But
 sure, it is debatable and it's more my personal opinion.


 2014-04-17 23:28 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo
 serialization for closures? Is there any problem with that?
  On Apr 17, 2014 11:10 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 You have two kinds of serialization: data and closures. They both use Java
 serialization by default. This means that if in your function you reference
 an object outside of it, that object gets serialized with your task. To
 enable Kryo serialization for closures, set the spark.closure.serializer
 property. But usually I don't, as the default (Java serialization) lets me
 detect such unwanted references.
 On Apr 17, 2014 10:17 PM, Flavio Pompermaier pomperma...@okkam.it wrote:

 Now I have another problem... I have to pass one of these non-serializable
 objects to a PairFunction and I received another non-serializable
 exception... it seems that Kryo doesn't work within Functions. Am I wrong,
 or is this a limit of Spark?
 On Apr 15, 2014 1:36 PM, Flavio Pompermaier pomperma...@okkam.it
 wrote:

 Ok thanks for the help!

 Best,
 Flavio


 On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Nope, those operations are lazy, meaning they will create the RDDs but
 won't trigger any action. The computation is launched by operations such
 as collect, count, save to HDFS etc. And even if they were not lazy, no
 serialization would happen. Serialization occurs only when data will be
 transferred (collect, shuffle, maybe persist to disk - but I am not sure
 about this one).


 2014-04-15 0:34 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Ok, that's fair enough. But why do things work up to the collect? During
 map and filter, are objects not serialized?
  On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com
 wrote:

 Sure. As you have pointed out, those classes don't implement
 Serializable, and Spark uses Java serialization by default (when you do
 collect, the data from the workers will be serialized, collected by the
 driver and then deserialized on the driver side). Kryo (like most other
 decent serialization libs) doesn't require you to implement Serializable.

 For the missing attributes, it's due to the fact that Java serialization
 does not serialize/deserialize attributes from classes that don't
 implement Serializable (in your case, the parent classes).


 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it
 :

 Thanks Eugen for the reply. Could you explain to me why I have the
 problem? Why doesn't my serialization work?
 On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com
 wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier 
 pomperma...@okkam.it:

 Hi to all,

 in my application I read objects that are not serializable
 because I cannot modify the sources.
 So I tried to do a workaround creating a dummy class that
 extends the unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some of
 them are still not serializable and some of them are, e.g. List<String>).

 As long as I only do map and filter on the RDD, the objects are filled
 correctly (I checked that via the Eclipse debugger), but when I do collect
 all the attributes of my objects are empty. Could you help me please?
 I'm using spark-core_2.10, version 0.9.0-incubating.

 Best,
 Flavio









Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
Indeed, serialization is always tricky when you want to work on objects that
are more sophisticated than simple POJOs.
And you can sometimes have unexpected behaviour when using the deserialized
objects. In my case I had trouble serializing/deserializing Avro specific
records with lists: the implementation of java.util.List used by Avro does
not have a default no-arg constructor and has initialization logic inside
its constructors.


The best way to go (IMO) when you need some:
 - var: make a copy of it inside the function that has the closure (see the
sketch after this list)
 - function to use in your closure: define it in some stateless dummy class
that implements Serializable
 - also, a trick with vars could be to define them as lazy, so they will be
created inside the closure and the closure won't hold a reference to the
outer class (but you might get other surprises...)


2014-04-18 10:37 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Ok thanks. However, it turns out that there's a problem with that, and it's
 not so safe to use Kryo serialization with Spark:

 Exception in thread "Executor task launch worker-0"
 java.lang.NullPointerException
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1$$anonfun$6.apply(Executor.scala:267)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1$$anonfun$6.apply(Executor.scala:267)

 This error is also reported at
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCAPud8Tq7fK5j2Up9dDdRQ=y1efwidjnmqc55o9jm5dh7rpd...@mail.gmail.com%3E.


 On Fri, Apr 18, 2014 at 10:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Because it happens to reference something outside the closure's scope, which
 in turn references other objects (that you don't need) and so on, resulting
 in a lot of things you don't want being serialized along with your task. But
 sure, it is debatable and it's more my personal opinion.


 2014-04-17 23:28 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo
 serialization for closures? Is there any problem with that?
  On Apr 17, 2014 11:10 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 You have two kinds of serialization: data and closures. They both use Java
 serialization by default. This means that if in your function you reference
 an object outside of it, that object gets serialized with your task. To
 enable Kryo serialization for closures, set the spark.closure.serializer
 property. But usually I don't, as the default (Java serialization) lets me
 detect such unwanted references.
 On Apr 17, 2014 10:17 PM, Flavio Pompermaier pomperma...@okkam.it wrote:

 Now I have another problem... I have to pass one of these non-serializable
 objects to a PairFunction and I received another non-serializable
 exception... it seems that Kryo doesn't work within Functions. Am I wrong,
 or is this a limit of Spark?
 On Apr 15, 2014 1:36 PM, Flavio Pompermaier pomperma...@okkam.it
 wrote:

 Ok thanks for the help!

 Best,
 Flavio


 On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Nope, those operations are lazy, meaning they will create the RDDs but
 won't trigger any action. The computation is launched by operations such
 as collect, count, save to HDFS etc. And even if they were not lazy, no
 serialization would happen. Serialization occurs only when data will be
 transferred (collect, shuffle, maybe persist to disk - but I am not sure
 about this one).


 2014-04-15 0:34 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Ok, that's fair enough. But why do things work up to the collect? During
 map and filter, are objects not serialized?
  On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com
 wrote:

 Sure. As you have pointed out, those classes don't implement
 Serializable, and Spark uses Java serialization by default (when you do
 collect, the data from the workers will be serialized, collected by the
 driver and then deserialized on the driver side). Kryo (like most other
 decent serialization libs) doesn't require you to implement Serializable.

 For the missing attributes, it's due to the fact that Java serialization
 does not serialize/deserialize attributes from classes that don't
 implement Serializable (in your case, the parent classes).


 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier 
 pomperma...@okkam.it:

 Thanks Eugen for the reply. Could you explain to me why I have the
 problem? Why doesn't my serialization work?
 On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com
 wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier 
 pomperma...@okkam.it:

 Hi to all,

 in my application I read objects that are not serializable
 because I cannot modify the sources.
 So I tried to do a workaround creating a dummy class that
 extends the unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some
 of them are still not serializable and some of them are, e.g. List<String>).

 Until I do map 

Re: RDD collect help

2014-04-17 Thread Eugen Cepoi
You have two kinds of serialization: data and closures. They both use Java
serialization by default. This means that if in your function you reference
an object outside of it, that object gets serialized with your task. To
enable Kryo serialization for closures, set the spark.closure.serializer
property. But usually I don't, as the default (Java serialization) lets me
detect such unwanted references.
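
For reference, this is roughly how the two serializers would be configured --
a sketch against the 0.9 SparkConf API (the closure one is the property I
mentioned; as discussed, it may be better left at its default):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  // data serializer (what the configuration page linked in the thread enables)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // closure serializer: Java by default; switching it to Kryo is possible,
  // but I usually keep the default so bad references fail fast
  .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
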
On Apr 17, 2014 10:17 PM, Flavio Pompermaier pomperma...@okkam.it wrote:

 Now I have another problem... I have to pass one of these non-serializable
 objects to a PairFunction and I received another non-serializable
 exception... it seems that Kryo doesn't work within Functions. Am I wrong,
 or is this a limit of Spark?
 On Apr 15, 2014 1:36 PM, Flavio Pompermaier pomperma...@okkam.it
 wrote:

 Ok thanks for the help!

 Best,
 Flavio


 On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Nope, those operations are lazy, meaning they will create the RDDs but
 won't trigger any action. The computation is launched by operations such
 as collect, count, save to HDFS etc. And even if they were not lazy, no
 serialization would happen. Serialization occurs only when data will be
 transferred (collect, shuffle, maybe persist to disk - but I am not sure
 about this one).


 2014-04-15 0:34 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Ok, that's fair enough. But why do things work up to the collect? During
 map and filter, are objects not serialized?
  On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Sure. As you have pointed out, those classes don't implement Serializable,
 and Spark uses Java serialization by default (when you do collect, the data
 from the workers will be serialized, collected by the driver and then
 deserialized on the driver side). Kryo (like most other decent serialization
 libs) doesn't require you to implement Serializable.

 For the missing attributes, it's due to the fact that Java
 serialization does not serialize/deserialize attributes from classes that
 don't implement Serializable (in your case, the parent classes).


 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Thanks Eugen for the reply. Could you explain to me why I have the
 problem? Why doesn't my serialization work?
 On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it
 :

 Hi to all,

 in my application I read objects that are not serializable because
 I cannot modify the sources.
 So I tried to do a workaround creating a dummy class that extends
 the unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some of
 them are still not serializable and some of them are, e.g. List<String>).

 As long as I only do map and filter on the RDD, the objects are filled
 correctly (I checked that via the Eclipse debugger), but when I do collect
 all the attributes of my objects are empty. Could you help me please?
 I'm using spark-core_2.10, version 0.9.0-incubating.

 Best,
 Flavio







RDD collect help

2014-04-14 Thread Flavio Pompermaier
Hi to all,

in my application I read objects that are not serializable because I cannot
modify the sources.
So I tried to do a workaround creating a dummy class that extends the
unmodifiable one but implements Serializable.
All attributes of the parent class are Lists of objects (some of them are
still not serializable and some of them are, e.g. List<String>).
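
Something like this, just to give the idea (sketched in Scala for brevity;
the real class names are different, ThirdPartyRecord here stands for one of
the classes I cannot modify):

// the class I cannot touch: not Serializable, fields are Lists
class ThirdPartyRecord {
  var names: java.util.List[String] = new java.util.ArrayList[String]()
}

// my workaround: a dummy subclass that only adds Serializable
class SerializableRecord extends ThirdPartyRecord with java.io.Serializable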

As long as I only do map and filter on the RDD, the objects are filled
correctly (I checked that via the Eclipse debugger), but when I do collect
all the attributes of my objects are empty. Could you help me please?
I'm using spark-core_2.10, version 0.9.0-incubating.

Best,
Flavio


Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
Thanks Eugen for the reply. Could you explain to me why I have the
problem? Why doesn't my serialization work?
On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Hi to all,

 in my application I read objects that are not serializable because I
 cannot modify the sources.
 So I tried to do a workaround creating a dummy class that extends the
 unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some of them are
 still not serializable and some of them are, e.g. List<String>).

 As long as I only do map and filter on the RDD, the objects are filled
 correctly (I checked that via the Eclipse debugger), but when I do collect
 all the attributes of my objects are empty. Could you help me please?
 I'm using spark-core_2.10, version 0.9.0-incubating.

 Best,
 Flavio





Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
Sure. As you have pointed out, those classes don't implement Serializable,
and Spark uses Java serialization by default (when you do collect, the data
from the workers will be serialized, collected by the driver and then
deserialized on the driver side). Kryo (like most other decent serialization
libs) doesn't require you to implement Serializable.

For the missing attributes, it's due to the fact that Java serialization
does not serialize/deserialize attributes from classes that don't implement
Serializable (in your case, the parent classes).
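
You can reproduce it without Spark at all; here is a small Scala sketch
(with hypothetical stand-ins for your classes) that mimics what collect does:

import java.io._

class ThirdPartyRecord {                                   // NOT Serializable
  var names: java.util.List[String] = new java.util.ArrayList[String]()
}
class RecordWrapper extends ThirdPartyRecord with Serializable  // the workaround subclass

object RoundTrip extends App {
  val r = new RecordWrapper
  r.names.add("a")

  // serialize then deserialize, like collect does between workers and driver
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(r)
  oos.close()
  val copy = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[RecordWrapper]

  // The non-Serializable parent's fields are not written to the stream;
  // on read its no-arg constructor runs again, so the list comes back empty.
  println(copy.names)   // prints [] instead of [a]
}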


2014-04-14 23:17 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Thanks Eugen for the reply. Could you explain to me why I have the
 problem? Why doesn't my serialization work?
 On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Hi to all,

 in my application I read objects that are not serializable because I
 cannot modify the sources.
 So I tried to do a workaround creating a dummy class that extends the
 unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some of them
 are still not serializable and some of them are, e.g. List<String>).

 As long as I only do map and filter on the RDD, the objects are filled
 correctly (I checked that via the Eclipse debugger), but when I do collect
 all the attributes of my objects are empty. Could you help me please?
 I'm using spark-core_2.10, version 0.9.0-incubating.

 Best,
 Flavio





Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
Ok, that's fair enough. But why do things work up to the collect? During map
and filter, are objects not serialized?
On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Sure. As you have pointed out, those classes don't implement Serializable,
 and Spark uses Java serialization by default (when you do collect, the data
 from the workers will be serialized, collected by the driver and then
 deserialized on the driver side). Kryo (like most other decent serialization
 libs) doesn't require you to implement Serializable.

 For the missing attributes, it's due to the fact that Java serialization
 does not serialize/deserialize attributes from classes that don't implement
 Serializable (in your case, the parent classes).


 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Thanks Eugen for the reply. Could you explain to me why I have the
 problem? Why doesn't my serialization work?
 On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Hi to all,

 in my application I read objects that are not serializable because I
 cannot modify the sources.
 So I tried to do a workaround creating a dummy class that extends the
 unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some of them
 are still not serializable and some of them are, e.g. List<String>).

 As long as I only do map and filter on the RDD, the objects are filled
 correctly (I checked that via the Eclipse debugger), but when I do collect
 all the attributes of my objects are empty. Could you help me please?
 I'm using spark-core_2.10, version 0.9.0-incubating.

 Best,
 Flavio






Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
Nope, those operations are lazy, meaning they will create the RDDs but won't
trigger any action. The computation is launched by operations such as
collect, count, save to HDFS etc. And even if they were not lazy, no
serialization would happen. Serialization occurs only when data will be
transferred (collect, shuffle, maybe persist to disk - but I am not sure
about this one).


2014-04-15 0:34 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Ok, that's fair enough. But why do things work up to the collect? During map
 and filter, are objects not serialized?
 On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Sure. As you have pointed out, those classes don't implement Serializable,
 and Spark uses Java serialization by default (when you do collect, the data
 from the workers will be serialized, collected by the driver and then
 deserialized on the driver side). Kryo (like most other decent serialization
 libs) doesn't require you to implement Serializable.

 For the missing attributes, it's due to the fact that Java serialization
 does not serialize/deserialize attributes from classes that don't implement
 Serializable (in your case, the parent classes).


 2014-04-14 23:17 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Thanks Eugen for the reply. Could you explain to me why I have the
 problem? Why doesn't my serialization work?
 On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

 Hi,

 as an easy workaround you can enable Kryo serialization
 http://spark.apache.org/docs/latest/configuration.html

 Eugen


 2014-04-14 18:21 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Hi to all,

 in my application I read objects that are not serializable because I
 cannot modify the sources.
 So I tried to do a workaround creating a dummy class that extends the
 unmodifiable one but implements Serializable.
 All attributes of the parent class are Lists of objects (some of them
 are still not serializable and some of them are, e.g. List<String>).

 As long as I only do map and filter on the RDD, the objects are filled
 correctly (I checked that via the Eclipse debugger), but when I do collect
 all the attributes of my objects are empty. Could you help me please?
 I'm using spark-core_2.10, version 0.9.0-incubating.

 Best,
 Flavio