Re: RDD collect help
Because it happens to reference something outside the closure's scope, which will in turn reference other objects (that you don't need) and so on, resulting in a lot of things you don't want being serialized along with your task. But sure, it is debatable and it's more my personal opinion.

2014-04-17 23:28 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it: Thanks again Eugen! I don't get the point: why do you prefer to avoid Kryo serialization for closures? Is there any problem with that?
Re: RDD collect help
Ok thanks. However it turns out that there's a problem with that, and it's not so safe to use Kryo serialization with Spark:

Exception in thread Executor task launch worker-0 java.lang.NullPointerException
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1$$anonfun$6.apply(Executor.scala:267)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1$$anonfun$6.apply(Executor.scala:267)

This error is also reported at http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCAPud8Tq7fK5j2Up9dDdRQ=y1efwidjnmqc55o9jm5dh7rpd...@mail.gmail.com%3E .

--
Flavio Pompermaier
Development Department, OKKAM Srl - www.okkam.it
Phone: +(39) 0461 283 702 | Fax: +(39) 0461 186 6433 | Email: pomperma...@okkam.it
Headquarters: Trento (Italy), fraz. Villazzano, Salita dei Molini 2
Registered office: Trento (Italy), via Segantini 23
Re: RDD collect help
Indeed, serialization is always tricky when you want to work on objects that are more sophisticated than simple POJOs, and you can sometimes get unexpected behaviour when using the deserialized objects. In my case I had trouble serializing/deserializing Avro specific records with lists: the implementation of java.util.List used by Avro does not have a default no-arg constructor and has initialization logic inside its constructors.

The best way to go (IMO) when you need some:
- var: make a copy of it inside the function that forms the closure
- function to use in your closure: define it in some stateless dummy class that implements Serializable
- also, a trick with vars is to declare them lazy, so they are created inside the closure and the closure won't hold a reference to the outer class (but you might get other surprises...)
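The first two tricks above can be sketched in plain Java, outside Spark (the class and method names here are made up for illustration, not from the thread): a lambda that reads an instance field captures the whole enclosing object, while copying the value into a local variable first captures only that value.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureDemo {
    // A Serializable function interface, in the spirit of Spark's Function/PairFunction.
    interface SerFn<A, B> extends Serializable {
        B apply(A a);
    }

    static byte[] serialize(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(o);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Stands in for a non-serializable driver-side class.
    static class Holder {
        String prefix = "row-";

        SerFn<Integer, String> capturesThis() {
            // Reads the field directly, so the lambda captures `this` and
            // serialization drags the whole (non-serializable) Holder along.
            return i -> prefix + i;
        }

        SerFn<Integer, String> capturesCopy() {
            final String p = prefix; // the "copy the var inside" trick
            return i -> p + i;       // captures only the String, which is Serializable
        }
    }

    public static void main(String[] args) {
        Holder h = new Holder();
        try {
            serialize(h.capturesThis());
        } catch (RuntimeException e) {
            System.out.println("failed: " + e.getCause()); // java.io.NotSerializableException
        }
        System.out.println("ok, " + serialize(h.capturesCopy()).length + " bytes");
    }
}
```

The same reasoning applies to anonymous inner classes, which always hold a reference to their enclosing instance; defining the function in a separate stateless Serializable class avoids the capture entirely.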
Re: RDD collect help
You have two kinds of serialization: data and closures. Both use Java serialization by default. This means that when your function references an object outside of it, that object gets serialized with your task. To enable Kryo serialization for closures, set the spark.closure.serializer property. But usually I don't, as the default allows me to detect such unwanted references.

On Apr 17, 2014 at 22:17, Flavio Pompermaier pomperma...@okkam.it wrote: Now I have another problem: I have to pass one of these non-serializable objects to a PairFunction and I get another non-serializable exception. It seems that Kryo doesn't work within Functions. Am I wrong, or is this a limit of Spark?

On Apr 15, 2014 1:36 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Ok thanks for the help! Best, Flavio
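For reference, wiring both settings up in Spark 0.9.x looks roughly like the sketch below (a configuration fragment, not a runnable program; it assumes spark-core on the classpath). Treat the closure line with care: the NullPointerException reported in this thread was hit with exactly that setting enabled.

```java
// Spark 0.9.x: choose serializers via configuration properties.
SparkConf conf = new SparkConf()
    .setAppName("kryo-example")
    // Kryo for data serialization (the documented, commonly used setting):
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    // Kryo for closure serialization -- the property mentioned above; known to be fragile:
    // .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer");
JavaSparkContext sc = new JavaSparkContext(conf);
```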
RDD collect help
Hi to all, in my application I read objects that are not serializable, and I cannot modify their sources. So I tried a workaround: creating a dummy class that extends the unmodifiable one but implements Serializable. All attributes of the parent class are Lists of objects (some of them are still not serializable and some of them are, e.g. List<String>). Up until I do map and filter on the RDD, those objects are filled correctly (I checked that via Eclipse debug), but when I do collect, all the attributes of my objects are empty. Could you help me please? I'm using spark-core_2.10, version 0.9.0-incubating. Best, Flavio
Re: RDD collect help
Thanks Eugen for the reply. Could you explain why I have the problem? Why doesn't my serialization work?

On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi, as an easy workaround you can enable Kryo serialization http://spark.apache.org/docs/latest/configuration.html Eugen
Re: RDD collect help
Sure. As you have pointed out, those classes don't implement Serializable, and Spark uses Java serialization by default (when you do collect, the data from the workers is serialized, collected by the driver and then deserialized on the driver side). Kryo (like most other decent serialization libs) doesn't require you to implement Serializable. The missing attributes are due to the fact that Java serialization does not serialize/deserialize attributes declared in classes that don't implement Serializable (in your case, the parent classes).
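This parent-class behaviour is plain java.io semantics and can be reproduced without Spark (the class names below are hypothetical stand-ins): fields declared in a non-serializable superclass are skipped when the object is written, and on read they are re-initialized by the superclass's no-arg constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the unmodifiable, non-serializable class.
class LegacyRecord {
    public List<String> tags = new ArrayList<>();
    public LegacyRecord() {} // rerun on deserialization
}

// The "dummy subclass" workaround from the thread.
class SerializableRecord extends LegacyRecord implements Serializable {
    private static final long serialVersionUID = 1L;
}

public class SuperclassDemo {
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(obj);
        return (T) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        SerializableRecord r = new SerializableRecord();
        r.tags.add("hello");
        SerializableRecord copy = roundTrip(r);
        // Fields declared in the non-serializable parent are never written;
        // on read they are re-initialized by LegacyRecord's no-arg constructor.
        System.out.println(copy.tags.isEmpty()); // prints "true"
    }
}
```

This is exactly why the attributes come back empty after collect: the round trip succeeds, but the parent-class state never travels.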
Re: RDD collect help
Ok, that's fair enough. But why do things work up to the collect? During map and filter, are the objects not serialized?
Re: RDD collect help
Nope, those operations are lazy, meaning they will create the RDDs but won't trigger any action. The computation is launched by operations such as collect, count, save to HDFS etc. And even if they were not lazy, no serialization would happen. Serialization occurs only when data will be transferred (collect, shuffle, maybe persist to disk - but I am not sure about this one).
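The laziness described above can be felt in miniature with java.util.stream, which behaves analogously (this is plain Java, not Spark): intermediate operations like map and filter only describe a pipeline, and nothing runs until a terminal operation such as collect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Stream<Integer> pipeline = Stream.of(1, 2, 3)
                .map(i -> { log.add("map " + i); return i * 2; })
                .filter(i -> { log.add("filter " + i); return i > 2; });
        // Nothing has executed yet -- map/filter only described the pipeline.
        System.out.println(log.isEmpty()); // prints "true"
        List<Integer> out = pipeline.collect(Collectors.toList());
        System.out.println(out); // prints "[4, 6]"
        // All the map/filter callbacks ran during collect.
        System.out.println(log.size()); // prints "6"
    }
}
```

The analogy is loose (streams run in one JVM, RDD actions ship tasks to executors), but the execution model is the same: no work, and hence no serialization, happens before an action.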