Re: [VOTE] Release Apache Spark 1.3.0 (RC2)
Hey Marcelo,

Yes - I agree. That one trickled in just as I was packaging this RC. However, I still put this out here to allow people to test the existing fixes, etc.

- Patrick

On Wed, Mar 4, 2015 at 9:26 AM, Marcelo Vanzin van...@cloudera.com wrote:

I haven't tested the rc2 bits yet, but I'd consider https://issues.apache.org/jira/browse/SPARK-6144 a serious regression from 1.2 (since it affects existing addFile() functionality if the URL is "hdfs:..."). Will test other parts separately.

On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.3.0!

The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

Staging repositories for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1074/ (published with version '1.3.0')
https://repository.apache.org/content/repositories/orgapachespark-1075/ (published with version '1.3.0-rc2')

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/

Please vote on releasing this package as Apache Spark 1.3.0!

The vote is open until Saturday, March 07, at 04:17 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== How does this compare to RC1 ==
This release candidate includes a variety of fixes for bugs found in RC1.

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by taking a Spark 1.2 workload, running it on this release candidate, and reporting any regressions. If you are happy with this release based on your own testing, give a +1 vote.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.3 QA period, so -1 votes should only occur for significant regressions from 1.2.1. Bugs already present in 1.2.x, minor regressions, or bugs related to new features will not block this release.

-- Marcelo
Re: [VOTE] Release Apache Spark 1.3.0 (RC2)
I haven't tested the rc2 bits yet, but I'd consider https://issues.apache.org/jira/browse/SPARK-6144 a serious regression from 1.2 (since it affects existing addFile() functionality if the URL is "hdfs:..."). Will test other parts separately.

-- Marcelo
Re: [VOTE] Release Apache Spark 1.3.0 (RC2)
-1 (non-binding) because of SPARK-6144. But aside from that, I ran a set of tests on top of standalone and yarn and things look good.

-- Marcelo
short jenkins 7am downtime tomorrow morning (3-5-15)
the master and workers need some system and package updates, and i'll also be rebooting the machines. this shouldn't take very long to perform, and i expect jenkins to be back up and building by 9am at the *latest*.

important note: i will NOT be updating jenkins or any of the plugins during this maintenance!

as always, please let me know if you have any questions or concerns.

danke,
shane
Re: enum-like types in Spark
#4 with a preference for CamelCaseEnums
enum-like types in Spark
Hi all,

There are many places where we use enum-like types in Spark, but in different ways. Every approach has both pros and cons. I wonder whether there should be an "official" approach for enum-like types in Spark.

1. Scala's Enumeration (e.g., SchedulingMode, WorkerState, etc.)
   * All types show up as Enumeration.Value in Java:
     http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html

2. Java's Enum (e.g., SaveMode, IOMode)
   * Implementation must be in a Java file.
   * Values don't show up in the ScalaDoc:
     http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode

3. Static fields in Java (e.g., TripletFields)
   * Implementation must be in a Java file.
   * Doesn't need "()" in Java code.
   * Values don't show up in the ScalaDoc:
     http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields

4. Objects in Scala (e.g., StorageLevel)
   * Needs "()" in Java code.
   * Values show up in both ScalaDoc and JavaDoc:
     http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
     http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html

It would be great if we have an "official" approach for this as well as the naming convention for enum-like values ("MEMORY_ONLY" or "MemoryOnly"). Personally, I like 4) with "MEMORY_ONLY". Any thoughts?

Best,
Xiangrui
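(For concreteness, a minimal sketch of options 1 and 4 follows; the definitions are illustrative stand-ins, not Spark's actual code.)

// Option 1: Scala's Enumeration. Concise, but every value is typed as
// Enumeration.Value from Java, so the enum type itself is invisible there.
object SchedulingModeDemo extends Enumeration {
  val FAIR, FIFO, NONE = Value
}

// Option 4: a sealed trait with values nested in an object of the same
// name. Values document cleanly in both ScalaDoc and JavaDoc, but Java
// callers go through methods, hence the "()".
sealed trait StorageLevelDemo
object StorageLevelDemo {
  case object MEMORY_ONLY extends StorageLevelDemo
  case object DISK_ONLY extends StorageLevelDemo
}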
Re: Task result is serialized twice by serializer and closure serializer
Hey Mingyu,

I think it's broken out separately so we can record the time taken to serialize the result. Once we serialize it once, the second serialization should be really simple since it's just wrapping something that has already been turned into a byte buffer.

Do you see a specific issue with serializing it twice? I think you need to have two steps if you want to record the time taken to serialize the result, since that needs to be sent back to the driver when the task completes.

- Patrick
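(To make the "just wrapping a byte buffer" point concrete, here is a self-contained Scala sketch using plain JDK serialization, nothing Spark-specific: Java-serializing an already-serialized byte array adds only a few header bytes; the real cost is the second in-memory copy discussed later in the thread.)

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object DoubleSerializeDemo extends App {
  def javaSerialize(obj: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close()
    bos.toByteArray
  }

  // Stand-in for an already-serialized 64MB partition.
  val payload = Array.fill[Byte](64 * 1024 * 1024)(1.toByte)
  val wrapped = javaSerialize(payload)
  // Prints a size difference of only a few bytes; memory-wise, though,
  // both copies are live at once.
  println(s"payload: ${payload.length} bytes, re-serialized: ${wrapped.length} bytes")
}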
Re: enum-like types in Spark
another vote for #4. People are already used to adding () in Java.
Re: Task result is serialized twice by serializer and closure serializer
The concern is really just the runtime overhead and memory footprint of Java-serializing an already-serialized byte array again. We originally noticed this when we were using RDD.toLocalIterator(), which serializes the entire 64MB partition. We worked around this issue by kryo-serializing and snappy-compressing the partition on the executor side before returning it back to the driver, but this operation just felt redundant.

Your explanation about reporting the time taken makes it clearer why it's designed this way. Since the byte array for the serialized task result shouldn't account for the majority of memory footprint anyways, I'm okay with leaving it as is, then.

Thanks,
Mingyu
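(A sketch of the kind of workaround Mingyu describes; this is an assumption on my part, not the actual code, and it presumes snappy-java on the classpath.)

import org.apache.spark.SparkEnv
import org.apache.spark.rdd.RDD
import org.xerial.snappy.Snappy

import scala.reflect.ClassTag

// Serialize and compress each partition on the executors, so the driver
// receives one compact byte array per partition instead of objects that
// get Java-re-serialized on the way back.
def compressedPartitions[T: ClassTag](rdd: RDD[T]): Array[Array[Byte]] = {
  rdd.mapPartitions { iter =>
    val ser = SparkEnv.get.serializer.newInstance() // Kryo if so configured
    val bytes = ser.serialize(iter.toArray).array()
    Iterator(Snappy.compress(bytes))
  }.collect() // driver decompresses and deserializes lazily as needed
}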
Task result is serialized twice by serializer and closure serializer
Hi all,

It looks like the result of a task is serialized twice, once by the serializer (i.e. Java/Kryo depending on configuration) and once again by the closure serializer (i.e. Java). To link the actual code:

The first one: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
The second one: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226

This serializes the "value" (the result of the task run) twice, which affects things like collect(), takeSample(), and toLocalIterator(). Would it make sense to simply serialize the DirectTaskResult once using the regular "serializer" (as opposed to the closure serializer)? Would it cause problems when the Accumulator values are not Kryo-serializable?

Alternatively, if we can assume that Accumulator values are small, we can closure-serialize those, put the serialized byte array in DirectTaskResult with the raw task result "value", and serialize DirectTaskResult. What do people think?

Thanks,
Mingyu
Re: enum-like types in Spark
I'm cool with #4 as well, but make sure we dictate that the values should be defined within an object with the same name as the enumeration (like we do for StorageLevel). Otherwise we may pollute a higher namespace.

e.g. we SHOULD do:

trait StorageLevel

object StorageLevel {
  case object MemoryOnly extends StorageLevel
  case object DiskOnly extends StorageLevel
}
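(A quick illustration of the namespace point, assuming Aaron's sketch above:)

// Nested case objects keep the value names scoped under the enum's object:
val level: StorageLevel = StorageLevel.MemoryOnly

// Declaring "case object MemoryOnly extends StorageLevel" at the top level
// would instead inject MemoryOnly directly into the enclosing package.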
Re: [VOTE] Release Apache Spark 1.3.0 (RC2)
I think we will have to fix https://issues.apache.org/jira/browse/SPARK-5143 as well before the final 1.3.x. But yes, everything else checks out for me, including sigs and hashes and building the source release. I have been following JIRA closely and am not aware of other blockers besides the ones already identified.

On Wed, Mar 4, 2015 at 7:09 PM, Marcelo Vanzin van...@cloudera.com wrote:

-1 (non-binding) because of SPARK-6144. But aside from that I ran a set of tests on top of standalone and yarn and things look good.
Re: enum-like types in Spark
#4, but with MemoryOnly (more Scala-like).

From http://docs.scala-lang.org/style/naming-conventions.html, under "Constants, Values, Variable and Methods":

"Constant names should be in upper camel case. That is, if the member is final, immutable and it belongs to a package object or an object, it may be considered a constant (similar to Java's static final members):"

object Container {
  val MyConstant = ...
}
Re: enum-like types in Spark
I like #4 as well and agree with Aaron's suggestion.

- Patrick
Fwd: Unable to Read/Write Avro RDD on cluster.
I am trying to read an Avro RDD, transform it, and write it back out. I am able to run it fine locally, but when I run it on the cluster, I see issues with Avro.

export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1
export SPARK_YARN_USER_ENV=CLASSPATH=/apache/hadoop/conf
export HADOOP_CONF_DIR=/apache/hadoop/conf
export YARN_CONF_DIR=/apache/hadoop/conf
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
export SPARK_LIBRARY_PATH=/apache/hadoop/lib/native
export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar

cd $SPARK_HOME
./bin/spark-submit --master yarn-cluster \
  --jars /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar \
  --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 \
  --queue hdmi-spark \
  --class com.company.ep.poc.spark.reporting.SparkApp \
  /home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar \
  startDate=2015-02-16 endDate=2015-02-16 \
  epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession \
  subcommand=successevents \
  outputdir=/user/dvasthimal/epdatasets/successdetail

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/04 03:20:29 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
15/03/04 03:20:30 INFO yarn.Client: Got Cluster metric info from ApplicationsManager (ASM), number of NodeManagers: 2221
15/03/04 03:20:30 INFO yarn.Client: Queue info ... queueName: hdmi-spark, queueCurrentCapacity: 0.7162806, queueMaxCapacity: 0.08, queueApplicationCount = 7, queueChildQueueCount = 0
15/03/04 03:20:30 INFO yarn.Client: Max mem capabililty of a single resource in this cluster 16384
15/03/04 03:20:30 INFO yarn.Client: Preparing Local resources
15/03/04 03:20:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/04 03:20:30 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/03/04 03:20:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 7780745 for dvasthimal on 10.115.206.112:8020
15/03/04 03:20:46 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark_reporting-1.0-SNAPSHOT.jar
15/03/04 03:20:47 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1/lib/spark-assembly-1.0.2-hadoop2.4.1.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark-assembly-1.0.2-hadoop2.4.1.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-mapred-1.7.7-hadoop2.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/avro-1.7.7.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-1.7.7.jar
15/03/04 03:20:54 INFO yarn.Client: Setting up the launch environment
15/03/04 03:20:54 INFO yarn.Client: Setting up container launch context
15/03/04 03:20:54 INFO yarn.Client: Command for starting the Spark ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx4096m, -Djava.io.tmpdir=$PWD/tmp, -Dspark.app.name=\"com.company.ep.poc.spark.reporting.SparkApp\", -Dlog4j.configuration=log4j-spark-container.properties, org.apache.spark.deploy.yarn.ApplicationMaster, --class, com.company.ep.poc.spark.reporting.SparkApp, --jar, file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar, --args 'startDate=2015-02-16' --args 'endDate=2015-02-16' --args 'epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession' --args 'subcommand=successevents' --args 'outputdir=/user/dvasthimal/epdatasets/successdetail', --executor-memory, 2048, --executor-cores, 1, --num-executors, 3, 1> <LOG_DIR>/stdout, 2> <LOG_DIR>/stderr)
15/03/04 03:20:54 INFO yarn.Client: Submitting application to ASM
15/03/04 03:20:54 INFO impl.YarnClientImpl: Submitted application application_1425075571333_61948
15/03/04 03:20:56 INFO yarn.Client: Application report from ASM:
  application identifier: application_1425075571333_61948
  appId: 61948
  clientToAMToken: null
  appDiagnostics:
  appMasterHost: N/A
  appQueue: hdmi-spark
  appMasterRpcPort: -1
  appStartTime: 1425464454263
  yarnAppState: ACCEPTED
  distributedFinalState: UNDEFINED
  appTrackingUrl: https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
  appUser: dvasthimal
Re: ideas for MLlib development
Thanks for your reply, Evan.

> It may make sense to have a more general Gibbs sampling framework, but it might be good to have a few desired applications in mind (e.g. higher level models that rely on Gibbs) to help API design, parallelization strategy, etc.

I think I'm more interested in a general framework which could be applied to a variety of models, as opposed to an implementation tailored to a specific model such as LDA. I'm thinking that such a framework could be used in model exploration, either as an end in itself or perhaps to identify promising models that could then be given optimized, custom implementations. This would be very much in the spirit of existing packages such as BUGS. In fact, if we were to go down this road, I would propose that models be specified in the BUGS modeling language -- no need to reinvent that wheel, I would say.

At a very high level, the API for this framework would specify methods to compute conditional distributions, marginalizing as necessary via MCMC. Other operations could include computing the expected value of a variable or function. All this is very reminiscent of BUGS, of course.

best,
Robert Dodier
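(To make the shape of that API concrete, a purely hypothetical Scala sketch; every name here is invented for illustration and none of this exists in MLlib.)

import org.apache.spark.rdd.RDD

// Hypothetical API sketch for a BUGS-style Gibbs sampling framework.
trait GibbsModel {
  // Variable names as declared in a BUGS-style model specification.
  def variables: Seq[String]

  // Draw from the full conditional of `variable` given the current state.
  def sampleConditional(variable: String, state: Map[String, Double]): Double

  // Run `numSweeps` Gibbs sweeps over a partitioned set of starting states.
  def sample(init: RDD[Map[String, Double]], numSweeps: Int): RDD[Map[String, Double]]

  // Estimate E[f(X)] by averaging f over the drawn samples.
  def expectation(f: Map[String, Double] => Double, numSweeps: Int): Double
}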
Re: Task result is serialized twice by serializer and closure serializer
Yeah, it will result in a second serialized copy of the array (costing some memory). But the computational overhead should be very small. The absolute worst case here will be when doing a collect() or something similar that just bundles the entire partition.

- Patrick
Re: [VOTE] Release Apache Spark 1.3.0 (RC2)
It is the LR over car-data at https://github.com/xsankar/cloaked-ironman.
1.2.0 gives Mean Squared Error = 40.8130551358
1.3.0 gives Mean Squared Error = 105.857603953
I will verify it one more time tomorrow.

Cheers
k/

On Tue, Mar 3, 2015 at 11:28 PM, Xiangrui Meng men...@gmail.com wrote:

> On Tue, Mar 3, 2015 at 11:15 PM, Krishna Sankar ksanka...@gmail.com wrote:
>
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK. Total time: 13:53 min
>    mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
> 2. Tested pyspark, MLlib - running as well as comparing results with 1.1.x and 1.2.x
>    2.1. statistics (min, max, mean, Pearson, Spearman) OK
>    2.2. Linear/Ridge/Lasso Regression OK, but MSE has increased from 40.81 to 105.86. Has some refactoring happened on SGD/linear models? Or do we have some extra parameters? Or a change of defaults?

Could you share the code you used? I don't remember any changes in linear regression. Thanks!

-Xiangrui

>    2.3. Decision Tree, Naive Bayes OK
>    2.4. KMeans OK. Center and scale OK. WSSSE has come down slightly.
>    2.5. RDD operations OK. State of the Union texts - MapReduce, Filter, sortByKey (word count)
>    2.6. Recommendation (MovieLens medium dataset, ~1M ratings) OK. Model evaluation/optimization (rank, numIter, lmbda) with itertools OK
> 3. Scala - MLlib
>    3.1. statistics (min, max, mean, Pearson, Spearman) OK
>    3.2. LinearRegressionWithSGD OK
>    3.3. Decision Tree OK
>    3.4. KMeans OK
>    3.5. Recommendation (MovieLens medium dataset, ~1M ratings) OK
> 4.0. SQL from Python
>    4.1. result = sqlContext.sql("SELECT * from Employees WHERE State = 'WA'") OK
>
> Cheers
> k/
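(For context, the MSE in benchmarks like cloaked-ironman is typically computed with the standard MLlib pattern; a self-contained sketch against the MLlib 1.2/1.3 API follows. The file path and feature layout are placeholders, not the actual benchmark code.)

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object LinearRegressionMSE {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LR MSE check"))

    // Placeholder input format: "label,f1 f2 f3" per line.
    val data = sc.textFile("data/car-data.csv").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Defaults (step size, intercept handling) are where a silent change
    // between releases would surface as a different MSE.
    val model = LinearRegressionWithSGD.train(data, 100)

    val valuesAndPreds = data.map(p => (p.label, model.predict(p.features)))
    val mse = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println(s"Mean Squared Error = $mse")
    sc.stop()
  }
}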
Spark Streaming and SchemaRDD usage
Hi,

In the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw that SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application?

Thank you very much!
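(One common pattern at the time, offered as a sketch rather than an official answer: convert each micro-batch RDD into a SchemaRDD inside foreachRDD and query it with Spark SQL. The source, record type, and query below are illustrative assumptions, written against the Spark 1.2-era API.)

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.StreamingContext

// Hypothetical record type for illustration.
case class Event(user: String, action: String)

def process(ssc: StreamingContext): Unit = {
  // Hypothetical source; any DStream[String] works the same way.
  val lines = ssc.socketTextStream("localhost", 9999)

  lines.foreachRDD { rdd =>
    // In practice a singleton SQLContext would be reused across batches.
    val sqlContext = new SQLContext(rdd.sparkContext)
    import sqlContext.createSchemaRDD // implicit RDD[Product] => SchemaRDD

    val events = rdd.map(_.split(",")).map(a => Event(a(0), a(1)))
    events.registerTempTable("events")
    sqlContext.sql("SELECT action, COUNT(*) FROM events GROUP BY action")
      .collect()
      .foreach(println)
  }
}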
Re: [VOTE] Release Apache Spark 1.3.0 (RC2)
+1 (subject to comments on EC2 issues below)

machine 1: MacBook Air, OSX 10.10.2 (Yosemite), Java 8
machine 2: iMac, OSX 10.8.4, Java 7

1. mvn clean package -DskipTests (33min/13min)
2. ran SVM benchmark https://github.com/insidedctm/spark-mllib-benchmark

EC2 issues:

1) Unable to successfully launch servers on EC2 using the following command:

./spark-ec2 -k key -i pathtokey -s 2 launch test

The error was:

Initializing Spark
ERROR: Unknown Spark Version

This seems to be due to init.sh in https://github.com/mesos/spark-ec2/tree/branch-1.3/spark not being updated for version 1.3.0. Question: is it expected that this repository is updated in tandem with the Spark release?

2) One of my machines fails to authenticate when running the ec2 scripts. Using spark-ec2 I get the following error:

ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

It seems this machine has a problem where openssl can't verify server certificates. This has not been a problem for me before, but Spark 1.3.0 has updated the boto library it uses for EC2 functionality to 2.34.0. This new version can verify server certificates when authenticating to AWS servers, and the default behaviour is to check certificates. Clearly this is not a Spark issue per se; however, if the problem I have is reasonably widespread, there may be quite a few issues coming up on the mailing lists. There is a workaround option that could be put into the spark_ec2.py script to turn off certificate validation if necessary.
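(For reference, one such workaround already exists at the boto level: its standard config file can disable certificate verification globally. This is a sketch of that boto 2.x switch, not an existing spark_ec2.py option, and it trades away transport security, so treat it as a last resort.)

# ~/.boto (or /etc/boto.cfg) -- disables boto 2.x server certificate
# verification for all connections made by this user/machine.
[Boto]
https_validate_certificates = False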
Re: Google Summer of Code - Quick Query
Hi Manoj,

This question is best asked on the Spark mailing lists (copied). From a formal point of view, all that counts is your proposal in Melange once applications start, but your mentor or the project you wish to contribute to may have additional requirements.

Cheers,
Uli

On 2015-03-03 08:54, Manoj Kumar wrote:

Hello,

I am Manoj, a prospective student for the Apache Spark project. I have been contributing to Spark and discussing the project idea with my mentor for some time now. The tentative project has a number of JIRAs associated with it. I would still like to know if it is necessary to create an umbrella JIRA with all the other JIRAs linked to it (and tagged with "gsoc"), or if it is sufficient to just upload a proposal when the registration opens.

Regards