Which method do you think is better for making MIN_REMEMBER_DURATION configurable?
Hello,

This is about SPARK-3276: I want to make MIN_REMEMBER_DURATION (currently a constant) configurable, with a default value. Before spending effort on developing something and creating a pull request, I wanted to consult the core developers about which approach makes the most sense and has the highest probability of being accepted.

The constant MIN_REMEMBER_DURATION can be seen at:
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L338
It is a private member of the private[streaming] object FileInputDStream.

Approach 1: Make MIN_REMEMBER_DURATION a variable, with a new name of minRememberDuration, and then add a new fileStream method to JavaStreamingContext.scala (https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala) such that the new fileStream method accepts a new parameter, e.g. minRememberDuration: Int (in seconds), and then use this value to set the private minRememberDuration.

Approach 2: Create a new, public Spark configuration property, e.g. named spark.rememberDuration.min (with a default value of 60 seconds), and then set the private variable minRememberDuration to the value of this Spark property.

Approach 1 would mean adding a new method to the public API; Approach 2 would mean creating a new public Spark property. Right now, Approach 2 seems more straightforward and simpler to me, but I nevertheless wanted the opinions of developers who know the internals of Spark better than I do.

Kind regards,
Emre Sevinç
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
Still a +1 from me; same result (except that now, of course, the UISeleniumSuite test does not fail).

On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1083/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/

The patches on top of RC1 are:

[SPARK-6737] Fix memory leak in OutputCommitCoordinator
https://github.com/apache/spark/pull/5397

[SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
https://github.com/apache/spark/pull/5302

[SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
https://github.com/apache/spark/pull/4933

Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
finding free ports for tests
I'm writing some functional tests for the SPARK-1537 JIRA (YARN timeline service integration), for which I need to allocate some free ports. I don't want to hard-code them, as that can lead to unreliable tests, especially on Jenkins. Before I implement the logic myself: is there a utility class/trait for finding ports for tests?
Re: finding free ports for tests
Utils.startServiceOnPort?

On Wed, Apr 8, 2015 at 6:16 AM, Steve Loughran ste...@hortonworks.com wrote:
I'm writing some functional tests for the SPARK-1537 JIRA, for which I need to allocate some free ports. Is there a utility class/trait for finding ports for tests?
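For comparison with the suggestion above, here is a minimal sketch of the common alternative trick, binding to port 0 so the OS assigns a free ephemeral port. This is plain JDK code, not Spark's implementation; note the inherent race (the port could be taken again before the test rebinds it), which is one reason a retry-on-the-real-service approach like Utils.startServiceOnPort is more robust.

```java
import java.net.ServerSocket;

public class FreePort {
    // Bind to port 0 so the OS picks a free ephemeral port, read the
    // assigned port back, then release the socket for the test to use.
    // Racy: another process may grab the port between close() and reuse.
    public static int findFreePort() throws Exception {
        try (ServerSocket socket = new ServerSocket(0)) {
            socket.setReuseAddress(true);
            return socket.getLocalPort();
        }
    }
}
```

A test would call `FreePort.findFreePort()` once per required port and pass the result into the service under test.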
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended (they were included in RC1)?

On Wed, Apr 8, 2015 at 9:01 AM, Tom Graves tgraves...@yahoo.com.invalid wrote:
+1. Tested Spark on YARN against Hadoop 2.6. Tom
Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?
Approach 2 is definitely better :) Can you tell us more about the use case, i.e. why you want to do this?

TD

On Wed, Apr 8, 2015 at 1:44 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello, this is about SPARK-3276: I want to make MIN_REMEMBER_DURATION (currently a constant) configurable, with a default value.
RDD firstParent
It does not seem to be safe to call RDD.firstParent from anywhere, as it might throw a java.util.NoSuchElementException: head of empty list. This seems to be a bug for a consumer of the RDD API.

Zvara Zoltán
mail, hangout, skype: zoltan.zv...@gmail.com
mobile, viber: +36203129543
PR 5140
Could I get someone to look at PR 5140 please? It's been languishing for more than two weeks.
Re: RDD firstParent
Why is this a bug? Each RDD implementation should know whether it has a parent or not. For example, a MapPartitionsRDD always has a parent, since it is a unary operator.

On Wed, Apr 8, 2015 at 6:19 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:
It does not seem to be safe to call RDD.firstParent from anywhere, as it might throw a java.util.NoSuchElementException: head of empty list.
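The failure mode under discussion can be sketched in isolation. This is an illustrative stand-in, not Spark's actual code: firstParent amounts to taking the head of the dependencies list, so calling it on an RDD with no parents (e.g. an input RDD) throws, and a consumer who cannot guarantee a parent exists must check first.

```java
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;

public class FirstParentSketch {
    // Mirrors Scala's List.head semantics: throws on an empty list,
    // which is the exception reported in the thread.
    static <T> T firstParent(List<T> dependencies) {
        if (dependencies.isEmpty()) {
            throw new NoSuchElementException("head of empty list");
        }
        return dependencies.get(0);
    }

    // A defensive consumer outside the RDD implementation checks first.
    static <T> Optional<T> firstParentOption(List<T> dependencies) {
        return dependencies.isEmpty()
                ? Optional.<T>empty()
                : Optional.of(dependencies.get(0));
    }
}
```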
Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?
Tathagata,

Thanks for stating your preference for Approach 2. My use case and motivation are similar to the concerns raised by others in SPARK-3276. In previous versions of Spark, e.g. the 1.1.x series, Spark Streaming applications could process files in an input directory that existed before the streaming application began, and in some projects we did for our customers we relied on that feature. Starting with the 1.2.x series, we are limited to files whose timestamps are no older than 1 minute; the only workaround is to 'touch' those files before starting the streaming application. Moreover, MIN_REMEMBER_DURATION is set to an arbitrary value of 1 minute, and I don't see any argument why it cannot be set to another arbitrary value (keeping the default of 1 minute if nothing is set by the user).

Putting all this together, my plan is to create a pull request that would:

1. Convert private val MIN_REMEMBER_DURATION into private val minRememberDuration (to reflect that it is no longer a constant, in the sense that it can be set via configuration).
2. Set its value using something like getConf(spark.streaming.minRememberDuration, Minutes(1)).
3. Document spark.streaming.minRememberDuration in the Spark Streaming Programming Guide.

If the above sounds fine, I'll go on implementing this small change and submit a pull request for fixing SPARK-3276. What do you say?

Kind regards,
Emre Sevinç
http://www.bigindustries.be/

On Wed, Apr 8, 2015 at 7:16 PM, Tathagata Das t...@databricks.com wrote:
Approach 2 is definitely better :) Can you tell us more about the use case why you want to do this? TD
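The plan's second step amounts to a configuration lookup with a default. A minimal sketch in plain Java, with a Map standing in for SparkConf; the property name and the 60-second (1 minute) default come from the proposal in this thread, not from any released Spark API:

```java
import java.util.Map;

public class MinRememberDuration {
    // The current hard-coded minimum of 1 minute becomes the default.
    static final long DEFAULT_SECONDS = 60;

    // Read spark.streaming.minRememberDuration if the user set it,
    // otherwise fall back to the default, as Approach 2 proposes.
    static long minRememberDurationSeconds(Map<String, String> conf) {
        String value = conf.get("spark.streaming.minRememberDuration");
        return value == null ? DEFAULT_SECONDS : Long.parseLong(value);
    }
}
```

With no entry in the map the lookup yields 60 seconds; setting the property overrides it, which preserves existing behavior for users who configure nothing.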
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1. Tested on Mac OS X and verified that some of the bugs were fixed.

Matei

On Apr 8, 2015, at 7:13 AM, Sean Owen so...@cloudera.com wrote:
Still a +1 from me; same result (except that now of course the UISeleniumSuite test does not fail)
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1. Tested Spark on YARN against Hadoop 2.6.

Tom

On Wednesday, April 8, 2015 6:15 AM, Sean Owen so...@cloudera.com wrote:
Still a +1 from me; same result (except that now of course the UISeleniumSuite test does not fail)
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1. Tested on a 4-node Mesos cluster in both fine-grained and coarse-grained mode.

Tim

On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com wrote:
The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended (they were included in RC1)?
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
Hey Denny,

I believe the 2.4 bits are there. The 2.6 bits I had done specially (we haven't merged that into our upstream build script). I'll do it again now for RC2.

- Patrick

On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote:
+1 Tested on 4 nodes Mesos cluster with fine-grain and coarse-grain mode.
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
Oh, it appears the 2.4 bits without Hive are there, but not the 2.4 bits with Hive. Cool stuff on the 2.6.

On Wed, Apr 8, 2015 at 12:30, Patrick Wendell pwend...@gmail.com wrote:
Hey Denny, I believe the 2.4 bits are there. The 2.6 bits I had done specially (we haven't merged that into our upstream build script). I'll do it again now for RC2.
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
Oh I see - ah okay, I'm guessing it was a transient build error, and I'll get it posted ASAP.

On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee denny.g@gmail.com wrote:
Oh, it appears the 2.4 bits without Hive are there but not the 2.4 bits with Hive. Cool stuff on the 2.6.
Re: PR 5140
Hey Nathan, thanks for bringing this up. I will look at this within the next day or two.

2015-04-08 8:03 GMT-07:00 Nathan Kronenfeld nkronenfeld@uncharted.software:
Could I get someone to look at PR 5140 please? It's been languishing for more than two weeks.
Re: [mllib] Deprecate static train and use builder instead for Scala/Java
I'll add a note that this is just for ML, not other parts of Spark. (We can discuss more on the JIRA.) Thanks!

Joseph

On Mon, Apr 6, 2015 at 9:46 PM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote:

Hi all,

Joseph proposed an idea about using just builder methods, instead of static train() methods, for Scala/Java. I agree with that idea, because we have many duplicated static train() methods. If you have any thoughts on that, please share them with us.

[SPARK-6682] Deprecate static train and use builder instead for Scala/Java
https://issues.apache.org/jira/browse/SPARK-6682

Thanks,
Yu Ishikawa
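The proposal above can be illustrated with a hypothetical builder. The names below are invented for the sketch (this is not the actual MLlib API): the idea is that one configurable object with defaults replaces a family of overloaded static train(data, k, maxIterations, ...) methods.

```java
public class KMeansBuilder {
    // Defaults take the place of the shorter train() overloads.
    private int k = 2;
    private int maxIterations = 20;

    public KMeansBuilder setK(int k) {
        this.k = k;
        return this; // return this so setter calls can be chained
    }

    public KMeansBuilder setMaxIterations(int maxIterations) {
        this.maxIterations = maxIterations;
        return this;
    }

    public String describe() {
        return "KMeans(k=" + k + ", maxIterations=" + maxIterations + ")";
    }
}
```

A caller writes `new KMeansBuilder().setK(5).setMaxIterations(50)` and sets only the parameters it cares about, instead of picking among many train() signatures.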
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1 (non-binding). Tested Scala, Spark SQL, and MLlib on OS X against Hadoop 2.6.

On Wed, Apr 8, 2015 at 5:35 PM, Joseph Bradley jos...@databricks.com wrote:
+1. Tested ML-related items on Mac OS X.

On Wed, Apr 8, 2015 at 7:59 PM, Krishna Sankar ksanka...@gmail.com wrote:
+1 (non-binding, of course)

1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 14:16 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, MLlib - running as well as comparing results with 1.3.0; pyspark works well with the new IPython 3.0.0 release.
   2.1. Statistics (min, max, mean, Pearson, Spearman) OK
   2.2. Linear/Ridge/Lasso Regression OK
   2.3. Decision Tree, Naive Bayes OK
   2.4. KMeans OK; Center and Scale OK
   2.5. RDD operations OK (State of the Union texts - MapReduce, Filter, sortByKey (word count))
   2.6. Recommendation (MovieLens medium dataset, ~1M ratings) OK; model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
   3.1. Statistics (min, max, mean, Pearson, Spearman) OK
   3.2. LinearRegressionWithSGD OK
   3.3. Decision Tree OK
   3.4. KMeans OK
   3.5. Recommendation (MovieLens medium dataset, ~1M ratings) OK
4. Spark SQL from Python OK
   4.1. result = sqlContext.sql(SELECT * from people WHERE State = 'WA') OK
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1 Built against Hadoop 2.6 and ran some jobs against a pseudo-distributed YARN cluster. -Sandy On Wed, Apr 8, 2015 at 12:49 PM, Patrick Wendell pwend...@gmail.com wrote: Oh I see - ah okay, I'm guessing it was a transient build error and I'll get it posted ASAP. On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee denny.g@gmail.com wrote: Oh, it appears the 2.4 bits without Hive are there, but not the 2.4 bits with Hive. Cool stuff on the 2.6. On Wed, Apr 8, 2015 at 12:30 Patrick Wendell pwend...@gmail.com wrote: Hey Denny, I believe the 2.4 bits are there. The 2.6 bits I had done specially (we haven't merged that into our upstream build script). I'll do it again now for RC2. - Patrick On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote: +1 Tested on a 4-node Mesos cluster in fine-grained and coarse-grained modes. Tim On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com wrote: The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended (they were included in RC1)? On Wed, Apr 8, 2015 at 9:01 AM Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Tested Spark on YARN against Hadoop 2.6. Tom On Wednesday, April 8, 2015 6:15 AM, Sean Owen so...@cloudera.com wrote: Still a +1 from me; same result (except that now of course the UISeleniumSuite test does not fail)
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1 tested ML-related items on Mac OS X On Wed, Apr 8, 2015 at 7:59 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 (non-binding, of course)
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1 (non-binding, of course)
1. Compiled OS X 10.10 (Yosemite) OK. Total time: 14:16 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, MLlib - running as well as comparing results with 1.3.0; pyspark works well with the new IPython 3.0.0 release
2.1. statistics (min, max, mean, Pearson, Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK; Center And Scale OK
2.5. RDD operations OK; State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (MovieLens medium dataset, ~1 M ratings) OK; Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min, max, mean, Pearson, Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (MovieLens medium dataset, ~1 M ratings) OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?
+1 for this feature In our use case, we probably wouldn’t use this feature in production, but it can be useful during prototyping and algorithm development to repeatedly perform the same streaming operation on a fixed, already existing set of files. - jeremyfreeman.net @thefreemanlab On Apr 8, 2015, at 2:51 PM, Emre Sevinc emre.sev...@gmail.com wrote: Tathagata, Thanks for stating your preference for Approach 2. My use case and motivation are similar to the concerns raised by others in SPARK-3276. In previous versions of Spark, e.g. the 1.1.x series, Spark Streaming applications could process the files in an input directory that existed before the streaming application began, and for some projects that we did for our customers, we relied on that feature. Starting from the 1.2.x series, we are limited in this respect to files whose timestamp is not older than 1 minute. The only workaround is to 'touch' those files before starting a streaming application. Moreover, this MIN_REMEMBER_DURATION is set to an arbitrary value of 1 minute, and I don't see any argument why it cannot be set to another arbitrary value (keeping the default value of 1 minute if nothing is set by the user). Putting all this together, my plan is to create a pull request along these lines:
1. Convert private val MIN_REMEMBER_DURATION into private val minRememberDuration (to reflect that it is no longer a constant, in the sense that it can be set via configuration)
2. Set its value using something like getConf("spark.streaming.minRememberDuration", Minutes(1))
3. Document spark.streaming.minRememberDuration in the Spark Streaming Programming Guide
If the above sounds fine, then I'll go on implementing this small change and submit a pull request for fixing SPARK-3276. What do you say?
Kind regards, Emre Sevinç http://www.bigindustries.be/ On Wed, Apr 8, 2015 at 7:16 PM, Tathagata Das t...@databricks.com wrote: Approach 2 is definitely better :) Can you tell us more about the use case why you want to do this? TD
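The pattern Emre proposes — a hard-coded minimum becoming a config lookup with the old constant as the default — can be sketched as follows. This is a minimal illustration, not Spark's actual implementation; the plain dict stands in for SparkConf, and only the key name spark.streaming.minRememberDuration comes from the thread.

```python
# Sketch of "constant becomes configurable with a default" (not Spark code):
# behavior is unchanged unless the user explicitly sets the property.

DEFAULT_MIN_REMEMBER_SECONDS = 60  # the old MIN_REMEMBER_DURATION of 1 minute

def min_remember_duration(conf):
    """Read 'spark.streaming.minRememberDuration' (in seconds) from a plain
    dict standing in for SparkConf, falling back to the old constant."""
    return int(conf.get("spark.streaming.minRememberDuration",
                        DEFAULT_MIN_REMEMBER_SECONDS))

# Key absent: the old default is preserved.
assert min_remember_duration({}) == 60
# Key set: the user's value wins.
assert min_remember_duration({"spark.streaming.minRememberDuration": "300"}) == 300
```

Because the default reproduces the previous constant exactly, this is the backward-compatible shape of Approach 2: no public API change, just one documented property.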