Re: wait time between start master and start slaves
Yeah, from what I remember it was set defensively. I don't know of a good way to check if the master is up, though. I guess we could poll the Master Web UI and see if we get a 200/OK response.

Shivaram

On Fri, Apr 10, 2015 at 8:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

> Check this out (from spark-ec2):
> https://github.com/mesos/spark-ec2/blob/f0a48be1bb5aaeef508619a46065648beb8f1d92/spark-standalone/setup.sh#L26-L33
>
>     # Start Master
>     $BIN_FOLDER/start-master.sh
>
>     # Pause
>     sleep 20
>
>     # Start Workers
>     $BIN_FOLDER/start-slaves.sh
>
> I know this was probably done defensively, but is there a more direct way to know when the master is ready?
>
> Nick
Re: [VOTE] Release Apache Spark 1.3.1 (RC3)
+1 (non-binding)

On Sat, Apr 11, 2015 at 11:48 AM, Krishna Sankar ksanka...@gmail.com wrote:

> +1. All tests OK (same as RC2).
> Cheers, k/
>
> On Fri, Apr 10, 2015 at 11:05 PM, Patrick Wendell pwend...@gmail.com wrote:
> > Please vote on releasing the following candidate as Apache Spark version 1.3.1!
> >
> > The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44
> >
> > The list of fixes present in this release can be found at:
> > http://bit.ly/1C2nVPY
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.1-rc3/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1088/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/
> >
> > The patches on top of RC2 are:
> > [SPARK-6851] [SQL] Create new instance for each converted parquet relation
> > [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
> > [SPARK-6343] Doc driver-worker network reqs
> > [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
> > [SPARK-6781] [SQL] use sqlContext in python shell
> > [SPARK-6753] Clone SparkConf in ShuffleSuite tests
> > [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...
> >
> > Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Tuesday, April 14, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.3.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
Re: wait time between start master and start slaves
So basically, to tell if the master is ready to accept slaves, just poll http://master-node:4040 for an HTTP 200 response?

On Sat, Apr 11, 2015 at 2:42 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

> Yeah, from what I remember it was set defensively. I don't know of a good way to check if the master is up, though. I guess we could poll the Master Web UI and see if we get a 200/OK response.
>
> Shivaram
>
> On Fri, Apr 10, 2015 at 8:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> > ...
Re: Integrating Spark with Ignite File System
Welcome, Dmitriy, to the Spark dev list!

On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org wrote:

> Hello Everyone,
> ...
Re: Integrating Spark with Ignite File System
Hi Dmitriy,

Thanks for the input. As per my previous email, I think it would be good to have a bridge project that, for example, creates an IgniteFS RDD, similar to the JDBC or HDFS ones, from which we can extract blocks and populate RDD partitions. I'll post this proposal on your list.

Thanks,
Devl

On Sat, Apr 11, 2015 at 9:28 AM, Reynold Xin r...@databricks.com wrote:

> Welcome, Dmitriy, to the Spark dev list!
>
> On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org wrote:
> > Hello Everyone,
> > ...
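To make the shape of that proposal concrete, here is a rough sketch of what such a bridge RDD could look like. IgniteFsClient, listBlocks, and readBlock are hypothetical placeholders, not a real Ignite API; only the RDD plumbing (getPartitions/compute) follows Spark's actual interface, as in HadoopRDD or JdbcRDD:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical stand-in for an IgniteFS block API; not a real Ignite interface.
    object IgniteFsClient {
      def listBlocks(path: String): Seq[Long] = Seq.empty             // block ids for a file
      def readBlock(path: String, blockId: Long): Array[Byte] = Array.empty[Byte]
    }

    // One RDD partition per IgniteFS block, so each block becomes one task.
    case class IgniteFsPartition(index: Int, blockId: Long) extends Partition

    class IgniteFsRDD(sc: SparkContext, path: String)
      extends RDD[Array[Byte]](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        IgniteFsClient.listBlocks(path).zipWithIndex.map {
          case (blockId, i) => IgniteFsPartition(i, blockId): Partition
        }.toArray

      override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
        val p = split.asInstanceOf[IgniteFsPartition]
        Iterator(IgniteFsClient.readBlock(path, p.blockId))  // read this partition's block
      }
    }

A natural refinement, as with HadoopRDD, would be to also override getPreferredLocations with the hosts that hold each block, so Spark can schedule tasks for data locality.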
Integrating Spark with Ignite File System
Hello Everyone,

I am one of the committers to Apache Ignite and have noticed some talks on this dev list about integrating the Ignite In-Memory File System (IgniteFS) with Spark. We definitely like the idea.

If you have any questions about Apache Ignite at all, feel free to forward them to the Ignite dev list. We are going to be monitoring this list as well.

Ignite mailing list: dev-subscr...@ignite.incubator.apache.org

Regards,
Dmitriy
Re: [VOTE] Release Apache Spark 1.3.1 (RC3)
+1

On Fri, Apr 10, 2015 at 11:07 PM -0700, Patrick Wendell pwend...@gmail.com wrote:

> Please vote on releasing the following candidate as Apache Spark version 1.3.1!
> ...
Parquet File Binary column statistics error when reusing byte[] among rows
Hi,

Suppose I create a dataRDD which extends RDD[Row], where each row is GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused across rows but holds different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row-group statistics (max/min) for the Binary column are wrong.

Here is the reason: in Parquet, BinaryStatistics keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed in from the row:

    max: Binary -> ByteArrayBackedBinary -> Array[Byte]  (reference-backed)

Therefore, each time Parquet updates the row group's statistics, max/min still refer to the same Array[Byte], which has new content each time. When Parquet finally writes the statistics to the file, the last row's content is saved as both max and min.

It seems to be a Parquet bug, because it is Parquet's responsibility to update statistics correctly, but I'm not quite sure. Should I report it as a bug in the Parquet JIRA?

The Spark JIRA is https://issues.apache.org/jira/browse/SPARK-6859
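For reference, the aliasing can be demonstrated outside Spark in a few lines. This is a minimal sketch, assuming the 2015-era parquet-mr API (parquet.io.api.Binary, where Binary.fromByteArray wraps the array without a defensive copy):

    import parquet.io.api.Binary

    object BinaryAliasingDemo {
      def main(args: Array[String]): Unit = {
        val buf = new Array[Byte](4)

        // Row 1: fill the shared buffer with "aaaa" and wrap it in a Binary.
        "aaaa".getBytes("UTF-8").copyToArray(buf)
        val keptAsMax = Binary.fromByteArray(buf)  // wraps buf; no copy is made

        // Row 2: the writer reuses the very same buffer with new content.
        "zzzz".getBytes("UTF-8").copyToArray(buf)

        // The retained "max" now reflects row 2, which is why statistics
        // written at the end of the row group record the last row's bytes.
        println(keptAsMax.toStringUsingUTF8)  // prints "zzzz", not "aaaa"
      }
    }

If that is indeed what's happening, a defensive copy of each row's bytes on the Spark SQL side, or a copying Binary factory on the Parquet side, would avoid corrupting the statistics.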
Re: [VOTE] Release Apache Spark 1.3.1 (RC3)
+1, same result as last time.

On Sat, Apr 11, 2015 at 7:05 AM, Patrick Wendell pwend...@gmail.com wrote:

> Please vote on releasing the following candidate as Apache Spark version 1.3.1!
> ...
Re: wait time between start master and start slaves
From SparkUI.scala:

    def getUIPort(conf: SparkConf): Int = {
      conf.getInt("spark.ui.port", SparkUI.DEFAULT_PORT)
    }

Better to retrieve the effective UI port before probing.

Cheers

On Sat, Apr 11, 2015 at 2:38 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

> So basically, to tell if the master is ready to accept slaves, just poll http://master-node:4040 for an HTTP 200 response?
>
> On Sat, Apr 11, 2015 at 2:42 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
> > ...
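Putting the thread's suggestion together, a readiness probe along these lines could replace the fixed sleep 20. This is a hypothetical sketch, not part of spark-ec2; note the port caveat: spark.ui.port / SparkUI.DEFAULT_PORT (4040) configures the per-application UI, while the standalone Master's web UI defaults to 8080, so the effective master --webui-port should be used if it was overridden.

    import java.net.{HttpURLConnection, URL}

    object WaitForMaster {
      /** Polls a web UI URL until it returns HTTP 200, or gives up after timeoutMs. */
      def waitForHttpOk(host: String, port: Int, timeoutMs: Long = 60000L): Boolean = {
        val deadline = System.currentTimeMillis() + timeoutMs
        while (System.currentTimeMillis() < deadline) {
          try {
            val conn = new URL(s"http://$host:$port/").openConnection()
              .asInstanceOf[HttpURLConnection]
            conn.setConnectTimeout(2000)
            conn.setReadTimeout(2000)
            val ok = conn.getResponseCode == 200  // UI serves 200 once it is up
            conn.disconnect()
            if (ok) return true
          } catch {
            case _: java.io.IOException => // nothing listening yet; retry below
          }
          Thread.sleep(1000)  // retry interval
        }
        false
      }

      def main(args: Array[String]): Unit = {
        // 8080 assumes the standalone Master web UI default; 4040 only exists
        // once an application has started, so it cannot signal master readiness.
        val ready = waitForHttpOk(host = "master-node", port = 8080)
        println(if (ready) "master is up" else "timed out waiting for master")
      }
    }

Workers could then be started as soon as the probe succeeds, instead of after a fixed 20-second pause.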