Re: wait time between start master and start slaves

2015-04-11 Thread Shivaram Venkataraman
Yeah from what I remember it was set defensively. I don't know of a good
way to check if the master is up though. I guess we could poll the Master
Web UI and see if we get a 200/ok response

Shivaram

On Fri, Apr 10, 2015 at 8:24 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Check this out
 
 https://github.com/mesos/spark-ec2/blob/f0a48be1bb5aaeef508619a46065648beb8f1d92/spark-standalone/setup.sh#L26-L33
 
 (from spark-ec2):

 # Start Master
 $BIN_FOLDER/start-master.sh
 # Pause
 sleep 20
 # Start Workers
 $BIN_FOLDER/start-slaves.sh

 I know this was probably done defensively, but is there a more direct way
 to know when the master is ready?

 Nick



Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-11 Thread Denny Lee
+1 (non-binding)


On Sat, Apr 11, 2015 at 11:48 AM Krishna Sankar ksanka...@gmail.com wrote:

 +1. All tests OK (same as RC2)
 Cheers
 k/

 On Fri, Apr 10, 2015 at 11:05 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.3.1!
 
  The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44
 
  The list of fixes present in this release can be found at:
  http://bit.ly/1C2nVPY
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.3.1-rc3/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1088/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/
 
  The patches on top of RC2 are:
  [SPARK-6851] [SQL] Create new instance for each converted parquet
 relation
  [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
  [SPARK-6343] Doc driver-worker network reqs
  [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
  [SPARK-6781] [SQL] use sqlContext in python shell
  [SPARK-6753] Clone SparkConf in ShuffleSuite tests
  [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not
 needed...
 
  Please vote on releasing this package as Apache Spark 1.3.1!
 
  The vote is open until Tuesday, April 14, at 07:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.3.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
 
 



Re: wait time between start master and start slaves

2015-04-11 Thread Nicholas Chammas
So basically, to tell if the master is ready to accept slaves, just poll
http://master-node:4040 for an HTTP 200 response?

On Sat, Apr 11, 2015 at 2:42 PM Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 Yeah from what I remember it was set defensively. I don't know of a good
 way to check if the master is up though. I guess we could poll the Master
 Web UI and see if we get a 200/ok response

 Shivaram

 On Fri, Apr 10, 2015 at 8:24 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Check this out
 
 https://github.com/mesos/spark-ec2/blob/f0a48be1bb5aaeef508619a46065648beb8f1d92/spark-standalone/setup.sh#L26-L33
 
 (from spark-ec2):

 # Start Master
 $BIN_FOLDER/start-master.sh
 # Pause
 sleep 20
 # Start Workers
 $BIN_FOLDER/start-slaves.sh

 I know this was probably done defensively, but is there a more direct way
 to know when the master is ready?

 Nick




Re: Integrating Spark with Ignite File System

2015-04-11 Thread Reynold Xin
Welcome, Dmitriy, to the Spark dev list!


On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org
wrote:

 Hello Everyone,

 I am one of the committers to Apache Ignite and have noticed some talks on
 this dev list about integrating Ignite In-Memory File System (IgniteFS)
 with Spark. We definitely like the idea. If you have any questions about
 Apache Ignite at all, feel free to forward them to the Ignite dev list. We
 are going to be monitoring this list as well.

 Ignite mailing list: dev-subscr...@ignite.incubator.apache.org

 Regards,
 Dmitriy



Re: Integrating Spark with Ignite File System

2015-04-11 Thread Devl Devel
Hi Dmitriy,

Thanks for the input. As per my previous email, I think it would be good to
have a bridge project that, for example, creates an IgniteFS RDD, similar to
the JDBC or HDFS ones, from which we can extract blocks and populate RDD
partitions. I'll post this proposal on your list; a rough sketch of what I
mean is below.
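
Roughly the shape I have in mind (a sketch only: listBlocks and readBlock
are placeholders for whatever block-level API IgniteFS actually exposes,
not real Ignite calls, though the Spark RDD API shown is real):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One RDD partition per IgniteFS block, mirroring how HadoopRDD maps
// partitions to HDFS splits. blockId is a placeholder identifier.
case class IgniteBlockPartition(index: Int, blockId: String) extends Partition

class IgniteFsRDD(
    sc: SparkContext,
    path: String,
    listBlocks: String => Seq[String],
    readBlock: String => Iterator[Array[Byte]])
  extends RDD[Array[Byte]](sc, Nil) {

  // Enumerate the file's blocks once and wrap each in a partition.
  override def getPartitions: Array[Partition] =
    listBlocks(path).zipWithIndex
      .map { case (id, i) => IgniteBlockPartition(i, id) }
      .toArray

  // Each task streams the bytes of exactly one block.
  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] =
    readBlock(split.asInstanceOf[IgniteBlockPartition].blockId)
}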

Thanks
Devl



On Sat, Apr 11, 2015 at 9:28 AM, Reynold Xin r...@databricks.com wrote:

 Welcome, Dmitriy, to the Spark dev list!


 On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org
 wrote:

  Hello Everyone,
 
  I am one of the committers to Apache Ignite and have noticed some talks
 on
  this dev list about integrating Ignite In-Memory File System (IgniteFS)
  with Spark. We definitely like the idea. If you have any questions about
  Apache Ignite at all, feel free to forward them to the Ignite dev list.
 We
  are going to be monitoring this list as well.
 
  Ignite mailing list: dev-subscr...@ignite.incubator.apache.org
 
  Regards,
  Dmitriy
 



Integrating Spark with Ignite File System

2015-04-11 Thread Dmitriy Setrakyan
Hello Everyone,

I am one of the committers to Apache Ignite and have noticed some talks on
this dev list about integrating Ignite In-Memory File System (IgniteFS)
with Spark. We definitely like the idea. If you have any questions about
Apache Ignite at all, feel free to forward them to the Ignite dev list. We
are going to be monitoring this list as well.

Ignite mailing list: dev-subscr...@ignite.incubator.apache.org

Regards,
Dmitriy


Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-11 Thread Reynold Xin
+1

On Fri, Apr 10, 2015 at 11:07 PM -0700, Patrick Wendell pwend...@gmail.com 
wrote:

Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1088/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/

The patches on top of RC2 are:
[SPARK-6851] [SQL] Create new instance for each converted parquet relation
[SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
[SPARK-6343] Doc driver-worker network reqs
[SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
[SPARK-6781] [SQL] use sqlContext in python shell
[SPARK-6753] Clone SparkConf in ShuffleSuite tests
[SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Tuesday, April 14, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/


Parquet file Binary column statistics error when reusing byte[] among rows

2015-04-11 Thread Yijie Shen
Hi,

Suppose I create a dataRDD which extends RDD[Row], where each row is a
GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
reused among rows but holds different content each time. When I convert it
to a DataFrame and save it as a Parquet file, the file's row group
statistics (max & min) for the Binary column are wrong.



Here is the reason: in Parquet, BinaryStatistics just keeps max & min as
parquet.io.api.Binary references, and Spark SQL generates a new Binary
backed by the same Array[Byte] passed in from the row:

max: Binary --reference--> ByteArrayBackedBinary --backed by--> Array[Byte]

Therefore, each time Parquet updates the row group's statistics, max & min
still refer to the same Array[Byte], whose content changes with each row.
When Parquet finally writes the statistics to the file, the last row's
content is saved as both max & min.
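
A self-contained sketch of the aliasing (plain Scala, no real Parquet
classes; BinaryStat is a stand-in I made up for BinaryStatistics, not its
actual API):

class BinaryStat {
  var max: Array[Byte] = _
  def update(value: Array[Byte]): Unit = {
    // Bug pattern: keep a reference to the caller's (reused) buffer.
    // A defensive copy, max = value.clone(), would avoid the problem.
    if (max == null || new String(value) > new String(max)) max = value
  }
}

object ReusedBufferDemo extends App {
  val stat = new BinaryStat
  val buf = new Array[Byte](1)      // one buffer reused across "rows"
  for (c <- Seq('z', 'a')) { buf(0) = c.toByte; stat.update(buf) }
  println(new String(stat.max))     // prints "a", but the true max is "z"
}

With the clone() in place the demo prints "z", because the recorded max no
longer aliases the mutable buffer.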



This seems to be a Parquet bug, since it's Parquet's responsibility to
update statistics correctly, but I'm not quite sure. Should I report it as
a bug in the Parquet JIRA?


The Spark JIRA is https://issues.apache.org/jira/browse/SPARK-6859


Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-11 Thread Sean Owen
+1 same result as last time.

On Sat, Apr 11, 2015 at 7:05 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!

 The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44

 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1088/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/

 The patches on top of RC2 are:
 [SPARK-6851] [SQL] Create new instance for each converted parquet relation
 [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
 [SPARK-6343] Doc driver-worker network reqs
 [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
 [SPARK-6781] [SQL] use sqlContext in python shell
 [SPARK-6753] Clone SparkConf in ShuffleSuite tests
 [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...

 Please vote on releasing this package as Apache Spark 1.3.1!

 The vote is open until Tuesday, April 14, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/




Re: wait time between start master and start slaves

2015-04-11 Thread Ted Yu
From SparkUI.scala:

  def getUIPort(conf: SparkConf): Int = {
    conf.getInt("spark.ui.port", SparkUI.DEFAULT_PORT)
  }

Better to retrieve the effective UI port before probing.
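
For example (a minimal sketch, not the spark-ec2 script itself; host, port
and retry counts are whatever the deployment uses):

import java.net.{HttpURLConnection, URL}

// Poll the master Web UI until it answers HTTP 200 or we run out of attempts.
def waitForMaster(host: String, port: Int, attempts: Int = 30): Boolean =
  (1 to attempts).exists { _ =>
    val up =
      try {
        val conn = new URL(s"http://$host:$port/")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setConnectTimeout(2000)
        conn.setReadTimeout(2000)
        conn.getResponseCode == 200
      } catch { case _: java.io.IOException => false }
    if (!up) Thread.sleep(1000)   // back off before the next probe
    up
  }

start-slaves.sh would then run once waitForMaster(...) returns true,
instead of after a fixed sleep 20.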

Cheers

On Sat, Apr 11, 2015 at 2:38 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 So basically, to tell if the master is ready to accept slaves, just poll
 http://master-node:4040 for an HTTP 200 response?

 On Sat, Apr 11, 2015 at 2:42 PM Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

  Yeah from what I remember it was set defensively. I don't know of a good
  way to check if the master is up though. I guess we could poll the Master
  Web UI and see if we get a 200/ok response
 
  Shivaram
 
  On Fri, Apr 10, 2015 at 8:24 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  Check this out
  
 
 https://github.com/mesos/spark-ec2/blob/f0a48be1bb5aaeef508619a46065648beb8f1d92/spark-standalone/setup.sh#L26-L33
  
  (from spark-ec2):
 
  # Start Master
  $BIN_FOLDER/start-master.sh
  # Pause
  sleep 20
  # Start Workers
  $BIN_FOLDER/start-slaves.sh
 
  I know this was probably done defensively, but is there a more direct
 way
  to know when the master is ready?
 
  Nick