Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Patrick Wendell
Hey Marcelo,

Yes - I agree. That one trickled in just as I was packaging this RC.
However, I still put this out here to allow people to test the
existing fixes, etc.

- Patrick

On Wed, Mar 4, 2015 at 9:26 AM, Marcelo Vanzin van...@cloudera.com wrote:
 I haven't tested the rc2 bits yet, but I'd consider
 https://issues.apache.org/jira/browse/SPARK-6144 a serious regression
 from 1.2 (since it affects existing addFile() functionality if the
 URL is hdfs:...).

 Will test other parts separately.

 On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.0!

 The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 Staging repositories for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1074/
 (published with version '1.3.0')
 https://repository.apache.org/content/repositories/orgapachespark-1075/
 (published with version '1.3.0-rc2')

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.3.0!

 The vote is open until Saturday, March 07, at 04:17 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == How does this compare to RC1 ==
 This patch includes a variety of bug fixes found in RC1.

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running on this release candidate,
 then reporting any regressions.

 If you are happy with this release based on your own testing, give a +1 vote.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




 --
 Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Marcelo Vanzin
I haven't tested the rc2 bits yet, but I'd consider
https://issues.apache.org/jira/browse/SPARK-6144 a serious regression
from 1.2 (since it affects existing addFile() functionality if the
URL is hdfs:...).

Will test other parts separately.

On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.0!

 The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 Staging repositories for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1074/
 (published with version '1.3.0')
 https://repository.apache.org/content/repositories/orgapachespark-1075/
 (published with version '1.3.0-rc2')

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.3.0!

 The vote is open until Saturday, March 07, at 04:17 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == How does this compare to RC1 ==
 This patch includes a variety of bug fixes found in RC1.

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running on this release candidate,
 then reporting any regressions.

 If you are happy with this release based on your own testing, give a +1 vote.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Marcelo Vanzin
-1 (non-binding) because of SPARK-6144.

But aside from that I ran a set of tests on top of standalone and yarn
and things look good.

On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.0!

 The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 Staging repositories for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1074/
 (published with version '1.3.0')
 https://repository.apache.org/content/repositories/orgapachespark-1075/
 (published with version '1.3.0-rc2')

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.3.0!

 The vote is open until Saturday, March 07, at 04:17 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == How does this compare to RC1 ==
 This patch includes a variety of bug fixes found in RC1.

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running on this release candidate,
 then reporting any regressions.

 If you are happy with this release based on your own testing, give a +1 vote.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



short jenkins 7am downtime tomorrow morning (3-5-15)

2015-03-04 Thread shane knapp
the master and workers need some system and package updates, and i'll be
rebooting the machines as well.

this shouldn't take very long to perform, and i expect jenkins to be back
up and building by 9am at the *latest*.

important note:  i will NOT be updating jenkins or any of the plugins
during this maintenance!

as always, please let me know if you have any questions or concerns.

danke shane


Re: enum-like types in Spark

2015-03-04 Thread Michael Armbrust
#4 with a preference for CamelCaseEnums

On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com
wrote:

 another vote for #4
 People are already used to adding () in Java.


 On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote:

  #4 but with MemoryOnly (more scala-like)
 
  http://docs.scala-lang.org/style/naming-conventions.html
 
  Constants, Values, Variable and Methods
 
  Constant names should be in upper camel case. That is, if the member is
   final, immutable and it belongs to a package object or an object, it may be
   considered a constant (similar to Java’s static final members):
  
   object Container {
     val MyConstant = ...
   }
 
 
  2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com:
 
   Hi all,
  
   There are many places where we use enum-like types in Spark, but in
   different ways. Every approach has both pros and cons. I wonder
   whether there should be an “official” approach for enum-like types in
   Spark.
  
   1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)
  
   * All types show up as Enumeration.Value in Java.
  
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
  
   2. Java’s Enum (e.g., SaveMode, IOMode)
  
   * Implementation must be in a Java file.
   * Values doesn’t show up in the ScalaDoc:
  
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
  
   3. Static fields in Java (e.g., TripletFields)
  
   * Implementation must be in a Java file.
   * Doesn’t need “()” in Java code.
   * Values don't show up in the ScalaDoc:
  
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
  
   4. Objects in Scala. (e.g., StorageLevel)
  
   * Needs “()” in Java code.
   * Values show up in both ScalaDoc and JavaDoc:
  
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
  
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
  
   It would be great if we have an “official” approach for this as well
   as the naming convention for enum-like values (“MEMORY_ONLY” or
   “MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any thoughts?
  
   Best,
   Xiangrui
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 



enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
Hi all,

There are many places where we use enum-like types in Spark, but in
different ways. Every approach has both pros and cons. I wonder
whether there should be an “official” approach for enum-like types in
Spark.

1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)

* All types show up as Enumeration.Value in Java.
http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html

2. Java’s Enum (e.g., SaveMode, IOMode)

* Implementation must be in a Java file.
* Values don’t show up in the ScalaDoc:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode

3. Static fields in Java (e.g., TripletFields)

* Implementation must be in a Java file.
* Doesn’t need “()” in Java code.
* Values don't show up in the ScalaDoc:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields

4. Objects in Scala (e.g., StorageLevel)

* Needs “()” in Java code.
* Values show up in both ScalaDoc and JavaDoc:
  
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
  
http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html

It would be great if we had an “official” approach for this, as well as a
naming convention for enum-like values (“MEMORY_ONLY” or “MemoryOnly”).
Personally, I like 4) with “MEMORY_ONLY”. Any thoughts?

Best,
Xiangrui
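
[Editorial sketch, not part of the original message: a minimal illustration of
option 4 in the StorageLevel style, with values held as vals on a companion
object. The type and value names below are invented for illustration.]

// Hypothetical enum-like type in the option-4 style.
class CompressionMode private (val name: String)

object CompressionMode {
  // "MEMORY_ONLY"-style naming:
  val SNAPPY_FAST = new CompressionMode("snappy")
  // "MemoryOnly"-style naming would instead look like: val SnappyFast = ...
}

From Scala this is referenced as CompressionMode.SNAPPY_FAST; from Java the same
value is reached through the generated accessor method, CompressionMode.SNAPPY_FAST(),
which is the "()" noted under option 4, and the value shows up in both ScalaDoc
and JavaDoc.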

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Hey Mingyu,

I think it's broken out separately so we can record the time taken to
serialize the result. Once we've serialized it once, the second
serialization should be really simple since it's just wrapping
something that has already been turned into a byte buffer. Do you see
a specific issue with serializing it twice?

I think you need to have two steps if you want to record the time
taken to serialize the result, since that needs to be sent back to the
driver when the task completes.

- Patrick

On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim m...@palantir.com wrote:
 Hi all,

 It looks like the result of task is serialized twice, once by serializer 
 (I.e. Java/Kryo depending on configuration) and once again by closure 
 serializer (I.e. Java). To link the actual code,

 The first one: 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
 The second one: 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226

 This serializes the value, which is the result of task run twice, which 
 affects things like collect(), takeSample(), and toLocalIterator(). Would it 
 make sense to simply serialize the DirectTaskResult once using the regular 
 serializer (as opposed to closure serializer)? Would it cause problems when 
 the Accumulator values are not Kryo-serializable?

 Alternatively, if we can assume that Accumator values are small, we can 
 closure-serialize those, put the serialized byte array in DirectTaskResult 
 with the raw task result value, and serialize DirectTaskResult.

 What do people think?

 Thanks,
 Mingyu

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-04 Thread Joseph Bradley
another vote for #4
People are already used to adding () in Java.


On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com wrote:

 #4 but with MemoryOnly (more scala-like)

 http://docs.scala-lang.org/style/naming-conventions.html

 Constants, Values, Variable and Methods

 Constant names should be in upper camel case. That is, if the member is
 final, immutable and it belongs to a package object or an object, it may be
  considered a constant (similar to Java’s static final members):

 object Container {
   val MyConstant = ...
 }


 2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com:

  Hi all,
 
  There are many places where we use enum-like types in Spark, but in
  different ways. Every approach has both pros and cons. I wonder
  whether there should be an “official” approach for enum-like types in
  Spark.
 
  1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)
 
  * All types show up as Enumeration.Value in Java.
 
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
 
  2. Java’s Enum (e.g., SaveMode, IOMode)
 
  * Implementation must be in a Java file.
  * Values doesn’t show up in the ScalaDoc:
 
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
 
  3. Static fields in Java (e.g., TripletFields)
 
  * Implementation must be in a Java file.
  * Doesn’t need “()” in Java code.
  * Values don't show up in the ScalaDoc:
 
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
 
  4. Objects in Scala. (e.g., StorageLevel)
 
  * Needs “()” in Java code.
  * Values show up in both ScalaDoc and JavaDoc:
 
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
 
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
 
  It would be great if we have an “official” approach for this as well
  as the naming convention for enum-like values (“MEMORY_ONLY” or
  “MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any thoughts?
 
  Best,
  Xiangrui
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
The concern is really just the runtime overhead and memory footprint of
Java-serializing an already-serialized byte array again. We originally
noticed this when we were using RDD.toLocalIterator() which serializes the
entire 64MB partition. We worked around this issue by kryo-serializing and
snappy-compressing the partition on the executor side before returning it
back to the driver, but this operation just felt redundant.

Your explanation about reporting the time taken makes it clearer why it's
designed this way. Since the byte array for the serialized task result
shouldn't account for the majority of memory footprint anyways, I'm okay
with leaving it as is, then.

Thanks,
Mingyu





On 3/4/15, 5:07 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Mingyu,

I think it's broken out separately so we can record the time taken to
serialize the result. Once we serializing it once, the second
serialization should be really simple since it's just wrapping
something that has already been turned into a byte buffer. Do you see
a specific issue with serializing it twice?

I think you need to have two steps if you want to record the time
taken to serialize the result, since that needs to be sent back to the
driver when the task completes.

- Patrick

On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim m...@palantir.com wrote:
 Hi all,

 It looks like the result of task is serialized twice, once by
serializer (I.e. Java/Kryo depending on configuration) and once again by
closure serializer (I.e. Java). To link the actual code,

 The first one:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
 The second one:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226

 This serializes the value, which is the result of task run twice,
which affects things like collect(), takeSample(), and
toLocalIterator(). Would it make sense to simply serialize the
DirectTaskResult once using the regular serializer (as opposed to
closure serializer)? Would it cause problems when the Accumulator values
are not Kryo-serializable?

 Alternatively, if we can assume that Accumator values are small, we can
closure-serialize those, put the serialized byte array in
DirectTaskResult with the raw task result value, and serialize
DirectTaskResult.

 What do people think?

 Thanks,
 Mingyu


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
Hi all,

It looks like the result of a task is serialized twice: once by the serializer (i.e.
Java/Kryo depending on configuration) and once again by the closure serializer
(i.e. Java). To link the actual code:

The first one: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
The second one: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226

This serializes the “value”, i.e. the result of the task run, twice, which
affects things like collect(), takeSample(), and toLocalIterator(). Would it
make sense to simply serialize the DirectTaskResult once using the regular
“serializer” (as opposed to the closure serializer)? Would it cause problems when
the Accumulator values are not Kryo-serializable?

Alternatively, if we can assume that Accumulator values are small, we could
closure-serialize those, put the serialized byte array in DirectTaskResult along
with the raw task result “value”, and serialize DirectTaskResult.

What do people think?

Thanks,
Mingyu
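
[Editorial sketch, not part of the original message: a self-contained illustration
of the double-serialization pattern described above. The wrapper type and helper
names are invented; plain Java serialization stands in for both serializers here,
whereas in Spark the first pass uses the configured Java/Kryo serializer and only
the second pass is the Java-based closure serializer.]

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

def javaSerialize(obj: AnyRef): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.toByteArray
}

// Hypothetical stand-in for DirectTaskResult: carries the already-serialized
// value plus accumulator updates.
case class WrappedTaskResult(valueBytes: Array[Byte], accumUpdates: Map[Long, Any])

def packageResult(value: AnyRef, accums: Map[Long, Any]): Array[Byte] = {
  val valueBytes = javaSerialize(value)           // pass 1: task result serializer (timed per task)
  val wrapped = WrappedTaskResult(valueBytes, accums)
  javaSerialize(wrapped)                          // pass 2: closure-serialize the thin wrapper
}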


Re: enum-like types in Spark

2015-03-04 Thread Aaron Davidson
I'm cool with #4 as well, but make sure we dictate that the values should
be defined within an object with the same name as the enumeration (like we
do for StorageLevel). Otherwise we may pollute a higher namespace.

e.g. we SHOULD do:

trait StorageLevel
object StorageLevel {
  case object MemoryOnly extends StorageLevel
  case object DiskOnly extends StorageLevel
}
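
[Editorial aside, not part of the original message: if the trait is additionally
marked `sealed`, callers get exhaustiveness checking on matches and the values stay
namespaced under StorageLevel rather than the enclosing package, e.g.]

def describe(level: StorageLevel): String = level match {
  case StorageLevel.MemoryOnly => "keep deserialized in memory"
  case StorageLevel.DiskOnly   => "spill to disk only"
}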

On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust mich...@databricks.com
wrote:

 #4 with a preference for CamelCaseEnums

 On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com
 wrote:

  another vote for #4
  People are already used to adding () in Java.
 
 
  On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com
 wrote:
 
   #4 but with MemoryOnly (more scala-like)
  
   http://docs.scala-lang.org/style/naming-conventions.html
  
   Constants, Values, Variable and Methods
  
   Constant names should be in upper camel case. That is, if the member is
   final, immutable and it belongs to a package object or an object, it may be
   considered a constant (similar to Java’s static final members):
  
   object Container {
     val MyConstant = ...
   }
  
  
   2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com:
  
Hi all,
   
There are many places where we use enum-like types in Spark, but in
different ways. Every approach has both pros and cons. I wonder
whether there should be an “official” approach for enum-like types in
Spark.
   
1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)
   
* All types show up as Enumeration.Value in Java.
   
   
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
   
2. Java’s Enum (e.g., SaveMode, IOMode)
   
* Implementation must be in a Java file.
* Values doesn’t show up in the ScalaDoc:
   
   
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
   
3. Static fields in Java (e.g., TripletFields)
   
* Implementation must be in a Java file.
* Doesn’t need “()” in Java code.
* Values don't show up in the ScalaDoc:
   
   
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
   
4. Objects in Scala. (e.g., StorageLevel)
   
* Needs “()” in Java code.
* Values show up in both ScalaDoc and JavaDoc:
   
   
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
   
   
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
   
It would be great if we have an “official” approach for this as well
as the naming convention for enum-like values (“MEMORY_ONLY” or
“MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any
 thoughts?
   
Best,
Xiangrui
   
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
   
  
 



Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Sean Owen
I think we will have to fix
https://issues.apache.org/jira/browse/SPARK-5143 as well before the
final 1.3.x.

But yes everything else checks out for me, including sigs and hashes
and building the source release.

I have been following JIRA closely and am not aware of other blockers
besides the ones already identified.

On Wed, Mar 4, 2015 at 7:09 PM, Marcelo Vanzin van...@cloudera.com wrote:
 -1 (non-binding) because of SPARK-6144.

 But aside from that I ran a set of tests on top of standalone and yarn
 and things look good.

 On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.0!

 The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 Staging repositories for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1074/
 (published with version '1.3.0')
 https://repository.apache.org/content/repositories/orgapachespark-1075/
 (published with version '1.3.0-rc2')

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/

 Please vote on releasing this package as Apache Spark 1.3.0!

 The vote is open until Saturday, March 07, at 04:17 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == How does this compare to RC1 ==
 This patch includes a variety of bug fixes found in RC1.

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running on this release candidate,
 then reporting any regressions.

 If you are happy with this release based on your own testing, give a +1 vote.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-04 Thread Stephen Boesch
#4 but with MemoryOnly (more scala-like)

http://docs.scala-lang.org/style/naming-conventions.html

Constants, Values, Variable and Methods

Constant names should be in upper camel case. That is, if the member is
final, immutable and it belongs to a package object or an object, it may be
considered a constant (similar to Java’s static final members):

object Container {
  val MyConstant = ...
}


2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com:

 Hi all,

 There are many places where we use enum-like types in Spark, but in
 different ways. Every approach has both pros and cons. I wonder
 whether there should be an “official” approach for enum-like types in
 Spark.

 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)

 * All types show up as Enumeration.Value in Java.

 http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html

 2. Java’s Enum (e.g., SaveMode, IOMode)

 * Implementation must be in a Java file.
 * Values doesn’t show up in the ScalaDoc:

 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode

 3. Static fields in Java (e.g., TripletFields)

 * Implementation must be in a Java file.
 * Doesn’t need “()” in Java code.
 * Values don't show up in the ScalaDoc:

 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields

 4. Objects in Scala. (e.g., StorageLevel)

 * Needs “()” in Java code.
 * Values show up in both ScalaDoc and JavaDoc:

 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$

 http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html

 It would be great if we have an “official” approach for this as well
 as the naming convention for enum-like values (“MEMORY_ONLY” or
 “MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any thoughts?

 Best,
 Xiangrui

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: enum-like types in Spark

2015-03-04 Thread Patrick Wendell
I like #4 as well and agree with Aaron's suggestion.

- Patrick

On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson ilike...@gmail.com wrote:
 I'm cool with #4 as well, but make sure we dictate that the values should
 be defined within an object with the same name as the enumeration (like we
 do for StorageLevel). Otherwise we may pollute a higher namespace.

 e.g. we SHOULD do:

 trait StorageLevel
 object StorageLevel {
   case object MemoryOnly extends StorageLevel
   case object DiskOnly extends StorageLevel
 }

 On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust mich...@databricks.com
 wrote:

 #4 with a preference for CamelCaseEnums

 On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley jos...@databricks.com
 wrote:

  another vote for #4
  People are already used to adding () in Java.
 
 
  On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch java...@gmail.com
 wrote:
 
   #4 but with MemoryOnly (more scala-like)
  
   http://docs.scala-lang.org/style/naming-conventions.html
  
   Constants, Values, Variable and Methods
  
   Constant names should be in upper camel case. That is, if the member is
   final, immutable and it belongs to a package object or an object, it may be
   considered a constant (similar to Java's static final members):
  
   object Container {
     val MyConstant = ...
   }
  
  
   2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com:
  
Hi all,
   
There are many places where we use enum-like types in Spark, but in
different ways. Every approach has both pros and cons. I wonder
whether there should be an official approach for enum-like types in
Spark.
   
1. Scala's Enumeration (e.g., SchedulingMode, WorkerState, etc)
   
* All types show up as Enumeration.Value in Java.
   
   
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
   
2. Java's Enum (e.g., SaveMode, IOMode)
   
* Implementation must be in a Java file.
* Values doesn't show up in the ScalaDoc:
   
   
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
   
3. Static fields in Java (e.g., TripletFields)
   
* Implementation must be in a Java file.
* Doesn't need () in Java code.
* Values don't show up in the ScalaDoc:
   
   
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
   
4. Objects in Scala. (e.g., StorageLevel)
   
* Needs () in Java code.
* Values show up in both ScalaDoc and JavaDoc:
   
   
  
 
 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
   
   
  
 
 http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
   
It would be great if we have an official approach for this as well
as the naming convention for enum-like values (MEMORY_ONLY or
MemoryOnly). Personally, I like 4) with MEMORY_ONLY. Any
 thoughts?
   
Best,
Xiangrui
   
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
   
  
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Fwd: Unable to Read/Write Avro RDD on cluster.

2015-03-04 Thread ๏̯͡๏
I am trying to read an Avro RDD, transform it, and write it back.
I am able to run it fine locally, but when I run it on the cluster I see issues
with Avro.


export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1
export SPARK_YARN_USER_ENV=CLASSPATH=/apache/hadoop/conf
export HADOOP_CONF_DIR=/apache/hadoop/conf
export YARN_CONF_DIR=/apache/hadoop/conf
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
export SPARK_LIBRARY_PATH=/apache/hadoop/lib/native
export SPARK_YARN_USER_ENV=CLASSPATH=/apache/hadoop/conf
export SPARK_YARN_USER_ENV=CLASSPATH=/apache/hadoop/conf
export
SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar
export SPARK_LIBRARY_PATH=/apache/hadoop/lib/native
export YARN_CONF_DIR=/apache/hadoop/conf/

cd $SPARK_HOME

./bin/spark-submit --master yarn-cluster --jars
/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar
--num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores
1  --queue hdmi-spark --class com.company.ep.poc.spark.reporting.SparkApp
/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-02-16 endDate=2015-02-16
epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession
subcommand=successevents
outputdir=/user/dvasthimal/epdatasets/successdetail

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/04 03:20:29 INFO client.ConfiguredRMFailoverProxyProvider: Failing
over to rm2
15/03/04 03:20:30 INFO yarn.Client: Got Cluster metric info from
ApplicationsManager (ASM), number of NodeManagers: 2221
15/03/04 03:20:30 INFO yarn.Client: Queue info ... queueName: hdmi-spark,
queueCurrentCapacity: 0.7162806, queueMaxCapacity: 0.08,
  queueApplicationCount = 7, queueChildQueueCount = 0
15/03/04 03:20:30 INFO yarn.Client: Max mem capabililty of a single
resource in this cluster 16384
15/03/04 03:20:30 INFO yarn.Client: Preparing Local resources
15/03/04 03:20:30 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/03/04 03:20:30 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.


15/03/04 03:20:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token
7780745 for dvasthimal on 10.115.206.112:8020
15/03/04 03:20:46 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark_reporting-1.0-SNAPSHOT.jar
15/03/04 03:20:47 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark-assembly-1.0.2-hadoop2.4.1.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-mapred-1.7.7-hadoop2.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/avro-1.7.7.jar to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-1.7.7.jar
15/03/04 03:20:54 INFO yarn.Client: Setting up the launch environment
15/03/04 03:20:54 INFO yarn.Client: Setting up container launch context
15/03/04 03:20:54 INFO yarn.Client: Command for starting the Spark
ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx4096m,
-Djava.io.tmpdir=$PWD/tmp,
-Dspark.app.name=\com.company.ep.poc.spark.reporting.SparkApp\,
 -Dlog4j.configuration=log4j-spark-container.properties,
org.apache.spark.deploy.yarn.ApplicationMaster, --class,
com.company.ep.poc.spark.reporting.SparkApp, --jar ,
file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar,  --args
 'startDate=2015-02-16'  --args  'endDate=2015-02-16'  --args
 'epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession'  --args
 'subcommand=successevents'  --args
 'outputdir=/user/dvasthimal/epdatasets/successdetail' , --executor-memory,
2048, --executor-cores, 1, --num-executors , 3, 1, LOG_DIR/stdout, 2,
LOG_DIR/stderr)
15/03/04 03:20:54 INFO yarn.Client: Submitting application to ASM
15/03/04 03:20:54 INFO impl.YarnClientImpl: Submitted application
application_1425075571333_61948
15/03/04 03:20:56 INFO yarn.Client: Application report from ASM:
 application identifier: application_1425075571333_61948
 appId: 61948
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: N/A
 appQueue: hdmi-spark
 appMasterRpcPort: -1
 appStartTime: 1425464454263
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED
 appTrackingUrl:
https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
 appUser: dvasthimal
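
[Editorial sketch, not part of the original message: a minimal way to read Avro
records as an RDD with avro-mapred 1.7.x on Spark 1.x, assuming the new-API
AvroKeyInputFormat is the input format in play. The path and field name are
placeholders.]

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

def readAvro(sc: SparkContext, path: String) = {
  val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]](path)
  // Avro's generic records are not Java-serializable, so extract plain fields
  // before shuffling or collecting.
  records.map { case (key, _) => key.datum().get("someField").toString }
}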

Re: ideas for MLlib development

2015-03-04 Thread Robert Dodier
Thanks for your reply, Evan.

 It may make sense to have a more general Gibbs sampling
 framework, but it might be good to have a few desired applications
 in mind (e.g. higher level models that rely on Gibbs) to help API
 design, parallelization strategy, etc.

I think I'm more interested in a general framework which could
be applied to a variety of models, as opposed to an implementation
tailored to a specific model such as LDA. I'm thinking that such
a framework could be used in model exploration, either as an
end in itself or perhaps to identify promising models that could
then be given optimized, custom implementations. This would
be very much in the spirit of existing packages such as BUGS.
In fact, if we were to go down this road, I would propose that
models be specified in the BUGS modeling language -- no need
to reinvent that wheel, I would say.

At a very high level, the API for this framework would specify
methods to compute conditional distributions, marginalizing
as necessary via MCMC. Other operations could include
computing the expected value of a variable or function.
All this is very reminiscent of BUGS, of course.

best,

Robert Dodier
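
[Editorial sketch, not from the original message: one hypothetical shape such a
framework's API could take; every name below is invented purely to illustrate the
idea of conditional distributions plus expectations computed via MCMC.]

trait RandomVariable { def name: String }

trait GibbsModel {
  // Draw one value of `target` from its full conditional, given the current
  // state of all other variables.
  def sampleConditional(target: RandomVariable,
                        state: Map[RandomVariable, Double]): Double

  // Approximate E[f(X)] by averaging f over sweeps of the Gibbs sampler.
  def expectation(f: Map[RandomVariable, Double] => Double,
                  numSamples: Int): Double
}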

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Yeah, it will result in a second serialized copy of the array (costing
some memory). But the computational overhead should be very small. The
absolute worst case here will be when doing a collect() or something
similar that just bundles the entire partition.

- Patrick

On Wed, Mar 4, 2015 at 5:47 PM, Mingyu Kim m...@palantir.com wrote:
 The concern is really just the runtime overhead and memory footprint of
 Java-serializing an already-serialized byte array again. We originally
 noticed this when we were using RDD.toLocalIterator() which serializes the
 entire 64MB partition. We worked around this issue by kryo-serializing and
 snappy-compressing the partition on the executor side before returning it
 back to the driver, but this operation just felt redundant.

 Your explanation about reporting the time taken makes it clearer why it's
 designed this way. Since the byte array for the serialized task result
 shouldn't account for the majority of memory footprint anyways, I'm okay
 with leaving it as is, then.

 Thanks,
 Mingyu





 On 3/4/15, 5:07 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Mingyu,

I think it's broken out separately so we can record the time taken to
serialize the result. Once we serializing it once, the second
serialization should be really simple since it's just wrapping
something that has already been turned into a byte buffer. Do you see
a specific issue with serializing it twice?

I think you need to have two steps if you want to record the time
taken to serialize the result, since that needs to be sent back to the
driver when the task completes.

- Patrick

On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim m...@palantir.com wrote:
 Hi all,

 It looks like the result of task is serialized twice, once by
serializer (I.e. Java/Kryo depending on configuration) and once again by
closure serializer (I.e. Java). To link the actual code,

  The first one:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
  The second one:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226

 This serializes the value, which is the result of task run twice,
which affects things like collect(), takeSample(), and
toLocalIterator(). Would it make sense to simply serialize the
DirectTaskResult once using the regular serializer (as opposed to
closure serializer)? Would it cause problems when the Accumulator values
are not Kryo-serializable?

 Alternatively, if we can assume that Accumator values are small, we can
closure-serialize those, put the serialized byte array in
DirectTaskResult with the raw task result value, and serialize
DirectTaskResult.

 What do people think?

 Thanks,
 Mingyu


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Krishna Sankar
It is the LR over car-data at https://github.com/xsankar/cloaked-ironman.
1.2.0 gives Mean Squared Error = 40.8130551358
1.3.0 gives Mean Squared Error = 105.857603953

I will verify it one more time tomorrow.

Cheers
k/
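
[Editorial sketch, not the benchmark's actual code, which lives in the repository
linked above: the standard MLlib 1.x LinearRegressionWithSGD-plus-MSE pattern being
compared. The file name and column layout are placeholders, and `sc` is assumed to
be an existing SparkContext.]

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val data = sc.textFile("car-data.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

val model = LinearRegressionWithSGD.train(data, 100)

// Mean squared error over the training set; this number is sensitive to SGD
// defaults such as step size and intercept handling.
val mse = data.map { p =>
  val err = model.predict(p.features) - p.label
  err * err
}.mean()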

On Tue, Mar 3, 2015 at 11:28 PM, Xiangrui Meng men...@gmail.com wrote:

 On Tue, Mar 3, 2015 at 11:15 PM, Krishna Sankar ksanka...@gmail.com
 wrote:
  +1 (non-binding, of course)
 
  1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:53 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
  -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
  2. Tested pyspark, mlib - running as well as compare results with 1.1.x 
  1.2.x
  2.1. statistics (min,max,mean,Pearson,Spearman) OK
  2.2. Linear/Ridge/Laso Regression OK
  But MSE has increased from 40.81 to 105.86. Has some refactoring happened
  on SGD/Linear Models ? Or do we have some extra parameters ? or change of
  defaults ?

 Could you share the code you used? I don't remember any changes in
 linear regression. Thanks! -Xiangrui

  2.3. Decision Tree, Naive Bayes OK
  2.4. KMeans OK
 Center And Scale OK
 WSSSE has come down slightly
  2.5. rdd operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
  2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
 Model evaluation/optimization (rank, numIter, lmbda) with
 itertools
  OK
  3. Scala - MLlib
  3.1. statistics (min,max,mean,Pearson,Spearman) OK
  3.2. LinearRegressionWIthSGD OK
  3.3. Decision Tree OK
  3.4. KMeans OK
  3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
  4.0. SQL from Python
  4.1. result = sqlContext.sql(SELECT * from Employees WHERE State =
 'WA')
  OK
 
  Cheers
  k/
 
  On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Please vote on releasing the following candidate as Apache Spark version
  1.3.0!
 
  The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.3.0-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  Staging repositories for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1074/
  (published with version '1.3.0')
  https://repository.apache.org/content/repositories/orgapachespark-1075/
  (published with version '1.3.0-rc2')
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.3.0!
 
  The vote is open until Saturday, March 07, at 04:17 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.3.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == How does this compare to RC1 ==
  This patch includes a variety of bug fixes found in RC1.
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.2 workload and running on this release candidate,
  then reporting any regressions.
 
  If you are happy with this release based on your own testing, give a +1
  vote.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.3 QA period,
  so -1 votes should only occur for significant regressions from 1.2.1.
  Bugs already present in 1.2.X, minor regressions, or bugs related
  to new features will not block this release.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link:
http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx),
I saw that SchemaRDD is designed to be the basis of both Spark Streaming and
Spark SQL.

My question is: what's the typical usage of SchemaRDD in a Spark
Streaming application? Thank you very much!
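
[Editorial sketch, not part of the original message: the common Spark 1.x pattern of
applying SQL to a stream by converting each micro-batch RDD inside foreachRDD. Class
names follow the 1.3 API, where SchemaRDD became DataFrame; the Event case class and
the input stream are assumptions made for illustration.]

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Event(user: String, value: Double)

def registerAndQuery(events: DStream[Event]): Unit = {
  events.foreachRDD { rdd =>
    // In a real application a single SQLContext would be reused across batches.
    val sqlContext = new SQLContext(rdd.sparkContext)
    import sqlContext.implicits._
    // Each micro-batch becomes a DataFrame (a SchemaRDD in pre-1.3 terms) that
    // can be registered as a temporary table and queried with SQL.
    rdd.toDF().registerTempTable("events")
    sqlContext.sql("SELECT user, SUM(value) FROM events GROUP BY user").show()
  }
}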


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Robin East
+1 (subject to comments on ec2 issues below)

machine 1: Macbook Air, OSX 10.10.2 (Yosemite), Java 8
machine 2: iMac, OSX 10.8.4, Java 7

1. mvn clean package -DskipTests  (33min/13min)
2. ran SVM benchmark https://github.com/insidedctm/spark-mllib-benchmark


EC2 issues:

1) Unable to successfully run servers on ec2 using the following command:
./spark-ec2 -k key -i pathtokey -s 2  launch test

Error was:
  Initializing Spark
  ERROR: Unknown Spark Version

This seems to be due to init.sh in 
https://github.com/mesos/spark-ec2/tree/branch-1.3/spark not being updated for 
version 1.3.0. 
Question: Is it expected that this repository is updated in tandem with the 
Spark release?

2) One of my machines fails to authenticate when running the ec2 scripts. Using 
spark-ec2 I get the following error:
  ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL 
routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

It seems this machine has a problem where openssl can’t verify server
certificates. This hasn't been a problem for me before, but Spark 1.3.0 has updated
the boto library it uses for EC2 functionality to 2.34.0. This new version can
verify server certificates when authenticating to AWS servers, and the default
behaviour is to check certificates. Clearly this is not a Spark issue per se;
however, if the problem I have is reasonably widespread, there may be quite a few
issues coming up on the mailing lists. There is a workaround option that could be
put into the spark_ec2.py script to turn off certificate validation if necessary.

 On 4 Mar 2015, at 04:19, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.0!
 
 The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 Staging repositories for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1074/
 (published with version '1.3.0')
 https://repository.apache.org/content/repositories/orgapachespark-1075/
 (published with version '1.3.0-rc2')
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/
 
 Please vote on releasing this package as Apache Spark 1.3.0!
 
 The vote is open until Saturday, March 07, at 04:17 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 == How does this compare to RC1 ==
 This patch includes a variety of bug fixes found in RC1.
 
 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running on this release candidate,
 then reporting any regressions.
 
 If you are happy with this release based on your own testing, give a +1 vote.
 
 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Google Summer of Code - Quick Query

2015-03-04 Thread Ulrich Stärk
Hi Manoj,

this question is best asked on the Spark mailing lists (copied). From a formal
point of view, all that counts is your proposal in Melange once applications start,
but your mentor or the project you wish to contribute to may have additional
requirements.

Cheers,

Uli

On 2015-03-03 08:54, Manoj Kumar wrote:
 Hello,
 
 I am Manoj, a prospective student from the Apache Spark project. I have
 been contributing to Spark and discussing the project idea with my mentor
 for some time now. The tentative project has a number of JIRA's associated
 with it. I would still like to know if it is necessary to create an
 umbrella JIRA with all the other JIRA's  linked to it (and tagged with
 gsoc) or is it sufficient to just upload a proposal when the registration
 opens.
 
 
 Regards
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org