[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21511900
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
 val functionClassName = functionInfo.getFunctionClass.getName
 
 if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveSimpleUdf(functionClassName, children)
+  HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdf(functionClassName, children)
+  HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
 } else if (
  
classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass))
 {
-  HiveGenericUdaf(functionClassName, children)
+  HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveUdaf(functionClassName, children)
+  HiveUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdtf(functionClassName, Nil, children)
+  HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, 
children)
 } else {
   sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
 }
   }
 }
 
-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides UDF creation as well as UDF instance serialization
+ * and deserialization across process boundaries.
+ *
+ * We use a class instead of a trait because property variables of a trait
+ * apparently cannot be serialized when mixed into a case class; on the other
+ * hand, we need to intercept the UDF instance ser/de, so has-a is probably
+ * better than is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
--- End diff --

`HiveFunctionWrapper` might be a better name, since essentially this class 
is just used to make Hive function objects serializable, and Spark traditionally 
uses `Wrapper` as a suffix for classes with similar purposes.





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512172
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
 val functionClassName = functionInfo.getFunctionClass.getName
 
 if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveSimpleUdf(functionClassName, children)
+  HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdf(functionClassName, children)
+  HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
 } else if (
  
classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass))
 {
-  HiveGenericUdaf(functionClassName, children)
+  HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveUdaf(functionClassName, children)
+  HiveUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdtf(functionClassName, Nil, children)
+  HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, 
children)
 } else {
   sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
 }
   }
 }
 
-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides UDF creation as well as UDF instance serialization
+ * and deserialization across process boundaries.
+ *
+ * We use a class instead of a trait because property variables of a trait
+ * apparently cannot be serialized when mixed into a case class; on the other
+ * hand, we need to intercept the UDF instance ser/de, so has-a is probably
+ * better than is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
--- End diff --

Yes, that's a good suggestion; I will update it.





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512252
  
--- Diff: 
sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim12.scala ---
@@ -49,6 +49,16 @@ import org.apache.spark.sql.catalyst.types.DecimalType
 private[hive] object HiveShim {
   val version = "0.12.0"
 
+  import org.apache.hadoop.hive.ql.exec.Utilities
+
+  def deserializePlan[UDFType](is: java.io.InputStream, clazz: Class[UDFType]): UDFType = {
+    Utilities.deserializePlan(is).asInstanceOf[UDFType]
+  }
+
+  def serializePlan(function: Any, out: java.io.OutputStream): Unit = {
+    Utilities.serializePlan(function, out)
+  }
+
--- End diff --

Instead of calling `Utilities.deserializePlan`, how about mimicking the 
`Utilities.de/serializeObjectByKryo` methods here? Those two methods are 
private but very simple; the advantage is that we don't need the expensive 
`HiveConf` instantiation here.
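For reference, `Utilities.de/serializeObjectByKryo` essentially just runs the object through Kryo. A minimal sketch of the idea, assuming a Kryo dependency on the classpath (the names mirror Hive's private helpers, but this is not Hive's actual code):

```scala
import java.io.{InputStream, OutputStream}

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

object KryoSerDe {
  // A fresh Kryo instance per call keeps the sketch simple; a real
  // implementation would likely reuse a thread-local instance.
  private def newKryo(): Kryo = {
    val kryo = new Kryo()
    kryo.setClassLoader(Thread.currentThread().getContextClassLoader)
    kryo.setRegistrationRequired(false) // UDF classes are not pre-registered
    kryo
  }

  def serializeObjectByKryo(obj: AnyRef, out: OutputStream): Unit = {
    val output = new Output(out)
    newKryo().writeObject(output, obj)
    output.close() // flushes the buffered bytes to `out`
  }

  def deserializeObjectByKryo[T](in: InputStream, clazz: Class[T]): T = {
    val input = new Input(in)
    val result = newKryo().readObject(input, clazz)
    input.close()
    result
  }
}
```

This avoids constructing a `HiveConf` entirely, since Kryo only needs the target class and a class loader.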





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512352
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
 val functionClassName = functionInfo.getFunctionClass.getName
 
 if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveSimpleUdf(functionClassName, children)
+  HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdf(functionClassName, children)
+  HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
 } else if (
  
classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass))
 {
-  HiveGenericUdaf(functionClassName, children)
+  HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveUdaf(functionClassName, children)
+  HiveUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdtf(functionClassName, Nil, children)
+  HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, 
children)
 } else {
   sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
 }
   }
 }
 
-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides UDF creation as well as UDF instance serialization
+ * and deserialization across process boundaries.
+ *
+ * We use a class instead of a trait because property variables of a trait
+ * apparently cannot be serialized when mixed into a case class; on the other
+ * hand, we need to intercept the UDF instance ser/de, so has-a is probably
+ * better than is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
--- End diff --

Does this relate to the `HiveSimpleUdf` and UDF bridge case you mentioned 
offline? It would be better to leave a comment here; I was confused when reading 
this.





[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3247#issuecomment-66247996
  
  [Test build #24242 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24242/consoleFull)
 for   PR 3247 at commit 
[`a160d1a`](https://github.com/apache/spark/commit/a160d1a9785b70abb8f7d3f63d2a69717727f0ac).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512453
  
--- Diff: 
sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---
@@ -48,6 +48,17 @@ import scala.language.implicitConversions
 private[hive] object HiveShim {
   val version = "0.13.1"
 
+  import org.apache.hadoop.hive.ql.exec.Utilities
+  import org.apache.hadoop.hive.conf.HiveConf
+
+  def deserializePlan[UDFType](is: java.io.InputStream, clazz: Class[UDFType]): UDFType = {
+    Utilities.deserializePlan(is, clazz, new HiveConf())
+  }
+
+  def serializePlan(function: Any, out: java.io.OutputStream): Unit = {
+    Utilities.serializePlan(function, out, new HiveConf())
+  }
+
--- End diff --

Instead of calling `Utilities.deserializePlan`, how about mimicking the 
`Utilities.de/serializeObjectByKryo` methods here? Those two methods are 
private but very simple; the advantage is that we don't need the expensive 
`HiveConf` instantiation here.





[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3247#issuecomment-66248753
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24242/
Test FAILed.





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512634
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
 val functionClassName = functionInfo.getFunctionClass.getName
 
 if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveSimpleUdf(functionClassName, children)
+  HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdf(functionClassName, children)
+  HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
 } else if (
  
classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass))
 {
-  HiveGenericUdaf(functionClassName, children)
+  HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveUdaf(functionClassName, children)
+  HiveUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdtf(functionClassName, Nil, children)
+  HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, 
children)
 } else {
   sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
 }
   }
 }
 
-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides UDF creation as well as UDF instance serialization
+ * and deserialization across process boundaries.
+ *
+ * We use a class instead of a trait because property variables of a trait
+ * apparently cannot be serialized when mixed into a case class; on the other
+ * hand, we need to intercept the UDF instance ser/de, so has-a is probably
+ * better than is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
--- End diff --

Yes, actually I left a comment above.





[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3247#issuecomment-66248750
  
  [Test build #24242 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24242/consoleFull)
 for   PR 3247 at commit 
[`a160d1a`](https://github.com/apache/spark/commit/a160d1a9785b70abb8f7d3f63d2a69717727f0ac).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedFunction(`
  * `trait AggregateFunction `
  * `trait AggregateExpression extends Expression `
  * `abstract class UnaryAggregateExpression extends UnaryExpression with 
AggregateExpression `
  * `case class Min(`
  * `case class Average(child: Expression, distinct: Boolean = false)`
  * `case class Max(child: Expression, distinct: Boolean = false)`
  * `case class Count(child: Expression, distinct: Boolean = false)`
  * `case class Sum(child: Expression, distinct: Boolean = false)`
  * `case class First(child: Expression, distinct: Boolean = false)`
  * `case class Last(child: Expression, distinct: Boolean = false)`
  * `case class MinFunction(aggr: BoundReference, base: Min) extends 
AggregateFunction `
  * `case class AverageFunction(count: BoundReference, sum: BoundReference, 
base: Average)`
  * `case class MaxFunction(aggr: BoundReference, base: Max) extends 
AggregateFunction `
  * `case class CountFunction(aggr: BoundReference, base: Count) extends 
AggregateFunction `
  * `case class SumFunction(aggr: BoundReference, base: Sum) extends 
AggregateFunction `
  * `case class FirstFunction(aggr: BoundReference, base: First) extends 
AggregateFunction `
  * `case class LastFunction(aggr: BoundReference, base: 
AggregateExpression) extends AggregateFunction `
  * `sealed case class AggregateFunctionBind(`
  * `sealed trait Aggregate `
  * `sealed trait PreShuffle extends Aggregate `
  * `sealed trait PostShuffle extends Aggregate `
  * `case class AggregatePreShuffle(`
  * `case class AggregatePostShuffle(`
  * `case class DistinctAggregate(`






[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512701
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
 val functionClassName = functionInfo.getFunctionClass.getName
 
 if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveSimpleUdf(functionClassName, children)
+  HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdf(functionClassName, children)
+  HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
 } else if (
  
classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass))
 {
-  HiveGenericUdaf(functionClassName, children)
+  HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveUdaf(functionClassName, children)
+  HiveUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdtf(functionClassName, Nil, children)
+  HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, 
children)
 } else {
   sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
 }
   }
 }
 
-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides UDF creation as well as UDF instance serialization
+ * and deserialization across process boundaries.
+ *
+ * We use a class instead of a trait because property variables of a trait
+ * apparently cannot be serialized when mixed into a case class; on the other
+ * hand, we need to intercept the UDF instance ser/de, so has-a is probably
+ * better than is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
--- End diff --

Another comment, related to `HiveShim`: I was thinking of moving this class, 
rather than the `de/serializePlan` methods, into the shim layer. Hive 0.12.0 is not 
affected by SPARK-4785, so the version in the 0.12.0 shim can be very simple; we 
only need to handle 0.13.1 there. This also lowers the possibility of breaking 
0.12.0 code paths.

Also, after moving this class into the shim layer, as I've mentioned in 
another comment, instead of relying on `de/serializePlan` we can just mimic 
`Utilities.de/serializeObjectByKryo` in the `read/writeExternal` methods of this 
class.





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512696
  
--- Diff: 
sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---
@@ -48,6 +48,17 @@ import scala.language.implicitConversions
 private[hive] object HiveShim {
   val version = "0.13.1"
 
+  import org.apache.hadoop.hive.ql.exec.Utilities
+  import org.apache.hadoop.hive.conf.HiveConf
+
+  def deserializePlan[UDFType](is: java.io.InputStream, clazz: Class[UDFType]): UDFType = {
+    Utilities.deserializePlan(is, clazz, new HiveConf())
+  }
+
+  def serializePlan(function: Any, out: java.io.OutputStream): Unit = {
+    Utilities.serializePlan(function, out, new HiveConf())
+  }
+
--- End diff --

Yes, initializing the `HiveConf` is quite expensive, particularly within the 
executor. I will fix it.





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512868
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
 val functionClassName = functionInfo.getFunctionClass.getName
 
 if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveSimpleUdf(functionClassName, children)
+  HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdf(functionClassName, children)
+  HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
 } else if (
  
classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass))
 {
-  HiveGenericUdaf(functionClassName, children)
+  HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveUdaf(functionClassName, children)
+  HiveUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdtf(functionClassName, Nil, children)
+  HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, 
children)
 } else {
   sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
 }
   }
 }
 
-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides UDF creation as well as UDF instance serialization
+ * and deserialization across process boundaries.
+ *
+ * We use a class instead of a trait because property variables of a trait
+ * apparently cannot be serialized when mixed into a case class; on the other
+ * hand, we need to intercept the UDF instance ser/de, so has-a is probably
+ * better than is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
+    if (instance != null) {
+      // Some of the UDFs are serializable, but some others are not;
+      // Hive Utilities can handle both cases
+      val baos = new java.io.ByteArrayOutputStream()
+      HiveShim.serializePlan(instance, baos)
+      val functionInBytes = baos.toByteArray
+
+      // output the function bytes
+      out.writeInt(functionInBytes.length)
+      out.write(functionInBytes, 0, functionInBytes.length)
+    }
+  }
 
-  type UDFType
-  type EvaluatedType = Any
+  def readExternal(in: java.io.ObjectInput) {
+    // read the function name
+    functionClassName = in.readUTF()
 
-  def nullable = true
+    if (in.readBoolean()) {
+      // if the flag is set, the instance was not null,
+      // so read the function in bytes
+      val functionInBytesLength = in.readInt()
+      val functionInBytes = new Array[Byte](functionInBytesLength)
+      in.read(functionInBytes, 0, functionInBytesLength)
 
-  lazy val function = createFunction[UDFType]()
+      // deserialize the function object via Hive Utilities
+      instance = HiveShim.deserializePlan(new java.io.ByteArrayInputStream(functionInBytes),
+        getContextOrSparkClassLoader.loadClass(functionClassName))
+    }
+  }
 
-  override def toString = s"$nodeName#$functionClassName(${children.mkString(",")})"
+  def createFunction[UDFType](alwaysCreateNewInstance: Boolean = false) = {
+    if (alwaysCreateNewInstance) {
+      getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
+    } else {
+      if (instance == null) {
+        instance = getContextOrSparkClassLoader.loadClass(functionClassName).newInstance
+      }
+      instance.asInstanceOf[UDFType]
+    }
--- End diff --

Can be simplified to:

```scala
if (!alwaysCreateNewInstance && instance == null) {
  instance = ...
}
instance.asInstanceOf[UDFType]
```



[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3640#discussion_r21512962
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
 val functionClassName = functionInfo.getFunctionClass.getName
 
 if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveSimpleUdf(functionClassName, children)
+  HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdf(functionClassName, children)
+  HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
 } else if (
  
classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass))
 {
-  HiveGenericUdaf(functionClassName, children)
+  HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveUdaf(functionClassName, children)
+  HiveUdaf(new HiveFunctionCache(functionClassName), children)
 } else if 
(classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-  HiveGenericUdtf(functionClassName, Nil, children)
+  HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, 
children)
 } else {
   sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
 }
   }
 }
 
-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides UDF creation as well as UDF instance serialization
+ * and deserialization across process boundaries.
+ *
+ * We use a class instead of a trait because property variables of a trait
+ * apparently cannot be serialized when mixed into a case class; on the other
+ * hand, we need to intercept the UDF instance ser/de, so has-a is probably
+ * better than is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
+    if (instance != null) {
+      // Some of the UDFs are serializable, but some others are not;
+      // Hive Utilities can handle both cases
+      val baos = new java.io.ByteArrayOutputStream()
+      HiveShim.serializePlan(instance, baos)
+      val functionInBytes = baos.toByteArray
+
+      // output the function bytes
+      out.writeInt(functionInBytes.length)
+      out.write(functionInBytes, 0, functionInBytes.length)
+    }
+  }
 
-  type UDFType
-  type EvaluatedType = Any
+  def readExternal(in: java.io.ObjectInput) {
+    // read the function name
+    functionClassName = in.readUTF()
 
-  def nullable = true
+    if (in.readBoolean()) {
+      // if the flag is set, the instance was not null,
+      // so read the function in bytes
+      val functionInBytesLength = in.readInt()
+      val functionInBytes = new Array[Byte](functionInBytesLength)
+      in.read(functionInBytes, 0, functionInBytesLength)
 
-  lazy val function = createFunction[UDFType]()
+      // deserialize the function object via Hive Utilities
+      instance = HiveShim.deserializePlan(new java.io.ByteArrayInputStream(functionInBytes),
+        getContextOrSparkClassLoader.loadClass(functionClassName))
+    }
+  }
 
-  override def toString = s"$nodeName#$functionClassName(${children.mkString(",")})"
+  def createFunction[UDFType](alwaysCreateNewInstance: Boolean = false) = {
+    if (alwaysCreateNewInstance) {
+      getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
+    } else {
+      if (instance == null) {
+        instance = getContextOrSparkClassLoader.loadClass(functionClassName).newInstance
+      }
+      instance.asInstanceOf[UDFType]
+    }
--- End diff --

Actually, how about removing the `alwaysCreateNewInstance` argument (which 
is confusing), and defining a new `HiveSimpleUdfWrapper` that overrides 
`createFunction` and always returns a new instance?
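For illustration, such a subclass might look roughly like this (a sketch assuming the `HiveFunctionWrapper` rename suggested above, a base `createFunction` without the boolean argument, and Spark's `Utils.getContextOrSparkClassLoader` helper; not the actual PR code):

```scala
import org.apache.spark.util.Utils.getContextOrSparkClassLoader

// Hypothetical: always instantiate a fresh UDF object and never cache one,
// so nothing beyond the class name ever needs to be serialized.
class HiveSimpleUdfWrapper(functionClassName: String)
  extends HiveFunctionWrapper(functionClassName) {

  override def createFunction[UDFType](): UDFType =
    getContextOrSparkClassLoader
      .loadClass(functionClassName)
      .newInstance()
      .asInstanceOf[UDFType]
}
```

With this in place, the base class no longer needs the boolean flag at all.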



[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3641#issuecomment-66250028
  
  [Test build #24241 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24241/consoleFull)
 for   PR 3641 at commit 
[`45cbfd0`](https://github.com/apache/spark/commit/45cbfd03444b2d4022dd4d564100a6bac9924d1d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3641#issuecomment-66250034
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24241/
Test PASSed.





[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66251380
  
Thanks a lot for fixing this case! The serialization wrapper class 
makes sense. However, I'd like to suggest some refactoring. A summary of my 
comments above:

1. Rename `HiveFunctionCache` to `HiveFunctionWrapper`.
2. Move `HiveFunctionWrapper` into the shim layer, and keep the Hive 0.12.0 
code path exactly the same as before.

   Considering that Hive 0.12.0 is not affected by this issue, and the 1.2.0 release 
is really close, I'd like to lower the possibility of breaking 0.12.0 code 
paths as much as possible. 

3. Add a `HiveSimpleUdfWrapper`, which inherits from `HiveFunctionWrapper`.

   As you've mentioned in the code, `HiveSimpleUdf` is a special case: it 
shouldn't be serialized, and should always create a fresh object on the executor 
side. Currently this special case complicates the `HiveFunctionWrapper` 
implementation and makes it somewhat confusing. Defining a special subclass for 
`HiveSimpleUdf` helps make `HiveFunctionWrapper` simpler (e.g. no need to 
serialize the boolean flag anymore).

4. Avoid using `Utilities.de/serializePlan` by mimicking 
`Utilities.de/serializeObjectByKryo` in the `read/writeExternal` methods of the 
wrapper class, so that we can remove the expensive `HiveConf` instantiation.

In a word, we can add two classes, `HiveFunctionWrapper` and 
`HiveSimpleUdfWrapper`, to the shim layer, and make sure that the 0.12.0 
version behaves exactly the same as before.
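For point 2, a rough sketch of the 0.12.0 shim version (hedged; the real shim may differ): since 0.12.0 is unaffected, plain Java serialization of the class name is enough and none of the `Externalizable` machinery is required:

```scala
import org.apache.spark.util.Utils.getContextOrSparkClassLoader

// Hypothetical 0.12.0 shim version: only the class name crosses the wire,
// matching the pre-patch behavior of this code path.
case class HiveFunctionWrapper(functionClassName: String) extends Serializable {
  def createFunction[UDFType](): UDFType =
    getContextOrSparkClassLoader
      .loadClass(functionClassName)
      .newInstance()
      .asInstanceOf[UDFType]
}
```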





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/3642

[SPARK-4795][Core] Redesign the primitive type => Writable implicit APIs 
so that they are activated automatically

Try to redesign the primitive type => Writable implicit APIs so that they 
are activated automatically, without breaking compatibility.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-4795

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3642.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3642


commit ac1e9038ce96b423885e8c6fe614162fb80c0dbf
Author: zsxwing zsxw...@gmail.com
Date:   2014-12-08T12:57:11Z

Reorganize the rest 'implicit' methods in SparkContext







[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66252423
  
  [Test build #24243 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24243/consoleFull)
 for   PR 3642 at commit 
[`ac1e903`](https://github.com/apache/spark/commit/ac1e9038ce96b423885e8c6fe614162fb80c0dbf).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66253299
  
  [Test build #24244 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24244/consoleFull)
 for   PR 3642 at commit 
[`3009328`](https://github.com/apache/spark/commit/3009328ef946ce06a89ec8f05a3b831e05bf96e1).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66259222
  
  [Test build #24243 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24243/consoleFull)
 for   PR 3642 at commit 
[`ac1e903`](https://github.com/apache/spark/commit/ac1e9038ce96b423885e8c6fe614162fb80c0dbf).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `m => m.getReturnType().toString != "class java.lang.Object"`






[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66259234
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24243/
Test FAILed.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66260184
  
  [Test build #24244 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24244/consoleFull)
 for   PR 3642 at commit 
[`3009328`](https://github.com/apache/spark/commit/3009328ef946ce06a89ec8f05a3b831e05bf96e1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66260189
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24244/
Test FAILed.





[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...

2014-12-09 Thread gvramana
Github user gvramana commented on a diff in the pull request:

https://github.com/apache/spark/pull/2802#discussion_r21518099
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -172,6 +177,8 @@ private[hive] case class 
HiveGenericUdf(functionClassName: String, children: Seq
 
   override def eval(input: Row): Any = {
 returnInspector // Make sure initialized.
+    if (foldable) return constantReturnValue
--- End diff --

@chenghao-intel This PR cannot be merged separately without CreateArray, 
as percentile_approx accepts only a constant array and fails otherwise. 
I think we can go ahead and merge these changes, as they don't break the build or 
tests and are not directly dependent on #3429 in merge order.
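For context, the kind of query this enables looks roughly like the following (hypothetical table and column names):

```scala
// Hypothetical usage: percentile_approx with an array of percentiles.
// The array argument must fold to a constant, hence the CreateArray dependency.
val quantiles = hiveContext.sql(
  "SELECT percentile_approx(amount, array(0.5, 0.95)) FROM sales")
```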





[GitHub] spark pull request: Replace breezeSquaredDistance

2014-12-09 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/3643

Replace breezeSquaredDistance

This PR replaces the slow `breezeSquaredDistance`. A simple benchmark involving 
4 vectors of 2 dimensions shows:

* breezeSquaredDistance: ~12 secs
* This PR: ~10.5 secs
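A direct-loop implementation along these lines is sketched below (over plain `Array[Double]` values for clarity; the PR itself may operate on MLlib's `Vector` types):

```scala
// Sketch: squared Euclidean distance via a single while loop, avoiding the
// generic vector-operation overhead of Breeze's squaredDistance.
def fastSquaredDistance(v1: Array[Double], v2: Array[Double]): Double = {
  require(v1.length == v2.length, "vector dimensions must match")
  var sum = 0.0
  var i = 0
  while (i < v1.length) {
    val d = v1(i) - v2(i)
    sum += d * d
    i += 1
  }
  sum
}
```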



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 faster_squareddistance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3643.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3643


commit 13669dbbf163bd1940d380388f94ad60237b10c0
Author: Liang-Chi Hsieh vii...@gmail.com
Date:   2014-12-09T10:13:51Z

Replace breezeSquaredDistance.







[GitHub] spark pull request: Replace breezeSquaredDistance

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3643#issuecomment-66262214
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4483][SQL]Optimization about reduce mem...

2014-12-09 Thread tianyi
Github user tianyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/3375#discussion_r21519366
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala
 ---
@@ -68,62 +68,59 @@ case class HashOuterJoin(
   @transient private[this] lazy val DUMMY_LIST = Seq[Row](null)
   @transient private[this] lazy val EMPTY_LIST = Seq.empty[Row]
 
+  @transient private[this] lazy val joinedRow = new JoinedRow()
--- End diff --

to @marmbrus 
I moved `joinedRow`, `rightNullRow`, and `boundCondition` to the beginning of 
`leftOuterIterator` and ran my benchmark again. The result is: 
```
CSV: 9935 ms
CSV: 6989 ms
CSV: 7045 ms
Current Mem Usage:508094480
```
Without the move, the result is: 
```
CSV: 10590 ms
CSV: 7022 ms
CSV: 7086 ms
Current Mem Usage:402470616
```
(I don't know why the memory cost is not around 208145728 as in my last 
test; maybe it's something related to my laptop environment.)
Apparently the memory cost increased 25% while the speed is almost the 
same. Is that acceptable?








[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3642#discussion_r21520954
  
--- Diff: 
graphx/src/test/scala/org/apache/spark/graphx/lib/ShortestPathsSuite.scala ---
@@ -40,7 +40,7 @@ class ShortestPathsSuite extends FunSuite with 
LocalSparkContext {
   val graph = Graph.fromEdgeTuples(edges, 1)
   val landmarks = Seq(1, 4).map(_.toLong)
   val results = ShortestPaths.run(graph, 
landmarks).vertices.collect.map {
-      case (v, spMap) => (v, spMap.mapValues(_.get))
--- End diff --

This is an example of breaking source compatibility: the code happened to use the 
implicit `intToIntWritable`, so `_.get` really means `intToIntWritable(_).get`.
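A hypothetical reduction of that breakage (standalone, not the suite code): with an `Int => IntWritable` implicit in scope, `.get` on an `Int` only typechecks through the conversion, so reorganizing the implicit changes whether this code compiles:

```scala
import scala.language.implicitConversions
import org.apache.hadoop.io.IntWritable

object ImplicitBreakExample extends App {
  // An implicit like the one SparkContext historically exposed.
  implicit def intToIntWritable(i: Int): IntWritable = new IntWritable(i)

  val spMap = Map(1L -> 42)
  // `_.get` compiles only because each Int value is first converted to an
  // IntWritable: `_.get` effectively means `intToIntWritable(_).get`.
  println(spMap.mapValues(_.get)) // Map(1 -> 42)
}
```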





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66268347
  
  [Test build #24245 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24245/consoleFull)
 for   PR 3642 at commit 
[`39343de`](https://github.com/apache/spark/commit/39343ded29a8576388ca98b32559ff8cae1cc5b4).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...

2014-12-09 Thread gvramana
Github user gvramana commented on a diff in the pull request:

https://github.com/apache/spark/pull/2802#discussion_r21524226
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -165,6 +165,11 @@ private[hive] case class 
HiveGenericUdf(functionClassName: String, children: Seq
    isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector]
 
   @transient
+  protected lazy val constantReturnValue = unwrap(
--- End diff --

Yes, I have fixed that.





[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2802#issuecomment-66274239
  
  [Test build #24246 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24246/consoleFull)
 for   PR 2802 at commit 
[`a0182e5`](https://github.com/apache/spark/commit/a0182e54c5ed97bb4b85c895434b4d626671ba58).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66277314
  
  [Test build #24245 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24245/consoleFull)
 for   PR 3642 at commit 
[`39343de`](https://github.com/apache/spark/commit/39343ded29a8576388ca98b32559ff8cae1cc5b4).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-66277323
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24245/
Test PASSed.


[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...

2014-12-09 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3641#issuecomment-66278031
  
LGTM. Are there other instances like this in other scripts? It couldn't hurt 
to append `$` elsewhere where the pattern is clearly meant to match at the end 
of a filename. Worth a look.
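
For illustration, a minimal Scala sketch of the anchoring point (hypothetical 
file names; the scripts themselves are shell, but the regex behavior is the 
same):

```
// Without the trailing `$`, findFirstIn matches ".jar" anywhere in the
// name, so a stray backup file slips through; anchoring restricts the
// match to names that truly end in ".jar".
val names = Seq("spark-assembly.jar", "spark-assembly.jar.bak")

val unanchored = """.*\.jar""".r
val anchored   = """.*\.jar$""".r

names.foreach { n =>
  println(s"$n  unanchored=${unanchored.findFirstIn(n).isDefined}" +
          s"  anchored=${anchored.findFirstIn(n).isDefined}")
}
```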


[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...

2014-12-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3635#discussion_r21526216
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -67,11 +67,29 @@ private[spark] class DiskBlockManager(blockManager: 
BlockManager, conf: SparkCon
 if (subDir == null) {
   subDir = subDirs(dirId).synchronized {
 val old = subDirs(dirId)(subDirId)
-if (old != null) {
+if (old != null && old.exists()) {
   old
 } else {
+  var foundLocalDir = false
+  var tries = 0
   val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
-  newDir.mkdir()
+  while (!foundLocalDir && tries < MAX_DIR_CREATION_ATTEMPTS) {
+tries += 1
+try {
+  if (!newDir.exists()) {
+foundLocalDir = newDir.mkdir()
--- End diff --

Why keep retrying this? what's the case where it fails but then immediately 
succeeds?


[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...

2014-12-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3635#discussion_r21526232
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -67,11 +67,29 @@ private[spark] class DiskBlockManager(blockManager: 
BlockManager, conf: SparkCon
 if (subDir == null) {
   subDir = subDirs(dirId).synchronized {
 val old = subDirs(dirId)(subDirId)
-if (old != null) {
+if (old != null && old.exists()) {
   old
 } else {
+  var foundLocalDir = false
+  var tries = 0
   val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
-  newDir.mkdir()
+  while (!foundLocalDir && tries < MAX_DIR_CREATION_ATTEMPTS) {
+tries += 1
+try {
+  if (!newDir.exists()) {
+foundLocalDir = newDir.mkdir()
+  }
+  else {
+foundLocalDir = true
+  }
+} catch {
+  case e: Exception =>
+logWarning(s"Attempt $tries to create local dir $newDir failed", e)
+}
+  }
+  if (!foundLocalDir) {
+throw new Exception(s"Failed $MAX_DIR_CREATION_ATTEMPTS attempts to create local dir in $newDir.")
--- End diff --

`IOException` is more appropriate.
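
A minimal sketch of the suggested change, reusing the names from the diff above:

```
import java.io.{File, IOException}

// Sketch only: surface the failure as an IOException, which callers of
// disk-related code conventionally expect, instead of a bare Exception.
def ensureDirCreated(foundLocalDir: Boolean, maxAttempts: Int, newDir: File): Unit = {
  if (!foundLocalDir) {
    throw new IOException(s"Failed $maxAttempts attempts to create local dir in $newDir.")
  }
}
```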


[GitHub] spark pull request: [SPARK-2096][SQL] support dot notation on arbi...

2014-12-09 Thread sziep
Github user sziep commented on the pull request:

https://github.com/apache/spark/pull/2405#issuecomment-66280104
  
Are there any plans to merge this soon? This is a pretty useful feature.


[GitHub] spark pull request: [SPARK-2096][SQL] support dot notation on arbi...

2014-12-09 Thread ayoub-benali
Github user ayoub-benali commented on the pull request:

https://github.com/apache/spark/pull/2405#issuecomment-66280190
  
+1


[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2802#issuecomment-66281289
  
  [Test build #24246 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24246/consoleFull)
 for   PR 2802 at commit 
[`a0182e5`](https://github.com/apache/spark/commit/a0182e54c5ed97bb4b85c895434b4d626671ba58).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2802#issuecomment-66281294
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24246/
Test PASSed.


[GitHub] spark pull request: [SPARK-2096][SQL] support dot notation on arbi...

2014-12-09 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/2405#issuecomment-66282781
  
This PR is blocked by https://github.com/apache/spark/pull/2543. I'll 
update the code tomorrow and make it work :)


[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...

2014-12-09 Thread XuTingjun
Github user XuTingjun commented on the pull request:

https://github.com/apache/spark/pull/3635#issuecomment-66283335
  
@srowen, thank you for your suggestion; I have modified the method.


[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r21527840
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.beans.BeanInfo
+
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.mllib.linalg.Vector
+
+/**
+ * :: AlphaComponent ::
+ * Class that represents an instance (data point) for prediction tasks.
+ *
+ * @param label Label to predict
+ * @param features List of features describing this instance
+ * @param weight Instance weight
+ */
+@AlphaComponent
+@BeanInfo
+case class LabeledPoint(label: Double, features: Vector, weight: Double) {
--- End diff --

I mean the former. Yeah, that's probably the downside. Each data element is 
at least an object, and you can't have it reduce to a `double[]` under the hood.

In the second example, I think you'd only ever really want `MixedFeatures` 
as an abstraction. There's no need to think of all `CategoricalFeatures` as a 
special case deserving a unique abstraction.

I suppose if you abstract the entire training example as an object, and 
allow accessors like `getNumericFeature(index: Int)`, 
`getCategoricalFeature(index: Int)` you can still internally optimize the 
storage while exposing a richer object representation. You get the type safety 
and optimization opportunity.

Sure, an `Array[Double]` could easily be translated into one of the more 
elaborate representations above. I suppose I myself wouldn't want to make it 
_too_ easy to not think about types!

Anyway, up to your judgment really. There are arguments several ways here. 
Worth a thought to see if the idea of a bit more abstraction appeals to you.
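
A toy sketch of that richer-object idea (hypothetical class and accessor 
names, not a proposed API):

```
// One packed double[] under the hood, typed accessors on top: the storage
// stays primitive while callers get type-checked access.
class TrainingExample(
    private val values: Array[Double],       // packed storage for all features
    private val categoricalIdx: Set[Int]) {  // slots holding categorical codes

  def getNumericFeature(index: Int): Double = {
    require(!categoricalIdx(index), s"feature $index is categorical")
    values(index)
  }

  def getCategoricalFeature(index: Int): Int = {
    require(categoricalIdx(index), s"feature $index is numeric")
    values(index).toInt  // categories stored as doubles internally
  }
}
```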


[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...

2014-12-09 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3635#discussion_r21527997
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -67,11 +67,20 @@ private[spark] class DiskBlockManager(blockManager: 
BlockManager, conf: SparkCon
 if (subDir == null) {
   subDir = subDirs(dirId).synchronized {
 val old = subDirs(dirId)(subDirId)
-if (old != null) {
+if (old != null  old.exists()) {
   old
 } else {
+  var foundLocalDir = false
--- End diff --

OK, but this variable seems pointless now.


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66286027
  
  [Test build #24247 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24247/consoleFull)
 for   PR 3640 at commit 
[`74466a3`](https://github.com/apache/spark/commit/74466a3517996d837ba9f8acb6b912a82bbbd6df).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66287710
  
  [Test build #24248 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24248/consoleFull)
 for   PR 3640 at commit 
[`8e13756`](https://github.com/apache/spark/commit/8e137565e34e36361faf90670ab8e98743051410).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4483][SQL]Optimization about reduce mem...

2014-12-09 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/3375#discussion_r21529898
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala
 ---
@@ -68,62 +68,59 @@ case class HashOuterJoin(
   @transient private[this] lazy val DUMMY_LIST = Seq[Row](null)
   @transient private[this] lazy val EMPTY_LIST = Seq.empty[Row]
 
+  @transient private[this] lazy val joinedRow = new JoinedRow()
--- End diff --

I think you can try to move the `joinedRow` right after 
https://github.com/apache/spark/pull/3375/files#diff-48230fdc68c8c172d22709ed90f8817dR193, 
NOT inside `def leftOuterIterator` etc.



[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...

2014-12-09 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/2802#discussion_r21530532
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -172,6 +177,8 @@ private[hive] case class 
HiveGenericUdf(functionClassName: String, children: Seq
 
   override def eval(input: Row): Any = {
 returnInspector // Make sure initialized.
+if(foldable) return constantReturnValue
--- End diff --

I've updated #3429; I think this PR can be simpler after #3429 is merged. 
Besides, invoking `foldable` within `eval` is probably too heavy; that check 
is supposed to be eliminated in the `Optimizer`.


[GitHub] spark pull request: [SPARK-4699][SQL] make caseSensitive configura...

2014-12-09 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/3558#issuecomment-66291005
  
That's a good point. Can we make another interface that SqlConf inherits 
from? Or can we move SqlConf into the `Catalyst` module? Any other ideas?
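
For illustration, a minimal sketch of the first option (hypothetical trait 
and member names, not an existing Spark API):

```
// A tiny config interface living in catalyst; the sql module's SqlConf
// could then mix it in, so catalyst code reads settings such as case
// sensitivity without depending on the sql module.
trait CatalystConf {
  def caseSensitiveAnalysis: Boolean
}

// In the sql module (sketch):
// private[sql] trait SQLConf extends CatalystConf { ... }
```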


[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...

2014-12-09 Thread adrian-wang
Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/3641#issuecomment-66291537
  
I have checked those files and found another one at 
https://github.com/adrian-wang/spark/blob/jar/bin/compute-classpath.sh#L111
Do you think I should also change that?


[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2014-12-09 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/3607#discussion_r21531881
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala 
---
@@ -39,23 +39,34 @@ private[spark] class ClientArguments(args: 
Array[String], sparkConf: SparkConf)
   var appName: String = "Spark"
   var priority = 0
 
-  // Additional memory to allocate to containers
-  // For now, use driver's memory overhead as our AM container's memory overhead
-  val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
-math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))
-
-  val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead",
-math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN))
-
   private val isDynamicAllocationEnabled =
 sparkConf.getBoolean("spark.dynamicAllocation.enabled", false)
 
   parseArgs(args.toList)
+
+  val isClusterMode = userClass != null
+
   loadEnvironmentArgs()
   validateArgs()
 
+  // Additional memory to allocate to containers. In different modes, we 
use different configs.
+  val amMemOverheadConf = if (isClusterMode) {
+"spark.yarn.driver.memoryOverhead"
+  } else {
+"spark.yarn.am.memoryOverhead"
--- End diff --

We should make the name of this consistent with whatever naming we decide on 
in https://github.com/apache/spark/pull/3409 (i.e. spark.yarn.clientmode.am...).


[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66292305
  
@jkbradley 

Tests fail again... 
Stab in the dark: it looks like something has changed in the testing 
environment. 

(2) Regular and Robust in the same class

It's possible to implement, but I don't want to turn the class hierarchy 
inside out. It just violates OOP principles. That's like a car inheriting from 
an SUV. 

(4) Float vs. Double and linear algebra operations

I'd prefer (a). I don't like the idea of converting from Array to a Breeze 
matrix and vice versa. 

(5)

Glad to read it. Unfortunately, it looks like I mirrored the spark repo a 
while ago, so I don't have the mllib/features package. Should I mirror the 
spark repo again and send you another pull request? Or is there a way to pull 
from the spark repo? 

Looking forward to your reply. 


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66295835
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24247/
Test PASSed.


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66295827
  
  [Test build #24247 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24247/consoleFull)
 for   PR 3640 at commit 
[`74466a3`](https://github.com/apache/spark/commit/74466a3517996d837ba9f8acb6b912a82bbbd6df).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class HiveFunctionWrapper(var functionClassName: String) extends 
java.io.Serializable `
  * `class HiveFunctionWrapper(var functionClassName: String) extends 
java.io.Externalizable `



[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread chazchandler
Github user chazchandler commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66296113
  
re: (5)

you can add a remote:
`git remote add upstream https://github.com/apache/spark.git`

fetch the latest state:
`git fetch upstream`

you'll see the branch/tag/etc. changes scroll by. Then rebase your branch 
on upstream:
`git rebase upstream/[branch_name_you_want_to_use_as_a_new_base]`


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66297896
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24248/
Test PASSed.


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66297888
  
  [Test build #24248 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24248/consoleFull)
 for   PR 3640 at commit 
[`8e13756`](https://github.com/apache/spark/commit/8e137565e34e36361faf90670ab8e98743051410).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class HiveFunctionWrapper(var functionClassName: String) extends 
java.io.Serializable `
  * `class HiveFunctionWrapper(var functionClassName: String) extends 
java.io.Externalizable `



[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66299321
  
@chazchandler, thank you very much for your quick reply! It did the trick. 

Now I'm a bit confused about the ml/ folder. What's it for? 


[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66306112
  
It seems like something went wrong. I've got multiple compilation errors 
like 

```
[error] 
/home/valerij/contribute/spark/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala:299:
 constructor cannot be instantiated to expected type;
[error]  found   : org.apache.spark.ui.jobs.TaskUIData
[error]  required: org.apache.spark.ui.jobs.UIData.TaskUIData
```

How can I fix it?


[GitHub] spark pull request: [SPARK-4483][SQL]Optimization about reduce mem...

2014-12-09 Thread tianyi
Github user tianyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/3375#discussion_r21538764
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala
 ---
@@ -68,62 +68,59 @@ case class HashOuterJoin(
   @transient private[this] lazy val DUMMY_LIST = Seq[Row](null)
   @transient private[this] lazy val EMPTY_LIST = Seq.empty[Row]
 
+  @transient private[this] lazy val joinedRow = new JoinedRow()
--- End diff --

Thanks, @chenghao-intel, I will try this way tomorrow. My concern is about 
the memory cost. I think we should use as little memory as possible to reduce 
the possibility of an OOM occurring.


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/3644

[SPARK-4798][SQL] A new set of Parquet testing API and test suites

This PR provides a set of Parquet testing APIs (see trait `ParquetTest`) that 
enables developers to write more concise test cases. A new set of Parquet test 
suites built upon this API is added, aiming to replace the old 
`ParquetQuerySuite`. To avoid potential merge conflicts, the old testing code 
is not removed yet. The following classes can be safely removed after most 
Parquet related PRs are handled:

- `ParquetQuerySuite`
- `ParquetTestData`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark parquet-tests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3644.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3644


commit 83edb00c3c88b992356718213ad1199b95187ff6
Author: Cheng Lian l...@databricks.com
Date:   2014-12-09T13:00:09Z

Adds a new set of Parquet test suites




[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3644#discussion_r21538934
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -181,6 +181,10 @@ private[sql] trait SQLConf {
*/
   def getAllConfs: immutable.Map[String, String] = settings.synchronized { 
settings.toMap }
 
+  private[spark] def unsetConf(key: String) {
+settings -= key
+  }
+
--- End diff --

Used in `ParquetTest.withSparkConf`.
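
For context, a sketch of the fixture pattern `unsetConf` enables (assumed 
shape; the actual helper lives in the `ParquetTest` trait added by this PR):

```
// Assumed minimal conf surface, mirroring SQLConf's get/set plus unsetConf.
trait TestConf {
  def getConf(key: String, default: String): String
  def setConf(key: String, value: String): Unit
  def unsetConf(key: String): Unit
}

// Temporarily set the given pairs for the test body, then restore the
// previous values, unsetting keys that did not exist before.
def withConf(conf: TestConf)(pairs: (String, String)*)(body: => Unit): Unit = {
  val previous = pairs.map { case (k, _) => k -> Option(conf.getConf(k, null)) }
  pairs.foreach { case (k, v) => conf.setConf(k, v) }
  try body
  finally previous.foreach {
    case (k, Some(v)) => conf.setConf(k, v)
    case (k, None)    => conf.unsetConf(k)
  }
}
```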


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66309267
  
  [Test build #24249 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24249/consoleFull)
 for   PR 3644 at commit 
[`83edb00`](https://github.com/apache/spark/commit/83edb00c3c88b992356718213ad1199b95187ff6).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread karlhigley
Github user karlhigley commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66309244
  
Re:

> (2) Regular and Robust in the same class
> It's possible to implement, but I don't want to turn the class hierarchy 
> inside out. It just violates OOP principles. That's like a car inheriting 
> from an SUV.

In an OO world, this might lead to an abstract base class (e.g. a vehicle, 
which both the car and the SUV inherit from). Maybe the same thing applies 
here? That would provide a way to apply this suggestion:

> Would it work to have the regular parameters inherit from the robust, 
> where the regular would override certain behavior to effectively fix the 
> value of the noise?

...and still maintain a sensible class hierarchy.
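
In Scala terms, a toy version of that hierarchy might look like this 
(hypothetical names):

```
// The abstract base captures what both variants share; the regular variant
// inherits from the robust one and fixes the noise at zero instead of
// estimating it.
abstract class RobustParameters { def noise: Double }

class EstimatedNoiseParameters(val noise: Double) extends RobustParameters
class RegularParameters extends RobustParameters { val noise = 0.0 }
```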


[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread chazchandler
Github user chazchandler commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66311502
  
@akopich , rebasing can be tricky, especially if you've been off on a 
branch for a while.  `git reflog` can be helpful in getting back to previous 
states if you end up not being able to solve the issue.  however you'll 
probably need to solve the issue to get things merged in anyway.  my guess is 
that something about `org.apache.spark.ui.jobs` may have changed; one option 
might be to rebase on branch-1.1 or -1.2 as a first step to limit the 
possibility of an unstable master affecting your code integration.


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66312060
  
  [Test build #24250 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24250/consoleFull)
 for   PR 3644 at commit 
[`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66314310
  
  [Test build #24250 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24250/consoleFull)
 for   PR 3644 at commit 
[`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait ParquetTest `



[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66314319
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24250/
Test FAILed.


[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-09 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66318526
  
@dbtsai 1) Could you elaborate on what kind of optimizations you did? 
Perhaps they could be applied to the broader MLlib, which would be beneficial. 
2) Do you know the reason why our ANN implementation worked faster than the 
MLOR you shared? This could also be interesting in terms of MLlib 
optimization. 3) Did you mean fitting an n-th degree polynomial instead of a 
linear function? Thanks for the link, it seems very interesting! 


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66319557
  
The current design of this PR is derived from some background knowledge. 
I'd like to provide a brief summary here for future reference.

As mentioned in the PR description, unlike in Hive 0.12.0, in Hive 0.13.1 
UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on 
the driver side and then serialized to executors. However, not all function 
objects are serializable (e.g. `GenericUDF` doesn't implement 
`Serializable`). Hive 0.13.1 solves this issue with a Kryo or XML serializer; 
several utility ser/de methods are provided in class `o.a.h.h.q.e.Utilities` 
for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer 
used here is created by Hive; Spark's Kryo serializer wasn't used because 
there's no available `SparkConf` instance.
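
A simplified, self-contained sketch of the wrapper pattern described above 
(the PR's actual `HiveFunctionWrapper` delegates the byte-level ser/de to 
Hive's Kryo utilities; this version just ships the class name and 
re-instantiates, which only works for stateless functions):

```
import java.io.{Externalizable, ObjectInput, ObjectOutput}

// Make a possibly non-Serializable function object cross process
// boundaries: write a serialized form in writeExternal and rebuild the
// instance in readExternal.
class FunctionWrapper(var functionClassName: String) extends Externalizable {
  def this() = this(null)  // no-arg constructor required by Externalizable

  @transient private var instance: AnyRef = _

  def createFunction[T <: AnyRef](): T = {
    if (instance == null) {
      instance = Class.forName(functionClassName).newInstance().asInstanceOf[AnyRef]
    }
    instance.asInstanceOf[T]
  }

  override def writeExternal(out: ObjectOutput): Unit =
    out.writeUTF(functionClassName)

  override def readExternal(in: ObjectInput): Unit =
    functionClassName = in.readUTF()
}
```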


[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-09 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66320878
  
@jkbradley Thank you! They took some time. 
   - I totally agree with you; I need to perform tests on the original test 
set. It contains fewer attributes, i.e. 778 vs. 784 in mnist8m, so one needs 
to add zeros to make it compatible. 
   - They both did all 40 iterations.


[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66321327
  
@karlhigley, yes, I've heard of abstract classes. Though I see no way to 
employ that concept here. 


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66321392
  
  [Test build #24249 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24249/consoleFull)
 for   PR 3644 at commit 
[`83edb00`](https://github.com/apache/spark/commit/83edb00c3c88b992356718213ad1199b95187ff6).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait ParquetTest `



[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66321407
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24249/
Test PASSed.


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66324190
  
Although it passed Jenkins, the first failure is rather weird. It seems 
that partitions collected via `SchemaRDD.collect()` can sometimes be out of 
order.


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66324222
  
retest this please


[GitHub] spark pull request: [SPARK-4799] Use IP address instead of local h...

2014-12-09 Thread smola
GitHub user smola opened a pull request:

https://github.com/apache/spark/pull/3645

[SPARK-4799] Use IP address instead of local hostname in ConnectionManager

See https://issues.apache.org/jira/browse/SPARK-4799



Spark fails when a node's hostname is not resolvable by other nodes.

See an example trace:

```
14/12/09 17:02:41 ERROR SendingConnection: Error connecting to 
27e434cf36ac:35093
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
at 
org.apache.spark.network.SendingConnection.connect(Connection.scala:299)
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:278)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
```

The relevant code is here:

https://github.com/apache/spark/blob/bcb5cdad614d4fce43725dfec3ce88172d2f8c11/core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala#L170

```
val id = new ConnectionManagerId(Utils.localHostName, 
serverChannel.socket.getLocalPort)
```

This piece of code should use the host IP, via Utils.localIpAddress, or a 
method that acknowledges user settings (e.g. `SPARK_LOCAL_IP`). Since I cannot 
think of a use case for using the hostname here, I'm creating a PR with the 
former solution, but if you think the latter is better, I'm willing to create 
a new PR with a more elaborate fix.
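
As a hedged sketch, the more elaborate fix could look roughly like this (the 
env-var handling below is an assumption, not the PR's code):

```
import java.net.InetAddress

// Prefer an explicitly configured SPARK_LOCAL_IP; otherwise fall back to a
// locally resolved address rather than the bare hostname.
def localBindAddress(): String =
  sys.env.getOrElse("SPARK_LOCAL_IP", InetAddress.getLocalHost.getHostAddress)
```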


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smola/spark SPARK-4799

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3645.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3645


commit 45d0356241b852c575fb3b47a20c072c6ef3d7bf
Author: Santiago M. Mola sm...@stratio.com
Date:   2014-11-25T15:02:14Z

[SPARK-4799] Use IP address instead of local hostname in ConnectionManager.




[GitHub] spark pull request: [SPARK-4799] Use IP address instead of local h...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3645#issuecomment-66324983
  
Can one of the admins verify this patch?


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66325520
  
  [Test build #24251 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24251/consoleFull)
 for   PR 3644 at commit 
[`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66325635
  
@chazchandler, thank you very much for your help. I shouldn't have rebased 
on master. Rebasing on 1.2 was successful. 


[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-09 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-66331985
  
I'd agree that cosine similarity is preferred. I can't really think of a case 
where I've *not* used cosine sim for a similar-items or similar-users 
computation. Of course, it could potentially be added as an option (whether to 
use cosine sim, the default, or the plain dot product).
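
For reference, a minimal sketch of the cosine option on latent factor vectors 
(illustrative only, not the PR's code):

```
// Cosine normalizes the dot product by both norms, so items with large
// factor norms don't dominate the ranking the way a raw dot product does.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  var dot = 0.0; var na = 0.0; var nb = 0.0
  var i = 0
  while (i < a.length) {
    dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i)
    i += 1
  }
  dot / (math.sqrt(na) * math.sqrt(nb))
}
```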


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3640


[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...

2014-12-09 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3640#issuecomment-66332833
  
Thanks a lot guys for digging into this!  Merged to master and branch 1.2


[GitHub] spark pull request: [SPARK-874] adding a --wait flag

2014-12-09 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/3567#issuecomment-66333686
  
Jenkins, test this please. LGTM pending tests.


[GitHub] spark pull request: [SPARK-874] adding a --wait flag

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3567#issuecomment-66334417
  
  [Test build #24252 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24252/consoleFull) for PR 3567 at commit [`d05c5bb`](https://github.com/apache/spark/commit/d05c5bb952fc3f29f6f89db2c5a7dba4da7a5d4f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4791] [sql] Infer schema from case clas...

2014-12-09 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/3646

[SPARK-4791] [sql] Infer schema from case class with multiple constructors

Modified ScalaReflection.schemaFor to take the primary constructor of a Product 
when there are multiple constructors. Added a test to the suite which failed 
before this change and passes now.

Needed for [https://github.com/apache/spark/pull/3637]

CC: @marmbrus
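
For illustration, a hypothetical case class of the shape this patch targets (not the actual test added here), assuming access to catalyst's `ScalaReflection`:

```scala
import org.apache.spark.sql.catalyst.ScalaReflection

// A Product with an auxiliary constructor in addition to the primary one.
case class Person(name: String, age: Int) {
  def this(name: String) = this(name, 0) // auxiliary constructor
}

// With this patch, schemaFor picks the primary constructor (name, age)
// instead of choosing ambiguously among the constructors.
val schema = ScalaReflection.schemaFor[Person]
```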

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark sql-reflection

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3646.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3646


commit 796b2e4ffc562154c0511dc660c9cd4b3f8ac07c
Author: Joseph K. Bradley jos...@databricks.com
Date:   2014-12-09T18:35:28Z

Modified ScalaReflection.schemaFor to take primary constructor of Product 
when there are multiple constructors.  Added test to suite which failed before 
but works now.







[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-09 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66336110
  
@avulanov 

1. I did the same optimization for MLlib in [my recent 
PRs](https://github.com/apache/spark/commits/master?author=dbtsai).

* Accessing the values in a dense/sparse vector directly is very slow unless 
you first take a local reference to the underlying primitive array, because of 
the extra dereference on every access. See #3577 and #3435; there is a bytecode 
analysis of this issue in #3435. A sketch of the pattern appears after this list.
* Breeze's foreachActive is very slow, so I implemented a 4x faster version 
in #3288. My experience is that using Breeze in a critical code path calls for 
caution.

2. I haven't checked out your ANN implementation yet, but I will look today. 
I'll send you our optimized gradient computation code for MLOR. It will be 
interesting to see the new benchmark compared with the one you tested.

3. See page 27 of Prof. CJ Lin's slides: 
http://www.csie.ntu.edu.tw/~cjlin/talks/SFmeetup.pdf It's just doing feature 
expansion by mapping the data into a higher-dimensional space.
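
As a hedged illustration of the local-reference pattern from the first bullet (a sketch, not the actual code from #3577/#3435):

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// Hoist the backing primitive arrays into local vals before the hot loop,
// instead of going through v(i), which pays an extra dereference per access.
def dot(x: Vector, y: Array[Double]): Double = x match {
  case dv: DenseVector =>
    val values = dv.values            // local reference, fetched once
    var sum = 0.0
    var i = 0
    while (i < values.length) {       // while loop avoids Range boxing overhead
      sum += values(i) * y(i)
      i += 1
    }
    sum
  case sv: SparseVector =>
    val indices = sv.indices          // local references, fetched once
    val values = sv.values
    var sum = 0.0
    var k = 0
    while (k < indices.length) {
      sum += values(k) * y(indices(k))
      k += 1
    }
    sum
}
```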





[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66336150
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24251/
Test PASSed.





[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66336134
  
  [Test build #24251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24251/consoleFull) for PR 3644 at commit [`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `trait ParquetTest`






[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.

2014-12-09 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3215#issuecomment-66336371
  
We actually don't need to wait on the RC to merge this since this is only 
going into master. I'll take a quick look and will likely merge it after that 
since this is pretty straightforward.





[GitHub] spark pull request: [SPARK-4791] [sql] Infer schema from case clas...

2014-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3646#issuecomment-66336585
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24253/
Test FAILed.





[GitHub] spark pull request: [SPARK-4791] [sql] Infer schema from case clas...

2014-12-09 Thread KirthiRaman
Github user KirthiRaman commented on the pull request:

https://github.com/apache/spark/pull/3646#issuecomment-66336768
  
unsubscribe






[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.

2014-12-09 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3215#discussion_r21552599
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala ---
@@ -178,21 +178,25 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf)
 
   private def getUsageMessage(unknownParam: List[String] = null): String = {
     val message = if (unknownParam != null) s"Unknown/unsupported param $unknownParam\n" else ""
-    message +
-      "Usage: org.apache.spark.deploy.yarn.Client [options] \n" +
-      "Options:\n" +
-      "  --jar JAR_PATH             Path to your application's JAR file (required in yarn-cluster mode)\n" +
-      "  --class CLASS_NAME         Name of your application's main class (required)\n" +
-      "  --arg ARG                  Argument to be passed to your application's main class.\n" +
-      "                             Multiple invocations are possible, each will be passed in order.\n" +
-      "  --num-executors NUM        Number of executors to start (Default: 2)\n" +
-      "  --executor-cores NUM       Number of cores for the executors (Default: 1).\n" +
-      "  --driver-memory MEM        Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb)\n" +
-      "  --executor-memory MEM      Memory per executor (e.g. 1000M, 2G) (Default: 1G)\n" +
-      "  --name NAME                The name of your application (Default: Spark)\n" +
-      "  --queue QUEUE              The hadoop queue to use for allocation requests (Default: 'default')\n" +
-      "  --addJars jars             Comma separated list of local jars that want SparkContext.addJar to work with.\n" +
-      "  --files files              Comma separated list of files to be distributed with the job.\n" +
-      "  --archives archives        Comma separated list of archives to be distributed with the job."
+    message + """
+      |Usage: org.apache.spark.deploy.yarn.Client [options]
+      |Options:
+      |  --jar JAR_PATH           Path to your application's JAR file (required in yarn-cluster
+      |                           mode)
+      |  --class CLASS_NAME       Name of your application's main class (required)
+      |  --arg ARG                Argument to be passed to your application's main class.
+      |                           Multiple invocations are possible, each will be passed in order.
+      |  --num-executors NUM      Number of executors to start (Default: 2)
+      |  --executor-cores NUM     Number of cores for the executors (Default: 1).
+      |  --driver-memory MEM      Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb)
+      |  --executor-memory MEM    Memory per executor (e.g. 1000M, 2G) (Default: 1G)
+      |  --name NAME              The name of your application (Default: Spark)
+      |  --queue QUEUE            The hadoop queue to use for allocation requests (Default:
+      |                           'default')
+      |  --addJars jars           Comma separated list of local jars that want SparkContext.addJar
+      |                           to work with.
+      |  --files files            Comma separated list of files to be distributed with the job.
+      |  --archives archives      Comma separated list of archives to be distributed with the job.
+      """
--- End diff --

do you need to do `.stripMargin` here?
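
For reference, a small illustration of what `.stripMargin` does to a `|`-margined multi-line string (not part of the patch):

```scala
val usage =
  """
  |Usage: org.apache.spark.deploy.yarn.Client [options]
  |Options:
  """.stripMargin
// Without .stripMargin, the leading whitespace and '|' characters remain in
// the string; with it, each line starts right after its '|' margin.
```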





[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.

2014-12-09 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3215#discussion_r21552652
  
--- Diff: 
yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClientImpl.scala 
---
@@ -1,118 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.deploy.yarn
-
-import scala.collection.{Map, Set}
-import java.net.URI
-
-import org.apache.hadoop.net.NetUtils
-import org.apache.hadoop.yarn.api._
-import org.apache.hadoop.yarn.api.records._
-import org.apache.hadoop.yarn.api.protocolrecords._
-import org.apache.hadoop.yarn.conf.YarnConfiguration
-import org.apache.hadoop.yarn.ipc.YarnRPC
-import org.apache.hadoop.yarn.util.{ConverterUtils, Records}
-
-import org.apache.spark.{Logging, SecurityManager, SparkConf}
-import org.apache.spark.scheduler.SplitInfo
-import org.apache.spark.util.Utils
-
-/**
- * YarnRMClient implementation for the Yarn alpha API.
- */
-private class YarnRMClientImpl(args: ApplicationMasterArguments) extends 
YarnRMClient with Logging {
-
-  private var rpc: YarnRPC = null
-  private var resourceManager: AMRMProtocol = _
-  private var uiHistoryAddress: String = _
-  private var registered: Boolean = false
-
-  override def register(
-  conf: YarnConfiguration,
-  sparkConf: SparkConf,
-  preferredNodeLocations: Map[String, Set[SplitInfo]],
-  uiAddress: String,
-  uiHistoryAddress: String,
-  securityMgr: SecurityManager) = {
-this.rpc = YarnRPC.create(conf)
-this.uiHistoryAddress = uiHistoryAddress
-
-synchronized {
-  resourceManager = registerWithResourceManager(conf)
-  registerApplicationMaster(uiAddress)
-  registered = true
-}
-
-new YarnAllocationHandler(conf, sparkConf, resourceManager, 
getAttemptId(), args,
-  preferredNodeLocations, securityMgr)
-  }
-
-  override def getAttemptId() = {
-val envs = System.getenv()
-val containerIdString = 
envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV)
-val containerId = ConverterUtils.toContainerId(containerIdString)
-val appAttemptId = containerId.getApplicationAttemptId()
-appAttemptId
-  }
-
-  override def unregister(status: FinalApplicationStatus, diagnostics: String = "") = synchronized {
-    if (registered) {
-      val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
-        .asInstanceOf[FinishApplicationMasterRequest]
-      finishReq.setAppAttemptId(getAttemptId())
-      finishReq.setFinishApplicationStatus(status)
-      finishReq.setDiagnostics(diagnostics)
-      finishReq.setTrackingUrl(uiHistoryAddress)
-      resourceManager.finishApplicationMaster(finishReq)
-    }
-  }
-
-  override def getAmIpFilterParams(conf: YarnConfiguration, proxyBase: String) = {
-    val proxy = YarnConfiguration.getProxyHostAndPort(conf)
-    val parts = proxy.split(":")
-    val uriBase = "http://" + proxy + proxyBase
-    Map("PROXY_HOST" -> parts(0), "PROXY_URI_BASE" -> uriBase)
-  }
-
-  override def getMaxRegAttempts(conf: YarnConfiguration) =
-conf.getInt(YarnConfiguration.RM_AM_MAX_RETRIES, 
YarnConfiguration.DEFAULT_RM_AM_MAX_RETRIES)
-
-  private def registerWithResourceManager(conf: YarnConfiguration): AMRMProtocol = {
-    val rmAddress = NetUtils.createSocketAddr(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS,
-      YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS))
-    logInfo("Connecting to ResourceManager at " + rmAddress)
-    rpc.getProxy(classOf[AMRMProtocol], rmAddress, conf).asInstanceOf[AMRMProtocol]
-  }
-
-  private def registerApplicationMaster(uiAddress: String): RegisterApplicationMasterResponse = {
-    val appMasterRequest = Records.newRecord(classOf[RegisterApplicationMasterRequest])
-      .asInstanceOf[RegisterApplicationMasterRequest]
-

[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.

2014-12-09 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3215#issuecomment-66337338
  
Hey this LGTM. I'm just gonna merge this. My comments are minor enough that 
we can slip them in later.





[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.

2014-12-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3215





[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2014-12-09 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3607#issuecomment-66338560
  
By the way, you will need to rebase onto master since we just removed support 
for yarn-alpha.





[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...

2014-12-09 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3607#discussion_r21553460
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala 
---
@@ -39,23 +39,34 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf)
   var appName: String = "Spark"
   var priority = 0
 
-  // Additional memory to allocate to containers
-  // For now, use driver's memory overhead as our AM container's memory overhead
-  val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
-    math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))
-
-  val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead",
-    math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN))
-
   private val isDynamicAllocationEnabled =
     sparkConf.getBoolean("spark.dynamicAllocation.enabled", false)
 
   parseArgs(args.toList)
+
+  val isClusterMode = userClass != null
+
   loadEnvironmentArgs()
   validateArgs()
 
+  // Additional memory to allocate to containers. In different modes, we use different configs.
+  val amMemOverheadConf = if (isClusterMode) {
+    "spark.yarn.driver.memoryOverhead"
+  } else {
+    "spark.yarn.am.memoryOverhead"
--- End diff --

Hm... I agree that adding `clientmode` is clearer, but it's also incredibly 
verbose. I think if we document clearly that this only applies in client mode, 
the user won't arbitrarily try to set it. I'll comment on #3409 about this 
as well.





[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-12-09 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3409#issuecomment-66339357
  
Hey, regarding the naming, I would actually prefer `spark.yarn.am.*` over 
`spark.yarn.clientmode.am.*`. Although I agree that having `clientmode` in 
there is clearer, it's incredibly verbose and inconsistent with the other 
similar configs we have (e.g. `spark.driver.extraJavaOptions`, 
`spark.yarn.executor.memoryOverhead`). If we want to discourage users from 
setting these in cluster mode, the better approach is to log a warning or even 
throw an exception when that happens, instead of cramming `clientmode` into a 
config name that we can't ever change again; a sketch follows below.
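
A hedged sketch of that safeguard (illustrative only, not a patch; `isClusterMode`, `sparkConf`, and `logWarning` are assumed to be in scope as in `ClientArguments`):

```scala
// Warn when a client-mode-only config is set in cluster mode, rather than
// encoding the deploy mode in the config name itself.
val amOnlyConfs = Seq("spark.yarn.am.memoryOverhead", "spark.yarn.am.extraJavaOptions")
if (isClusterMode) {
  for (key <- amOnlyConfs if sparkConf.contains(key)) {
    logWarning(s"$key is only honored in yarn-client mode; " +
      "use the corresponding spark.yarn.driver.* / spark.driver.* config instead.")
  }
}
```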




