[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21511900

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
     val functionClassName = functionInfo.getFunctionClass.getName

     if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveSimpleUdf(functionClassName, children)
+      HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdf(functionClassName, children)
+      HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
     } else if (
         classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdaf(functionClassName, children)
+      HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveUdaf(functionClassName, children)
+      HiveUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdtf(functionClassName, Nil, children)
+      HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, children)
     } else {
       sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
     }
   }
 }

-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides the UDF creation and also the UDF instance serialization and
+ * deserialization across the process boundary.
+ *
+ * We use a class instead of a trait, as it seems property variables of a trait cannot be
+ * serialized when bundled with a case class; on the other hand, we need to intercept the UDF
+ * instance ser/de, and Has-a is probably better than Is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
--- End diff --

`HiveFunctionWrapper` might be a better name, since essentially this class is just used to make Hive function objects serializable. And traditionally Spark uses `Wrapper` as a suffix for classes with similar purposes.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512172

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
     val functionClassName = functionInfo.getFunctionClass.getName

     if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveSimpleUdf(functionClassName, children)
+      HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdf(functionClassName, children)
+      HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
     } else if (
         classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdaf(functionClassName, children)
+      HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveUdaf(functionClassName, children)
+      HiveUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdtf(functionClassName, Nil, children)
+      HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, children)
     } else {
       sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
     }
   }
 }

-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides the UDF creation and also the UDF instance serialization and
+ * deserialization across the process boundary.
+ *
+ * We use a class instead of a trait, as it seems property variables of a trait cannot be
+ * serialized when bundled with a case class; on the other hand, we need to intercept the UDF
+ * instance ser/de, and Has-a is probably better than Is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
--- End diff --

Yes, that's a good suggestion, will update it.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512252

--- Diff: sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim12.scala ---
@@ -49,6 +49,16 @@ import org.apache.spark.sql.catalyst.types.DecimalType
 private[hive] object HiveShim {
   val version = "0.12.0"

+  import org.apache.hadoop.hive.ql.exec.Utilities
+
+  def deserializePlan[UDFType](is: java.io.InputStream, clazz: Class[UDFType]): UDFType = {
+    Utilities.deserializePlan(is).asInstanceOf[UDFType]
+  }
+
+  def serializePlan(function: Any, out: java.io.OutputStream): Unit = {
+    Utilities.serializePlan(function, out)
+  }
+
--- End diff --

Instead of calling `Utilities.deserializePlan`, how about mimicking the `Utilities.de/serializeObjectByKryo` methods here? Those two functions are private but very simple; the advantage is that we don't need the expensive `HiveConf` instantiation here.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512352

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
     val functionClassName = functionInfo.getFunctionClass.getName

     if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveSimpleUdf(functionClassName, children)
+      HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdf(functionClassName, children)
+      HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
     } else if (
         classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdaf(functionClassName, children)
+      HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveUdaf(functionClassName, children)
+      HiveUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdtf(functionClassName, Nil, children)
+      HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, children)
     } else {
       sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
     }
   }
 }

-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides the UDF creation and also the UDF instance serialization and
+ * deserialization across the process boundary.
+ *
+ * We use a class instead of a trait, as it seems property variables of a trait cannot be
+ * serialized when bundled with a case class; on the other hand, we need to intercept the UDF
+ * instance ser/de, and Has-a is probably better than Is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
--- End diff --

Does this relate to the `HiveSimpleUdf` and UDF bridge case you mentioned offline? It would be better to leave a comment here; I was confused when reading this.
[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3247#issuecomment-66247996

[Test build #24242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24242/consoleFull) for PR 3247 at commit [`a160d1a`](https://github.com/apache/spark/commit/a160d1a9785b70abb8f7d3f63d2a69717727f0ac).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512453

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---
@@ -48,6 +48,17 @@ import scala.language.implicitConversions
 private[hive] object HiveShim {
   val version = "0.13.1"

+  import org.apache.hadoop.hive.ql.exec.Utilities
+  import org.apache.hadoop.hive.conf.HiveConf
+
+  def deserializePlan[UDFType](is: java.io.InputStream, clazz: Class[UDFType]): UDFType = {
+    Utilities.deserializePlan(is, clazz, new HiveConf())
+  }
+
+  def serializePlan(function: Any, out: java.io.OutputStream): Unit = {
+    Utilities.serializePlan(function, out, new HiveConf())
+  }
+
--- End diff --

Instead of calling `Utilities.deserializePlan`, how about mimicking the `Utilities.de/serializeObjectByKryo` methods here? Those two functions are private but very simple; the advantage is that we don't need the expensive `HiveConf` instantiation here.
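For reference, a minimal sketch of what mimicking those private helpers could look like. This is an illustrative assumption, not Hive's actual code: Hive's `Utilities.de/serializeObjectByKryo` additionally use a per-thread, specially configured Kryo instance, which a real port should replicate; the method names below simply mirror the ones under discussion.

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

// A single Kryo instance is not thread-safe; Hive keeps one per thread
// (Utilities.runtimeSerializationKryo), which a real implementation should do too.
private val kryo = new Kryo()

// Write the UDF object graph directly with Kryo -- no HiveConf required.
def serializeObjectByKryo(obj: Any, out: java.io.OutputStream): Unit = {
  val output = new Output(out)
  kryo.writeObject(output, obj)
  output.close()
}

// Read the object back as the expected class, e.g. the UDF class loaded by name.
def deserializeObjectByKryo[T](in: java.io.InputStream, clazz: Class[T]): T = {
  val input = new Input(in)
  val obj = kryo.readObject(input, clazz)
  input.close()
  obj
}
```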
[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3247#issuecomment-66248753

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24242/
Test FAILed.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512634

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
     val functionClassName = functionInfo.getFunctionClass.getName

     if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveSimpleUdf(functionClassName, children)
+      HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdf(functionClassName, children)
+      HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
     } else if (
         classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdaf(functionClassName, children)
+      HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveUdaf(functionClassName, children)
+      HiveUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdtf(functionClassName, Nil, children)
+      HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, children)
     } else {
       sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
     }
   }
 }

-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides the UDF creation and also the UDF instance serialization and
+ * deserialization across the process boundary.
+ *
+ * We use a class instead of a trait, as it seems property variables of a trait cannot be
+ * serialized when bundled with a case class; on the other hand, we need to intercept the UDF
+ * instance ser/de, and Has-a is probably better than Is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
--- End diff --

Yes, actually I left a comment above.
[GitHub] spark pull request: [SPARK-4233] [SQL] WIP:Simplify the UDAF API (...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3247#issuecomment-66248750

[Test build #24242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24242/consoleFull) for PR 3247 at commit [`a160d1a`](https://github.com/apache/spark/commit/a160d1a9785b70abb8f7d3f63d2a69717727f0ac).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedFunction(`
  * `trait AggregateFunction`
  * `trait AggregateExpression extends Expression`
  * `abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression`
  * `case class Min(`
  * `case class Average(child: Expression, distinct: Boolean = false)`
  * `case class Max(child: Expression, distinct: Boolean = false)`
  * `case class Count(child: Expression, distinct: Boolean = false)`
  * `case class Sum(child: Expression, distinct: Boolean = false)`
  * `case class First(child: Expression, distinct: Boolean = false)`
  * `case class Last(child: Expression, distinct: Boolean = false)`
  * `case class MinFunction(aggr: BoundReference, base: Min) extends AggregateFunction`
  * `case class AverageFunction(count: BoundReference, sum: BoundReference, base: Average)`
  * `case class MaxFunction(aggr: BoundReference, base: Max) extends AggregateFunction`
  * `case class CountFunction(aggr: BoundReference, base: Count) extends AggregateFunction`
  * `case class SumFunction(aggr: BoundReference, base: Sum) extends AggregateFunction`
  * `case class FirstFunction(aggr: BoundReference, base: First) extends AggregateFunction`
  * `case class LastFunction(aggr: BoundReference, base: AggregateExpression) extends AggregateFunction`
  * `sealed case class AggregateFunctionBind(`
  * `sealed trait Aggregate`
  * `sealed trait PreShuffle extends Aggregate`
  * `sealed trait PostShuffle extends Aggregate`
  * `case class AggregatePreShuffle(`
  * `case class AggregatePostShuffle(`
  * `case class DistinctAggregate(`
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512701

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
     val functionClassName = functionInfo.getFunctionClass.getName

     if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveSimpleUdf(functionClassName, children)
+      HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdf(functionClassName, children)
+      HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
     } else if (
         classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdaf(functionClassName, children)
+      HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveUdaf(functionClassName, children)
+      HiveUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdtf(functionClassName, Nil, children)
+      HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, children)
     } else {
       sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
     }
   }
 }

-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides the UDF creation and also the UDF instance serialization and
+ * deserialization across the process boundary.
+ *
+ * We use a class instead of a trait, as it seems property variables of a trait cannot be
+ * serialized when bundled with a case class; on the other hand, we need to intercept the UDF
+ * instance ser/de, and Has-a is probably better than Is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
--- End diff --

Another comment, related to `HiveShim`. I was thinking of moving this class, rather than the `de/serializePlan` methods, into the shim layer. Hive 0.12.0 is not affected by SPARK-4785, thus the version in the 0.12.0 shim can be very simple; we only need to handle 0.13.1 there. This also lowers the possibility of breaking 0.12.0 code paths. In addition, after moving this class into the shim layer, as I've mentioned in another comment, instead of relying on `de/serializePlan` we can just mimic `Utilities.de/serializeObjectByKryo` in the `read/writeExternal` methods of this class.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512696

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---
@@ -48,6 +48,17 @@ import scala.language.implicitConversions
 private[hive] object HiveShim {
   val version = "0.13.1"

+  import org.apache.hadoop.hive.ql.exec.Utilities
+  import org.apache.hadoop.hive.conf.HiveConf
+
+  def deserializePlan[UDFType](is: java.io.InputStream, clazz: Class[UDFType]): UDFType = {
+    Utilities.deserializePlan(is, clazz, new HiveConf())
+  }
+
+  def serializePlan(function: Any, out: java.io.OutputStream): Unit = {
+    Utilities.serializePlan(function, out, new HiveConf())
+  }
+
--- End diff --

Yes, initializing the HiveConf is quite expensive, particularly within the executor. I will fix it.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512868

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
     val functionClassName = functionInfo.getFunctionClass.getName

     if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveSimpleUdf(functionClassName, children)
+      HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdf(functionClassName, children)
+      HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
     } else if (
         classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdaf(functionClassName, children)
+      HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveUdaf(functionClassName, children)
+      HiveUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdtf(functionClassName, Nil, children)
+      HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, children)
     } else {
       sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
     }
   }
 }

-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides the UDF creation and also the UDF instance serialization and
+ * deserialization across the process boundary.
+ *
+ * We use a class instead of a trait, as it seems property variables of a trait cannot be
+ * serialized when bundled with a case class; on the other hand, we need to intercept the UDF
+ * instance ser/de, and Has-a is probably better than Is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
+    if (instance != null) {
+      // Some of the UDFs are serializable, but some others are not
+      // Hive Utilities can handle both cases
+      val baos = new java.io.ByteArrayOutputStream()
+      HiveShim.serializePlan(instance, baos)
+      val functionInBytes = baos.toByteArray
+
+      // output the function bytes
+      out.writeInt(functionInBytes.length)
+      out.write(functionInBytes, 0, functionInBytes.length)
+    }
+  }

-  type UDFType
-  type EvaluatedType = Any
+  def readExternal(in: java.io.ObjectInput) {
+    // read the function name
+    functionClassName = in.readUTF()

-  def nullable = true
+    if (in.readBoolean()) {
+      // if the instance is not null
+      // read the function in bytes
+      val functionInBytesLength = in.readInt()
+      val functionInBytes = new Array[Byte](functionInBytesLength)
+      in.read(functionInBytes, 0, functionInBytesLength)

-  lazy val function = createFunction[UDFType]()
+      // deserialize the function object via Hive Utilities
+      instance = HiveShim.deserializePlan(new java.io.ByteArrayInputStream(functionInBytes),
+        getContextOrSparkClassLoader.loadClass(functionClassName))
+    }
+  }

-  override def toString = s"$nodeName#$functionClassName(${children.mkString(",")})"
+  def createFunction[UDFType](alwaysCreateNewInstance: Boolean = false) = {
+    if (alwaysCreateNewInstance) {
+      getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
+    } else {
+      if (instance == null) {
+        instance = getContextOrSparkClassLoader.loadClass(functionClassName).newInstance
+      }
+      instance.asInstanceOf[UDFType]
+    }
--- End diff --

Can be simplified to:

```scala
if (!alwaysCreateNewInstance && instance == null) {
  instance = ...
}
instance.asInstanceOf[UDFType]
```
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3640#discussion_r21512962

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -54,47 +54,95 @@ private[hive] abstract class HiveFunctionRegistry
     val functionClassName = functionInfo.getFunctionClass.getName

     if (classOf[UDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveSimpleUdf(functionClassName, children)
+      HiveSimpleUdf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdf(functionClassName, children)
+      HiveGenericUdf(new HiveFunctionCache(functionClassName), children)
     } else if (
         classOf[AbstractGenericUDAFResolver].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdaf(functionClassName, children)
+      HiveGenericUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[UDAF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveUdaf(functionClassName, children)
+      HiveUdaf(new HiveFunctionCache(functionClassName), children)
     } else if (classOf[GenericUDTF].isAssignableFrom(functionInfo.getFunctionClass)) {
-      HiveGenericUdtf(functionClassName, Nil, children)
+      HiveGenericUdtf(new HiveFunctionCache(functionClassName), Nil, children)
     } else {
       sys.error(s"No handler for udf ${functionInfo.getFunctionClass}")
     }
   }
 }

-private[hive] trait HiveFunctionFactory {
-  val functionClassName: String
-
-  def createFunction[UDFType]() =
-    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
-}
-
-private[hive] abstract class HiveUdf extends Expression with Logging with HiveFunctionFactory {
-  self: Product =>
+/**
+ * This class provides the UDF creation and also the UDF instance serialization and
+ * deserialization across the process boundary.
+ *
+ * We use a class instead of a trait, as it seems property variables of a trait cannot be
+ * serialized when bundled with a case class; on the other hand, we need to intercept the UDF
+ * instance ser/de, and Has-a is probably better than Is-a here.
+ * @param functionClassName UDF class name
+ */
+class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable {
+  // for Serialization
+  def this() = this(null)
+
+  private var instance: Any = null
+
+  def writeExternal(out: java.io.ObjectOutput) {
+    // output the function name
+    out.writeUTF(functionClassName)
+
+    // Write a flag if instance is null or not
+    out.writeBoolean(instance != null)
+    if (instance != null) {
+      // Some of the UDFs are serializable, but some others are not
+      // Hive Utilities can handle both cases
+      val baos = new java.io.ByteArrayOutputStream()
+      HiveShim.serializePlan(instance, baos)
+      val functionInBytes = baos.toByteArray
+
+      // output the function bytes
+      out.writeInt(functionInBytes.length)
+      out.write(functionInBytes, 0, functionInBytes.length)
+    }
+  }

-  type UDFType
-  type EvaluatedType = Any
+  def readExternal(in: java.io.ObjectInput) {
+    // read the function name
+    functionClassName = in.readUTF()

-  def nullable = true
+    if (in.readBoolean()) {
+      // if the instance is not null
+      // read the function in bytes
+      val functionInBytesLength = in.readInt()
+      val functionInBytes = new Array[Byte](functionInBytesLength)
+      in.read(functionInBytes, 0, functionInBytesLength)

-  lazy val function = createFunction[UDFType]()
+      // deserialize the function object via Hive Utilities
+      instance = HiveShim.deserializePlan(new java.io.ByteArrayInputStream(functionInBytes),
+        getContextOrSparkClassLoader.loadClass(functionClassName))
+    }
+  }

-  override def toString = s"$nodeName#$functionClassName(${children.mkString(",")})"
+  def createFunction[UDFType](alwaysCreateNewInstance: Boolean = false) = {
+    if (alwaysCreateNewInstance) {
+      getContextOrSparkClassLoader.loadClass(functionClassName).newInstance.asInstanceOf[UDFType]
+    } else {
+      if (instance == null) {
+        instance = getContextOrSparkClassLoader.loadClass(functionClassName).newInstance
+      }
+      instance.asInstanceOf[UDFType]
+    }
--- End diff --

Actually, how about removing the `alwaysCreateNewInstance` argument (which is confusing), and defining a new `HiveSimpleUdfWrapper` that overrides `createFunction` and always returns a new instance?
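A rough sketch of how that split might look, using the `HiveFunctionWrapper` name suggested earlier. This is a hypothetical outline, not the PR's code: only the caching-versus-fresh-instance distinction is shown, and the 0.13.1 variant would additionally ser/de `instance` in `write/readExternal` as in the diff above.

```scala
import org.apache.spark.util.Utils.getContextOrSparkClassLoader

// Default wrapper: lazily creates, then caches, a single function instance.
class HiveFunctionWrapper(var functionClassName: String) extends java.io.Externalizable {
  def this() = this(null) // no-arg constructor required for serialization

  @transient private var instance: AnyRef = null

  def createFunction[UDFType](): UDFType = {
    if (instance == null) {
      instance = getContextOrSparkClassLoader
        .loadClass(functionClassName).newInstance().asInstanceOf[AnyRef]
    }
    instance.asInstanceOf[UDFType]
  }

  def writeExternal(out: java.io.ObjectOutput): Unit = out.writeUTF(functionClassName)
  def readExternal(in: java.io.ObjectInput): Unit = functionClassName = in.readUTF()
}

// Simple (bridged) UDFs must not reuse a cached/serialized instance:
// always instantiate a fresh object from the class name.
class HiveSimpleUdfWrapper(name: String) extends HiveFunctionWrapper(name) {
  override def createFunction[UDFType](): UDFType =
    getContextOrSparkClassLoader.loadClass(functionClassName).newInstance().asInstanceOf[UDFType]
}
```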
[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3641#issuecomment-66250028

[Test build #24241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24241/consoleFull) for PR 3641 at commit [`45cbfd0`](https://github.com/apache/spark/commit/45cbfd03444b2d4022dd4d564100a6bac9924d1d).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3641#issuecomment-66250034

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24241/
Test PASSed.
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66251380

Thanks a lot for fixing this case! The serialization wrapper class makes sense. However, I'd like to make some refactoring. A summary of my comments above:

1. Rename `HiveFunctionCache` to `HiveFunctionWrapper`.
2. Move `HiveFunctionWrapper` into the shim layer, and keep the Hive 0.12.0 code path exactly the same as before. Considering Hive 0.12.0 is not affected by this issue, and the 1.2.0 release is really close, I'd like to lower the possibility of breaking 0.12.0 code paths as much as possible.
3. Add a `HiveSimpleUdfWrapper`, which inherits from `HiveFunctionWrapper`. As you've mentioned in the code, `HiveSimpleUdf` is a special case: it shouldn't be serialized, and should always create a fresh object on the executor side. Currently this special case complicates the `HiveFunctionWrapper` implementation and makes it somewhat confusing. Defining a special subclass for `HiveSimpleUdf` helps make `HiveFunctionWrapper` simpler (e.g. no need to serialize the boolean flag any more).
4. Avoid using `Utilities.de/serializePlan` by mimicking `Utilities.de/serializeObjectByKryo` in the `read/writeExternal` methods of the wrapper class, so that we can remove the expensive `HiveConf` instantiation.

In a word, we can add two classes, `HiveFunctionWrapper` and `HiveSimpleUdfWrapper`, into the shim layer, and make sure that the 0.12.0 version behaves exactly the same as before.
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/3642

[SPARK-4795][Core] Redesign the "primitive type => Writable" implicit APIs to make them be activated automatically

Try to redesign the "primitive type => Writable" implicit APIs to make them be activated automatically and without breaking compatibility.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark SPARK-4795

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3642.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3642

commit ac1e9038ce96b423885e8c6fe614162fb80c0dbf
Author: zsxwing <zsxw...@gmail.com>
Date: 2014-12-08T12:57:11Z

    Reorganize the rest 'implicit' methods in SparkContext
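To illustrate the mechanism being relied on, here is a simplified, hypothetical example (not the PR's actual code; the `WritableConverter` shape below is made up for this sketch): an implicit placed in the companion object of the type class it produces sits in the compiler's implicit scope, so it is found automatically, without the `import SparkContext._` that the old `intToIntWritable`-style conversions required.

```scala
import org.apache.hadoop.io.IntWritable

// A tiny converter type class; the name and shape are assumptions for this sketch.
class WritableConverter[T, W](val convert: T => W)

object WritableConverter {
  // Implicits in the companion object are in implicit scope everywhere,
  // so callers need no explicit import to activate them.
  implicit val intWritableConverter: WritableConverter[Int, IntWritable] =
    new WritableConverter(new IntWritable(_))
}

def toWritable[T, W](value: T)(implicit c: WritableConverter[T, W]): W =
  c.convert(value)

val w: IntWritable = toWritable(3) // resolved with no import at the call site
```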
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66252423

[Test build #24243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24243/consoleFull) for PR 3642 at commit [`ac1e903`](https://github.com/apache/spark/commit/ac1e9038ce96b423885e8c6fe614162fb80c0dbf).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66253299

[Test build #24244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24244/consoleFull) for PR 3642 at commit [`3009328`](https://github.com/apache/spark/commit/3009328ef946ce06a89ec8f05a3b831e05bf96e1).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66259222

[Test build #24243 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24243/consoleFull) for PR 3642 at commit [`ac1e903`](https://github.com/apache/spark/commit/ac1e9038ce96b423885e8c6fe614162fb80c0dbf).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `m => m.getReturnType().toString != "class java.lang.Object"`
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66259234

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24243/
Test FAILed.
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66260184

[Test build #24244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24244/consoleFull) for PR 3642 at commit [`3009328`](https://github.com/apache/spark/commit/3009328ef946ce06a89ec8f05a3b831e05bf96e1).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66260189

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24244/
Test FAILed.
[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...
Github user gvramana commented on a diff in the pull request: https://github.com/apache/spark/pull/2802#discussion_r21518099

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -172,6 +177,8 @@ private[hive] case class HiveGenericUdf(functionClassName: String, children: Seq

   override def eval(input: Row): Any = {
     returnInspector // Make sure initialized.
+    if (foldable) return constantReturnValue
--- End diff --

@chenghao-intel This PR cannot be merged separately from CreateArray, as percentile_approx accepts only a constant array iterator and fails otherwise. I think we can go ahead and merge these changes, as they don't break the build or tests and are not directly dependent on #3429 in merge order.
[GitHub] spark pull request: Replace breezeSquaredDistance
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/3643

Replace breezeSquaredDistance

This PR replaces the slow breezeSquaredDistance. A simple calculation involving 4 vectors of 2 dims shows:

* breezeSquaredDistance: ~12 secs
* This PR: ~10.5 secs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 faster_squareddistance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3643.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3643

commit 13669dbbf163bd1940d380388f94ad60237b10c0
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date: 2014-12-09T10:13:51Z

    Replace breezeSquaredDistance.
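The PR body doesn't show the replacement, but the usual way to beat a generic routine like `breezeSquaredDistance` is a direct while-loop over the underlying arrays. A minimal sketch of that idea (illustrative only; the actual patch also has to handle sparse vectors and numerical-precision concerns):

```scala
// Squared Euclidean distance over raw arrays: one pass, no temporary vectors.
def squaredDistance(v1: Array[Double], v2: Array[Double]): Double = {
  require(v1.length == v2.length, "vector dimensions must match")
  var sum = 0.0
  var i = 0
  while (i < v1.length) {
    val d = v1(i) - v2(i)
    sum += d * d
    i += 1
  }
  sum
}
```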
[GitHub] spark pull request: Replace breezeSquaredDistance
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3643#issuecomment-66262214

Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4483][SQL]Optimization about reduce mem...
Github user tianyi commented on a diff in the pull request: https://github.com/apache/spark/pull/3375#discussion_r21519366

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala ---
@@ -68,62 +68,59 @@ case class HashOuterJoin(
   @transient private[this] lazy val DUMMY_LIST = Seq[Row](null)
   @transient private[this] lazy val EMPTY_LIST = Seq.empty[Row]

+  @transient private[this] lazy val joinedRow = new JoinedRow()
--- End diff --

to @marmbrus I moved `joinedRow`, `rightNullRow`, and `boundCondition` to the beginning of `leftOuterIterator` and ran my benchmark again. The result is:

```
CSV: 9935 ms
CSV: 6989 ms
CSV: 7045 ms
Current Mem Usage: 508094480
```

Without the move, the result is:

```
CSV: 10590 ms
CSV: 7022 ms
CSV: 7086 ms
Current Mem Usage: 402470616
```

(I don't know why the memory cost is not around 208145728 as in my last test; maybe it's something related to my laptop environment.) Apparently the memory cost increased by 25%, while the speed is almost the same. Is that acceptable?
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/3642#discussion_r21520954

--- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/ShortestPathsSuite.scala ---
@@ -40,7 +40,7 @@ class ShortestPathsSuite extends FunSuite with LocalSparkContext {
     val graph = Graph.fromEdgeTuples(edges, 1)
     val landmarks = Seq(1, 4).map(_.toLong)
     val results = ShortestPaths.run(graph, landmarks).vertices.collect.map {
-      case (v, spMap) => (v, spMap.mapValues(_.get))
--- End diff --

This is an example of breaking source compatibility, although it used the implicit `intToIntWritable` only incidentally: `_.get` becomes `intToIntWritable(_).get`.
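To unpack the example: `spMap` maps vertices to plain `Int` distances, and `Int` has no `.get`, so `_.get` only ever compiled because the compiler inserted the old `intToIntWritable` view around each value. A self-contained illustration of that accidental dependence (hypothetical standalone code, not the test itself):

```scala
import scala.language.implicitConversions
import org.apache.hadoop.io.IntWritable

// The old SparkContext-style implicit view.
implicit def intToIntWritable(i: Int): IntWritable = new IntWritable(i)

val spMap = Map(1L -> 7)
// Compiles only because each Int value is silently wrapped:
// _.get becomes intToIntWritable(_).get, which round-trips back to Int.
val recovered = spMap.mapValues(_.get) // Map(1L -> 7)
```

Remove or relocate the implicit and this line stops compiling, which is exactly the source-compatibility break being pointed out.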
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66268347

[Test build #24245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24245/consoleFull) for PR 3642 at commit [`39343de`](https://github.com/apache/spark/commit/39343ded29a8576388ca98b32559ff8cae1cc5b4).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...
Github user gvramana commented on a diff in the pull request: https://github.com/apache/spark/pull/2802#discussion_r21524226

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---
@@ -165,6 +165,11 @@ private[hive] case class HiveGenericUdf(functionClassName: String, children: Seq
     isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector]

   @transient
+  protected lazy val constantReturnValue = unwrap(
--- End diff --

Yes, I have fixed that.
[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2802#issuecomment-66274239

[Test build #24246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24246/consoleFull) for PR 2802 at commit [`a0182e5`](https://github.com/apache/spark/commit/a0182e54c5ed97bb4b85c895434b4d626671ba58).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66277314

[Test build #24245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24245/consoleFull) for PR 3642 at commit [`39343de`](https://github.com/apache/spark/commit/39343ded29a8576388ca98b32559ff8cae1cc5b4).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3642#issuecomment-66277323

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24245/
Test PASSed.
[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3641#issuecomment-66278031

LGTM. Are there other instances like this in other scripts? It couldn't hurt to attach `$` elsewhere where the pattern is clearly meant to match at the end of a filename. Worth a look.
[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3635#discussion_r21526216

--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -67,11 +67,29 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkCon
     if (subDir == null) {
       subDir = subDirs(dirId).synchronized {
         val old = subDirs(dirId)(subDirId)
-        if (old != null) {
+        if (old != null && old.exists()) {
           old
         } else {
+          var foundLocalDir = false
+          var tries = 0
           val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
-          newDir.mkdir()
+          while (!foundLocalDir && tries < MAX_DIR_CREATION_ATTEMPTS) {
+            tries += 1
+            try {
+              if (!newDir.exists()) {
+                foundLocalDir = newDir.mkdir()
--- End diff --

Why keep retrying this? What's the case where it fails but then immediately succeeds?
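One plausible answer is a race: if two tasks try to create the same subdirectory, the loser's `mkdir()` returns false even though the directory now exists, so a bounded retry (or a re-check) recovers. A common idiom that captures this without a retry counter (an illustrative sketch, not the patch itself):

```scala
import java.io.{File, IOException}

// Succeeds if we created the directory, or if a concurrent creator beat us to it.
def createDirIfMissing(dir: File): Unit = {
  if (!dir.mkdir() && !dir.isDirectory) {
    throw new IOException(s"Failed to create directory $dir")
  }
}
```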
[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3635#discussion_r21526232 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---

```
@@ -67,11 +67,29 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkConf)
     if (subDir == null) {
       subDir = subDirs(dirId).synchronized {
         val old = subDirs(dirId)(subDirId)
-        if (old != null) {
+        if (old != null && old.exists()) {
           old
         } else {
+          var foundLocalDir = false
+          var tries = 0
           val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
-          newDir.mkdir()
+          while (!foundLocalDir && tries < MAX_DIR_CREATION_ATTEMPTS) {
+            tries += 1
+            try {
+              if (!newDir.exists()) {
+                foundLocalDir = newDir.mkdir()
+              }
+              else {
+                foundLocalDir = true
+              }
+            } catch {
+              case e: Exception =>
+                logWarning(s"Attempt $tries to create local dir $newDir failed", e)
+            }
+          }
+          if (!foundLocalDir) {
+            throw new Exception(s"Failed $MAX_DIR_CREATION_ATTEMPTS attempts to create local dir in $newDir.")
```

--- End diff --

`IOException` is more appropriate.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
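For reference, a minimal standalone sketch (a hypothetical helper, not the PR's final code) of what the loop might look like with both review points applied: the redundant flag dropped and an `IOException` thrown on exhaustion.

```scala
import java.io.{File, IOException}

// Hypothetical helper: create `dir`, retrying up to `maxAttempts` times,
// and fail with an IOException (per the review comment) if all attempts fail.
def createDirWithRetries(dir: File, maxAttempts: Int): Unit = {
  var tries = 0
  while (tries < maxAttempts) {
    tries += 1
    try {
      if (dir.exists() || dir.mkdir()) {
        return // directory exists or was created: done, no flag needed
      }
    } catch {
      case e: Exception =>
        println(s"Attempt $tries to create local dir $dir failed: $e")
    }
  }
  throw new IOException(s"Failed $maxAttempts attempts to create local dir $dir.")
}

createDirWithRetries(new File(System.getProperty("java.io.tmpdir"), "spark-test"), 3)
```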
[GitHub] spark pull request: [SPARK-2096][SQL] support dot notation on arbi...
Github user sziep commented on the pull request: https://github.com/apache/spark/pull/2405#issuecomment-66280104 Are there any plans to merge this soon? This is a pretty useful feature. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2096][SQL] support dot notation on arbi...
Github user ayoub-benali commented on the pull request: https://github.com/apache/spark/pull/2405#issuecomment-66280190 +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2802#issuecomment-66281289 [Test build #24246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24246/consoleFull) for PR 2802 at commit [`a0182e5`](https://github.com/apache/spark/commit/a0182e54c5ed97bb4b85c895434b4d626671ba58). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2802#issuecomment-66281294 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24246/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2096][SQL] support dot notation on arbi...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/2405#issuecomment-66282781 This PR is blocked by https://github.com/apache/spark/pull/2543. I'll update the code tomorrow and make it work :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...
Github user XuTingjun commented on the pull request: https://github.com/apache/spark/pull/3635#issuecomment-66283335 @srowen, thank you for your suggestion; I have modified the method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r21527840 --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---

```
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.beans.BeanInfo
+
+import org.apache.spark.annotation.AlphaComponent
+import org.apache.spark.mllib.linalg.Vector
+
+/**
+ * :: AlphaComponent ::
+ * Class that represents an instance (data point) for prediction tasks.
+ *
+ * @param label Label to predict
+ * @param features List of features describing this instance
+ * @param weight Instance weight
+ */
+@AlphaComponent
+@BeanInfo
+case class LabeledPoint(label: Double, features: Vector, weight: Double) {
```

--- End diff --

I mean the former. Yeah, that's probably the downside. Each data element is at least an object, and you can't have it reduce to a `double[]` under the hood. In the second example, I think you'd only ever really want `MixedFeatures` as an abstraction. There's no need to think of all `CategoricalFeatures` as a special case deserving a unique abstraction. I suppose if you abstract the entire training example as an object, and allow accessors like `getNumericFeature(index: Int)` and `getCategoricalFeature(index: Int)`, you can still internally optimize the storage while exposing a richer object representation. You get the type safety and the optimization opportunity. Sure, an `Array[Double]` could easily be translated into one of the more elaborate representations above. I suppose I myself wouldn't want to make it _too_ easy to not think about types! Anyway, up to your judgment really. There are arguments several ways here. Worth a thought to see if the idea of a bit more abstraction appeals to you.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
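A rough sketch of that idea (a toy, not the PR's code): abstract the training example behind the typed accessors mentioned above, while the backing storage stays primitive arrays and can be optimized internally.

```scala
// Hypothetical example type: typed accessors over storage that remains
// plain primitive arrays (and could be flattened further internally).
class MixedFeatures(
    numeric: Array[Double],
    categorical: Array[Int]) {
  def getNumericFeature(index: Int): Double = numeric(index)
  def getCategoricalFeature(index: Int): Int = categorical(index)
  def numNumeric: Int = numeric.length
  def numCategorical: Int = categorical.length
}

val example = new MixedFeatures(Array(1.5, 0.3), Array(2))
println(example.getNumericFeature(0))     // 1.5
println(example.getCategoricalFeature(0)) // 2
```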
[GitHub] spark pull request: [SPARK-4792] Add error message when making loc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3635#discussion_r21527997 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---

```
@@ -67,11 +67,20 @@ private[spark] class DiskBlockManager(blockManager: BlockManager, conf: SparkConf)
     if (subDir == null) {
       subDir = subDirs(dirId).synchronized {
         val old = subDirs(dirId)(subDirId)
-        if (old != null) {
+        if (old != null && old.exists()) {
           old
         } else {
+          var foundLocalDir = false
```

--- End diff --

OK, but this variable seems pointless now.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66286027 [Test build #24247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24247/consoleFull) for PR 3640 at commit [`74466a3`](https://github.com/apache/spark/commit/74466a3517996d837ba9f8acb6b912a82bbbd6df). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66287710 [Test build #24248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24248/consoleFull) for PR 3640 at commit [`8e13756`](https://github.com/apache/spark/commit/8e137565e34e36361faf90670ab8e98743051410). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4483][SQL]Optimization about reduce mem...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3375#discussion_r21529898 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala ---

```
@@ -68,62 +68,59 @@ case class HashOuterJoin(
   @transient private[this] lazy val DUMMY_LIST = Seq[Row](null)
   @transient private[this] lazy val EMPTY_LIST = Seq.empty[Row]
+  @transient private[this] lazy val joinedRow = new JoinedRow()
```

--- End diff --

I think you can try moving the `joinedRow` to right after https://github.com/apache/spark/pull/3375/files#diff-48230fdc68c8c172d22709ed90f8817dR193, not inside `def leftOuterIterator` etc.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
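A generic sketch of the allocation pattern under discussion (hypothetical simplified types, not Spark's `JoinedRow`): hoist the mutable row buffer out of the per-row path so each task reuses one instance instead of allocating per call.

```scala
// Hypothetical stand-in for JoinedRow: a mutable pair holder that is
// allocated once per task and reused for every output row.
final class JoinedPair {
  private var left: Any = _
  private var right: Any = _
  def withPair(l: Any, r: Any): this.type = { left = l; right = r; this }
  override def toString = s"($left, $right)"
}

def joinAll(lefts: Seq[Int], rights: Seq[String]): Iterator[String] = {
  val joined = new JoinedPair // hoisted out of the per-row loop, as suggested
  for (l <- lefts.iterator; r <- rights.iterator)
    yield joined.withPair(l, r).toString
}

joinAll(Seq(1, 2), Seq("a", "b")).foreach(println)
```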
[GitHub] spark pull request: [SPARK-3891][SQL] Add array support to percent...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/2802#discussion_r21530532 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala ---

```
@@ -172,6 +177,8 @@ private[hive] case class HiveGenericUdf(functionClassName: String, children: Seq[Expression])
   override def eval(input: Row): Any = {
     returnInspector // Make sure initialized.
+    if(foldable) return constantReturnValue
```

--- End diff --

I've updated #3429; I think this PR can be simpler after #3429 is merged. Besides, invoking `foldable` within `eval` is probably too heavy, and such constant expressions are supposed to be eliminated in the `Optimizer`.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
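A toy sketch of the point about the `Optimizer` (a hypothetical expression tree, not Catalyst's API): folding a constant subtree once in an optimizer pass beats checking `foldable` on every `eval` call.

```scala
sealed trait Expr { def eval(): Int; def foldable: Boolean }
case class Lit(v: Int) extends Expr { def eval() = v; val foldable = true }
case class Add(l: Expr, r: Expr) extends Expr {
  def eval() = l.eval() + r.eval()
  val foldable = l.foldable && r.foldable
}

// Optimizer-style pass: evaluate foldable subtrees once, up front,
// so the per-row eval path never pays for the foldable check.
def constantFold(e: Expr): Expr = e match {
  case Add(l, r) =>
    val folded = Add(constantFold(l), constantFold(r))
    if (folded.foldable) Lit(folded.eval()) else folded
  case other => other
}

val plan = Add(Lit(1), Lit(2))
println(constantFold(plan)) // Lit(3): no work left at eval time
```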
[GitHub] spark pull request: [SPARK-4699][SQL] make caseSensitive configura...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/3558#issuecomment-66291005 That's a good point. Can we make another interface that SqlConf inherits from? Or can we move SqlConf into the `Catalyst` module? Any other ideas? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4793] [Deploy] ensure .jar at end of li...
Github user adrian-wang commented on the pull request: https://github.com/apache/spark/pull/3641#issuecomment-66291537 I have checked those files and found another one at https://github.com/adrian-wang/spark/blob/jar/bin/compute-classpath.sh#L111 Do you think I should also change that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r21531881 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala ---

```
@@ -39,23 +39,34 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf)
   var appName: String = "Spark"
   var priority = 0
 
-  // Additional memory to allocate to containers
-  // For now, use driver's memory overhead as our AM container's memory overhead
-  val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
-    math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))
-
-  val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead",
-    math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN))
-
   private val isDynamicAllocationEnabled =
     sparkConf.getBoolean("spark.dynamicAllocation.enabled", false)
 
   parseArgs(args.toList)
+
+  val isClusterMode = userClass != null
+
   loadEnvironmentArgs()
   validateArgs()
 
+  // Additional memory to allocate to containers. In different modes, we use different configs.
+  val amMemOverheadConf = if (isClusterMode) {
+    "spark.yarn.driver.memoryOverhead"
+  } else {
+    "spark.yarn.am.memoryOverhead"
```

--- End diff --

We should make the name of this consistent with what we decide on for naming in https://github.com/apache/spark/pull/3409 (i.e. `spark.yarn.clientmode.am...`).

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66292305 @jkbradley Tests fail again... Stab in the dark: it looks like something has changed in the testing environment.

(2) Regular and Robust in the same class
It's possible to implement, but I don't want to turn the class hierarchy inside out. It just violates OOP principles. That's like a car inheriting from an SUV.

(4) Float vs. Double and linear algebra operations
I'd prefer (a). I don't like the idea of converting from Array to a Breeze matrix and vice versa.

(5) Glad to read it. Unfortunately, it looks like I mirrored the Spark repo a while ago, so I've got no mllib/features package. Should I mirror the Spark repo again and send you another pull request? Or is there a way to pull from the Spark repo? Looking forward to your reply.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66295835 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24247/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66295827 [Test build #24247 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24247/consoleFull) for PR 3640 at commit [`74466a3`](https://github.com/apache/spark/commit/74466a3517996d837ba9f8acb6b912a82bbbd6df). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HiveFunctionWrapper(var functionClassName: String) extends java.io.Serializable ` * `class HiveFunctionWrapper(var functionClassName: String) extends java.io.Externalizable ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user chazchandler commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66296113 re: (5) you can add a remote:

    git remote add upstream https://github.com/apache/spark.git

fetch the latest state:

    git fetch upstream

you'll see the branch/tag/etc. changes scroll by. Then rebase your branch on upstream:

    git rebase upstream [branch_name_you_want_to_use_as_a_new_base]

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66297896 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24248/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66297888 [Test build #24248 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24248/consoleFull) for PR 3640 at commit [`8e13756`](https://github.com/apache/spark/commit/8e137565e34e36361faf90670ab8e98743051410). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HiveFunctionWrapper(var functionClassName: String) extends java.io.Serializable ` * `class HiveFunctionWrapper(var functionClassName: String) extends java.io.Externalizable ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66299321 @chazchandler, thank you very much for your quick reply! It did the trick. Now I'm a bit confused about the ml/ folder. What's it for? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66306112 It seems like something went wrong. I've got multiple compilation errors like

```
[error] /home/valerij/contribute/spark/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala:299: constructor cannot be instantiated to expected type;
[error]  found   : org.apache.spark.ui.jobs.TaskUIData
[error]  required: org.apache.spark.ui.jobs.UIData.TaskUIData
```

How can I fix it?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4483][SQL]Optimization about reduce mem...
Github user tianyi commented on a diff in the pull request: https://github.com/apache/spark/pull/3375#discussion_r21538764 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala ---

```
@@ -68,62 +68,59 @@ case class HashOuterJoin(
   @transient private[this] lazy val DUMMY_LIST = Seq[Row](null)
   @transient private[this] lazy val EMPTY_LIST = Seq.empty[Row]
+  @transient private[this] lazy val joinedRow = new JoinedRow()
```

--- End diff --

Thanks, @chenghao-intel. I will try this way tomorrow. My concern is about memory cost: I think we should use as little memory as possible to reduce the chance of an OOM occurring.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/3644 [SPARK-4798][SQL] A new set of Parquet testing API and test suites

This PR provides a Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API is added and aims to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, the old testing code is not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:

- `ParquetQuerySuite`
- `ParquetTestData`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark parquet-tests

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3644.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3644

commit 83edb00c3c88b992356718213ad1199b95187ff6
Author: Cheng Lian <l...@databricks.com>
Date: 2014-12-09T13:00:09Z

    Adds a new set of Parquet test suites

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3644#discussion_r21538934 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

```
@@ -181,6 +181,10 @@ private[sql] trait SQLConf {
    */
   def getAllConfs: immutable.Map[String, String] = settings.synchronized { settings.toMap }
 
+  private[spark] def unsetConf(key: String) {
+    settings -= key
+  }
+
```

--- End diff --

Used in `ParquetTest.withSparkConf`.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
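A hedged sketch of the loan-pattern idea behind helpers like `withSparkConf` (the map-backed `Conf` below is a stand-in, not Spark's `SQLConf`): apply settings for the duration of a block, then restore or unset them in a `finally`.

```scala
import scala.collection.mutable

// Stand-in for SQLConf: a mutable key/value store with an unset operation.
class Conf {
  private val settings = mutable.Map.empty[String, String]
  def set(key: String, value: String): Unit = settings(key) = value
  def get(key: String): Option[String] = settings.get(key)
  def unset(key: String): Unit = settings -= key
}

/** Run `body` with `pairs` applied to `conf`, restoring prior values after. */
def withConf(conf: Conf)(pairs: (String, String)*)(body: => Unit): Unit = {
  val previous = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf.set(k, v) }
  try body finally previous.foreach {
    case (k, Some(v)) => conf.set(k, v) // restore the old value
    case (k, None)    => conf.unset(k)  // key wasn't set before: unset it
  }
}

val conf = new Conf
withConf(conf)("spark.sql.parquet.compression.codec" -> "gzip") {
  println(conf.get("spark.sql.parquet.compression.codec")) // Some(gzip)
}
println(conf.get("spark.sql.parquet.compression.codec"))   // None again
```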
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66309267 [Test build #24249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24249/consoleFull) for PR 3644 at commit [`83edb00`](https://github.com/apache/spark/commit/83edb00c3c88b992356718213ad1199b95187ff6). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66309244 Re: (2) Regular and Robust in the same class

> It's possible to implement, but I don't want to turn the class hierarchy inside out. It just violates OOP principles. That's like a car inheriting from an SUV.

In an OO world, this might lead to an abstract base class (e.g. a vehicle, which both car and SUV inherit from). Maybe the same thing applies here? That would provide a way to apply this suggestion:

> Would it work to have the regular parameters inherit from the robust, where the regular would override certain behavior to effectively fix the value of the noise?

...and still maintain a sensible class hierarchy.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
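A tiny sketch of that suggestion (all names hypothetical): an abstract base carries the shared parameters, the robust variant adds the noise weight, and the regular variant is the robust one with the noise pinned to zero.

```scala
// Hypothetical hierarchy: shared abstract base, robust variant with noise,
// regular variant expressed as robust with the noise fixed at zero.
abstract class ModelParameters(val numTopics: Int) {
  def noiseWeight: Double
}

class RobustParameters(numTopics: Int, val noiseWeight: Double)
  extends ModelParameters(numTopics)

class RegularParameters(numTopics: Int)
  extends RobustParameters(numTopics, noiseWeight = 0.0)

println(new RegularParameters(10).noiseWeight) // 0.0: noise effectively off
```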
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user chazchandler commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66311502 @akopich , rebasing can be tricky, especially if you've been off on a branch for a while. `git reflog` can be helpful in getting back to previous states if you end up not being able to solve the issue. however you'll probably need to solve the issue to get things merged in anyway. my guess is that something about `org.apache.spark.ui.jobs` may have changed; one option might be to rebase on branch-1.1 or -1.2 as a first step to limit the possibility of an unstable master affecting your code integration. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66312060 [Test build #24250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24250/consoleFull) for PR 3644 at commit [`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66314310 [Test build #24250 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24250/consoleFull) for PR 3644 at commit [`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ParquetTest ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66314319 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24250/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66318526 @dbtsai

1) Could you elaborate on what kinds of optimization you did? They could probably be applied to the broader MLlib, which would be beneficial.
2) Do you know why our ANN implementation worked faster than the MLOR you shared? This could also be interesting in terms of MLlib optimization.
3) Did you mean fitting an n-th degree polynomial instead of a linear function? Thanks for the link, it seems very interesting!

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66319557 The current design of this PR is derived from some background knowledge; I'd like to provide a brief summary here for future reference. As mentioned in the PR description, unlike Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. `GenericUDF` doesn't implement `Serializable`). Hive 0.13.1 solves this issue with a Kryo or XML serializer; several utility ser/de methods are provided in class `o.a.h.h.q.e.Utilities` for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created by Hive; Spark's Kryo serializer wasn't used because there's no `SparkConf` instance available. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
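A skeletal sketch of the wrapper pattern summarized above. As a stand-in, plain Java serialization is used for the instance bytes; the PR itself goes through Hive's Kryo utilities precisely because Java serialization cannot handle every Hive function class, so treat `serializeFunction`/`deserializeFunction` as placeholders, not Hive's API.

```scala
import java.io._

// Placeholder ser/de helpers, standing in for Hive's Kryo-based utilities.
def serializeFunction(f: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(f)
  oos.close()
  bos.toByteArray
}

def deserializeFunction(bytes: Array[Byte]): AnyRef =
  new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject()

class HiveFunctionWrapper(var functionClassName: String) extends Externalizable {
  def this() = this(null) // no-arg constructor required by Externalizable

  @transient private var instance: AnyRef = null

  // Instantiate the function once (driver side) and cache it afterwards.
  def createFunction[UDFType <: AnyRef](): UDFType = {
    if (instance == null) {
      instance = Class.forName(functionClassName).newInstance().asInstanceOf[AnyRef]
    }
    instance.asInstanceOf[UDFType]
  }

  // Ship both the class name and the initialized instance to executors.
  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeUTF(functionClassName)
    val bytes = serializeFunction(createFunction[AnyRef]())
    out.writeInt(bytes.length)
    out.write(bytes)
  }

  override def readExternal(in: ObjectInput): Unit = {
    functionClassName = in.readUTF()
    val bytes = new Array[Byte](in.readInt())
    in.readFully(bytes)
    instance = deserializeFunction(bytes)
  }
}
```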
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66320878 @jkbradley Thank you! They took some time.
- I totally agree with you; I need to perform tests on the original test set. It contains fewer attributes (778 vs. 784 in mnist8m), so one needs to pad it with zeros to make it compatible.
- They both ran all 40 iterations.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66321327 @karlhigley, yes, I've heard something about abstract classes. Though I see no way to employ this concept here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66321392 [Test build #24249 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24249/consoleFull) for PR 3644 at commit [`83edb00`](https://github.com/apache/spark/commit/83edb00c3c88b992356718213ad1199b95187ff6). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ParquetTest ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66321407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24249/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66324190 Although it passed Jenkins, the first failure is rather weird. It seems that partitions collected via `SchemaRDD.collect()` can sometimes be out of order. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66324222 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4799] Use IP address instead of local h...
GitHub user smola opened a pull request: https://github.com/apache/spark/pull/3645 [SPARK-4799] Use IP address instead of local hostname in ConnectionManager

See https://issues.apache.org/jira/browse/SPARK-4799

Spark fails when a node's hostname is not resolvable by other nodes. See an example trace:

```
14/12/09 17:02:41 ERROR SendingConnection: Error connecting to 27e434cf36ac:35093
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:127)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
        at org.apache.spark.network.SendingConnection.connect(Connection.scala:299)
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:278)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
```

The relevant code is here: https://github.com/apache/spark/blob/bcb5cdad614d4fce43725dfec3ce88172d2f8c11/core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala#L170

```
val id = new ConnectionManagerId(Utils.localHostName, serverChannel.socket.getLocalPort)
```

This piece of code should use the host IP via Utils.localIpAddress, or a method that acknowledges user settings (e.g. `SPARK_LOCAL_IP`). Since I cannot think of a use case for using the hostname here, I'm creating a PR with the former solution, but if you think the latter is better, I'm willing to create a new PR with a more elaborate fix.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/smola/spark SPARK-4799

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3645.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3645

commit 45d0356241b852c575fb3b47a20c072c6ef3d7bf
Author: Santiago M. Mola <sm...@stratio.com>
Date: 2014-11-25T15:02:14Z

    [SPARK-4799] Use IP address instead of local hostname in ConnectionManager.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
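A standalone illustration of the failure mode (the hostname is taken from the trace above; results depend on your DNS setup):

```scala
import java.net.InetAddress

// A bare container hostname that peers cannot resolve throws
// UnknownHostException, while a concrete IP address always checks out.
def canResolve(host: String): Boolean =
  try { InetAddress.getByName(host); true }
  catch { case _: java.net.UnknownHostException => false }

println(canResolve("27e434cf36ac")) // likely false outside the container
println(canResolve("127.0.0.1"))    // true
```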
[GitHub] spark pull request: [SPARK-4799] Use IP address instead of local h...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3645#issuecomment-66324983 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66325520 [Test build #24251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24251/consoleFull) for PR 3644 at commit [`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66325635 @chazchandler, thank you very much for your help. I shouldn't have rebased on master; rebasing on 1.2 was successful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/3536#issuecomment-66331985 I'd agree that cosine similarity is preferred. I can't really think of a case where I've *not* used cosine sim for a similar-items or similar-users computation. Of course, it could potentially be added as an option (whether to use cosine sim, the default, or the plain dot product). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
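For reference, a minimal sketch of the two scores under discussion, over plain arrays rather than MLlib factor vectors:

```scala
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Cosine similarity: the dot product normalized by both vector norms,
// so large-norm (e.g. popular-item) factors don't dominate the ranking.
def cosine(a: Array[Double], b: Array[Double]): Double =
  dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

val u = Array(1.0, 2.0)
val v = Array(2.0, 4.0)
println(dot(u, v))    // 10.0
println(cosine(u, v)) // 1.0: same direction, scale ignored
```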
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3640 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4785] [SQL] Support udf instance ser/de...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3640#issuecomment-66332833 Thanks a lot guys for digging into this! Merged to master and branch 1.2 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-874] adding a --wait flag
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3567#issuecomment-66333686 Jenkins, test this please. LGTM pending tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-874] adding a --wait flag
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3567#issuecomment-66334417 [Test build #24252 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24252/consoleFull) for PR 3567 at commit [`d05c5bb`](https://github.com/apache/spark/commit/d05c5bb952fc3f29f6f89db2c5a7dba4da7a5d4f). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4791] [sql] Infer schema from case clas...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/3646 [SPARK-4791] [sql] Infer schema from case class with multiple constructors

Modified ScalaReflection.schemaFor to take the primary constructor of a Product when there are multiple constructors. Added a test to the suite which failed before but works now.

Needed for [https://github.com/apache/spark/pull/3637]

CC: @marmbrus

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark sql-reflection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3646.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3646

commit 796b2e4ffc562154c0511dc660c9cd4b3f8ac07c
Author: Joseph K. Bradley <jos...@databricks.com>
Date: 2014-12-09T18:35:28Z

    Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
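A sketch of the reflection step described above (hypothetical `Point` class; assumes Scala 2.11+ runtime reflection, and is not the PR's actual code):

```scala
import scala.reflect.runtime.universe._

case class Point(x: Int, y: Int) {
  def this(x: Int) = this(x, 0) // auxiliary constructor creates the ambiguity
}

// Among the overloaded constructor symbols, pick the primary one instead of
// failing (or choosing arbitrarily) when several constructors exist.
val ctors = typeOf[Point].decl(termNames.CONSTRUCTOR).asTerm.alternatives
val primary = ctors.collectFirst {
  case m: MethodSymbol if m.isPrimaryConstructor => m
}.get

println(primary.paramLists.head.map(_.name)) // List(x, y)
```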
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66336110 @avulanov

1. I did the same optimizations for MLlib in [my recent PRs](https://github.com/apache/spark/commits/master?author=dbtsai).
   * Accessing the values in a dense/sparse vector directly is very slow without a local reference to the primitive array, due to the dereferencing. See #3577 and #3435. There is a bytecode analysis of this issue in #3435.
   * Breeze's foreachActive is very slow, so I implemented a 4x faster version in #3288.
   My experience is that if Breeze is used in a critical code path, it has to be treated with caution.
2. I haven't checked out your ANN implementation yet, but I will today. I'll send you our optimized gradient computation code for MLOR. It will be interesting to see the new benchmark compared with the one you tested.
3. See page 27 of Prof. CJ Lin's slides: http://www.csie.ntu.edu.tw/~cjlin/talks/SFmeetup.pdf It's just doing feature expansion by mapping the data into a higher dimensional space.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
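A small sketch of the first point above (a simplified stand-in for MLlib's `DenseVector`, not its actual class): hoist the backing array into a local `val` before the hot loop instead of dereferencing through the wrapper on every access.

```scala
class DenseVector(val values: Array[Double]) {
  def apply(i: Int): Double = values(i)
}

def sumSlow(v: DenseVector): Double = {
  var s = 0.0
  var i = 0
  while (i < v.values.length) { s += v(i); i += 1 } // wrapper hop per access
  s
}

def sumFast(v: DenseVector): Double = {
  val a = v.values // one dereference, then raw primitive-array access
  var s = 0.0
  var i = 0
  while (i < a.length) { s += a(i); i += 1 }
  s
}

val v = new DenseVector(Array.fill(1000)(1.0))
println(sumSlow(v) == sumFast(v)) // true; sumFast avoids repeated dereference
```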
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66336150 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24251/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66336134 [Test build #24251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24251/consoleFull) for PR 3644 at commit [`ee17d7b`](https://github.com/apache/spark/commit/ee17d7b1be00be44ff10f83f640d9a058562c12e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ParquetTest ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3215#issuecomment-66336371 We actually don't need to wait on the RC to merge this since this is only going into master. I'll take a quick look and will likely merge it after that since this is pretty straightforward.
[GitHub] spark pull request: [SPARK-4791] [sql] Infer schema from case clas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3646#issuecomment-66336585 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24253/ Test FAILed.
[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3215#discussion_r21552599
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala ---
@@ -178,21 +178,25 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf)
   private def getUsageMessage(unknownParam: List[String] = null): String = {
     val message = if (unknownParam != null) s"Unknown/unsupported param $unknownParam\n" else ""
-    message +
-      "Usage: org.apache.spark.deploy.yarn.Client [options] \n" +
-      "Options:\n" +
-      "  --jar JAR_PATH             Path to your application's JAR file (required in yarn-cluster mode)\n" +
-      "  --class CLASS_NAME         Name of your application's main class (required)\n" +
-      "  --arg ARG                  Argument to be passed to your application's main class.\n" +
-      "                             Multiple invocations are possible, each will be passed in order.\n" +
-      "  --num-executors NUM        Number of executors to start (Default: 2)\n" +
-      "  --executor-cores NUM       Number of cores for the executors (Default: 1).\n" +
-      "  --driver-memory MEM        Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb)\n" +
-      "  --executor-memory MEM      Memory per executor (e.g. 1000M, 2G) (Default: 1G)\n" +
-      "  --name NAME                The name of your application (Default: Spark)\n" +
-      "  --queue QUEUE              The hadoop queue to use for allocation requests (Default: 'default')\n" +
-      "  --addJars jars             Comma separated list of local jars that want SparkContext.addJar to work with.\n" +
-      "  --files files              Comma separated list of files to be distributed with the job.\n" +
-      "  --archives archives        Comma separated list of archives to be distributed with the job."
+    message +
+      """
+      |Usage: org.apache.spark.deploy.yarn.Client [options]
+      |Options:
+      |  --jar JAR_PATH           Path to your application's JAR file (required in yarn-cluster
+      |                           mode)
+      |  --class CLASS_NAME       Name of your application's main class (required)
+      |  --arg ARG                Argument to be passed to your application's main class.
+      |                           Multiple invocations are possible, each will be passed in order.
+      |  --num-executors NUM      Number of executors to start (Default: 2)
+      |  --executor-cores NUM     Number of cores for the executors (Default: 1).
+      |  --driver-memory MEM      Memory for driver (e.g. 1000M, 2G) (Default: 512 Mb)
+      |  --executor-memory MEM    Memory per executor (e.g. 1000M, 2G) (Default: 1G)
+      |  --name NAME              The name of your application (Default: Spark)
+      |  --queue QUEUE            The hadoop queue to use for allocation requests (Default:
+      |                           'default')
+      |  --addJars jars           Comma separated list of local jars that want SparkContext.addJar
+      |                           to work with.
+      |  --files files            Comma separated list of files to be distributed with the job.
+      |  --archives archives      Comma separated list of archives to be distributed with the job.
+      """
--- End diff --
do you need to do `.stripMargin` here?
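For context on the question: `stripMargin` removes leading whitespace up to and including the `|` margin character from each line of a multiline string, so without it the usage text above would print with its indentation and pipes intact. A tiny self-contained illustration, with hypothetical strings rather than the PR's code:

```scala
object StripMarginDemo extends App {
  val usage =
    """
      |Usage: tool [options]
      |  --help    Show this message
    """.stripMargin
  // Without .stripMargin, every line would keep its leading
  // spaces and '|' prefix.
  print(usage)
}
```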
[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3215#discussion_r21552652
--- Diff: yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClientImpl.scala ---
@@ -1,118 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.deploy.yarn
-
-import scala.collection.{Map, Set}
-import java.net.URI
-
-import org.apache.hadoop.net.NetUtils
-import org.apache.hadoop.yarn.api._
-import org.apache.hadoop.yarn.api.records._
-import org.apache.hadoop.yarn.api.protocolrecords._
-import org.apache.hadoop.yarn.conf.YarnConfiguration
-import org.apache.hadoop.yarn.ipc.YarnRPC
-import org.apache.hadoop.yarn.util.{ConverterUtils, Records}
-
-import org.apache.spark.{Logging, SecurityManager, SparkConf}
-import org.apache.spark.scheduler.SplitInfo
-import org.apache.spark.util.Utils
-
-/**
- * YarnRMClient implementation for the Yarn alpha API.
- */
-private class YarnRMClientImpl(args: ApplicationMasterArguments) extends YarnRMClient with Logging {
-
-  private var rpc: YarnRPC = null
-  private var resourceManager: AMRMProtocol = _
-  private var uiHistoryAddress: String = _
-  private var registered: Boolean = false
-
-  override def register(
-      conf: YarnConfiguration,
-      sparkConf: SparkConf,
-      preferredNodeLocations: Map[String, Set[SplitInfo]],
-      uiAddress: String,
-      uiHistoryAddress: String,
-      securityMgr: SecurityManager) = {
-    this.rpc = YarnRPC.create(conf)
-    this.uiHistoryAddress = uiHistoryAddress
-
-    synchronized {
-      resourceManager = registerWithResourceManager(conf)
-      registerApplicationMaster(uiAddress)
-      registered = true
-    }
-
-    new YarnAllocationHandler(conf, sparkConf, resourceManager, getAttemptId(), args,
-      preferredNodeLocations, securityMgr)
-  }
-
-  override def getAttemptId() = {
-    val envs = System.getenv()
-    val containerIdString = envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV)
-    val containerId = ConverterUtils.toContainerId(containerIdString)
-    val appAttemptId = containerId.getApplicationAttemptId()
-    appAttemptId
-  }
-
-  override def unregister(status: FinalApplicationStatus, diagnostics: String = "") = synchronized {
-    if (registered) {
-      val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
-        .asInstanceOf[FinishApplicationMasterRequest]
-      finishReq.setAppAttemptId(getAttemptId())
-      finishReq.setFinishApplicationStatus(status)
-      finishReq.setDiagnostics(diagnostics)
-      finishReq.setTrackingUrl(uiHistoryAddress)
-      resourceManager.finishApplicationMaster(finishReq)
-    }
-  }
-
-  override def getAmIpFilterParams(conf: YarnConfiguration, proxyBase: String) = {
-    val proxy = YarnConfiguration.getProxyHostAndPort(conf)
-    val parts = proxy.split(":")
-    val uriBase = "http://" + proxy + proxyBase
-    Map("PROXY_HOST" -> parts(0), "PROXY_URI_BASE" -> uriBase)
-  }
-
-  override def getMaxRegAttempts(conf: YarnConfiguration) =
-    conf.getInt(YarnConfiguration.RM_AM_MAX_RETRIES, YarnConfiguration.DEFAULT_RM_AM_MAX_RETRIES)
-
-  private def registerWithResourceManager(conf: YarnConfiguration): AMRMProtocol = {
-    val rmAddress = NetUtils.createSocketAddr(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS,
-      YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS))
-    logInfo("Connecting to ResourceManager at " + rmAddress)
-    rpc.getProxy(classOf[AMRMProtocol], rmAddress, conf).asInstanceOf[AMRMProtocol]
-  }
-
-  private def registerApplicationMaster(uiAddress: String): RegisterApplicationMasterResponse = {
-    val appMasterRequest = Records.newRecord(classOf[RegisterApplicationMasterRequest])
-      .asInstanceOf[RegisterApplicationMasterRequest]
-
[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3215#issuecomment-66337338 Hey this LGTM. I'm just gonna merge this. My comments are minor enough that we can slip them in later.
[GitHub] spark pull request: SPARK-4338. [YARN] Ditch yarn-alpha.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3215
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-66338560 By the way, you will need to rebase onto master since we just removed the support for yarn-alpha.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r21553460
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala ---
@@ -39,23 +39,34 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf)
   var appName: String = "Spark"
   var priority = 0
 
-  // Additional memory to allocate to containers
-  // For now, use driver's memory overhead as our AM container's memory overhead
-  val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
-    math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))
-
-  val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead",
-    math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN))
-
   private val isDynamicAllocationEnabled =
     sparkConf.getBoolean("spark.dynamicAllocation.enabled", false)
 
   parseArgs(args.toList)
+
+  val isClusterMode = userClass != null
+
   loadEnvironmentArgs()
   validateArgs()
 
+  // Additional memory to allocate to containers. In different modes, we use different configs.
+  val amMemOverheadConf = if (isClusterMode) {
+    "spark.yarn.driver.memoryOverhead"
+  } else {
+    "spark.yarn.am.memoryOverhead"
--- End diff --
Hm... I agree that adding `clientmode` is clearer, but it's also incredibly verbose. I think if we document clearly that this config is only for client mode, the user won't arbitrarily try to set it. I'll comment on #3409 about this as well.
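The diff is cut off above, so this is an assumption about the follow-on code rather than a quote from the PR, but presumably the selected key is then fed into a single lookup, reusing the names from the diff:

```scala
// Hypothetical continuation of the diff above: one getInt lookup works for
// both deploy modes because the config key was chosen up front.
val amMemoryOverhead = sparkConf.getInt(amMemOverheadConf,
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))
```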
[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3409#issuecomment-66339357 Hey, regarding the naming, I would actually prefer `spark.yarn.am.*` over `spark.yarn.clientmode.am.*`. Although I agree that it's clearer to have clientmode in there, it's incredibly verbose and inconsistent with the other similar configs that we have (e.g. `spark.driver.extraJavaOptions`, `spark.yarn.executor.memoryOverhead`). If we want to discourage the user from setting these in cluster mode, I think the better approach is to log a warning or even throw an exception if this happens, instead of cramming `clientmode` into a config name that we can't ever change again. A sketch of that check follows below.
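To make the warn-or-throw alternative concrete, here is a minimal sketch; the helper name, the message text, and the exact list of client-mode-only keys are illustrative assumptions, not code from either PR:

```scala
import org.apache.spark.SparkConf

object AmConfigValidation {
  // Enforce the "client mode only" restriction at validation time instead of
  // encoding it in the config name itself.
  def validate(sparkConf: SparkConf, isClusterMode: Boolean): Unit = {
    val clientModeOnly = Seq("spark.yarn.am.memoryOverhead", "spark.yarn.am.extraJavaOptions")
    if (isClusterMode) {
      clientModeOnly.filter(sparkConf.contains).foreach { key =>
        // Could equally throw new IllegalArgumentException(...) to fail fast.
        System.err.println(s"Warning: $key is only honored in yarn-client mode; " +
          "use the corresponding spark.yarn.driver.* or spark.driver.* config instead.")
      }
    }
  }
}
```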